Optimizing expand_dims & transpose operator / accessing source of fused operator

Hello,

I encountered a situation with a model that, upon compilation, has a fused operator called expand_dims_transpose, which converts a tensor of shape 8x192 to 8x1x192. When profiled standalone on an ARMv7 device (i.e., with no other model running concurrently), it took ~120us, but when run alongside 3 other models (within an app) it takes ~2678us. Most other operators do not deteriorate this much. I am trying to understand the cause of this ~20x increase in the time consumed by this op, and in this context I have some questions for the community:

  • Are there any Relay passes/optimizations in TVM that could optimize this operator? Since it is only converting a tensor from 8x192 to 8x1x192, could it be done more efficiently? (See the first sketch after this list.)
  • Is it possible to access the intermediate source for this op at compilation time? I did try passing -save-temps while exporting the lib to .so, but I couldn't find any temp files. (See the second sketch after this list.)
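On the first question, here is a minimal experiment I am considering (a sketch only: I am assuming the fused pattern is an expand_dims followed by a transpose that together amount to a pure reshape, and the exact axes are guesses), using relay.transform.SimplifyExpr to check whether the pair can be folded into a single reshape:

```python
import tvm
from tvm import relay

# Reproduce the suspected pattern: expand_dims (8x192 -> 1x8x192)
# followed by a transpose to 8x1x192 -- together a pure reshape.
x = relay.var("x", shape=(8, 192), dtype="float32")
y = relay.transpose(relay.expand_dims(x, axis=0), axes=(1, 0, 2))
mod = tvm.IRModule.from_expr(relay.Function([x], y))

# SimplifyExpr includes transpose/reshape simplification patterns;
# whether it fires on this pattern may depend on the TVM version.
mod = relay.transform.InferType()(mod)
mod = relay.transform.SimplifyExpr()(mod)
print(mod)
```

If this rewrites the pair into a single reshape (which for these shapes is layout-preserving), I would expect the op to become much cheaper, but I have not confirmed this.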
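On the second question, instead of -save-temps I am wondering whether the generated code can be pulled from the module directly, something like the following (a sketch assuming lib is the output of relay.build and that get_source on the underlying LLVM runtime module works as described in the runtime.Module docs; mod/params are the already-imported Relay model):

```python
import tvm
from tvm import relay

target = "llvm -mtriple=armv7a-linux-android -mfloat-abi=soft"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# The graph JSON lists the fused operator names
# (e.g. fused_expand_dims_transpose).
print(lib.get_graph_json())

# LLVM IR for the compiled operators; "asm" gives assembly instead.
print(lib.get_lib().get_source("ll"))
```

Would this show the code actually generated for the fused op, or is there a better way?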

Currently, I am not explicitly applying any Relay passes (hence the first question above), and I am exporting the lib to .so with the following options:

"options":["-O2", "-s", "-std=c++14", "-fPIC", "-fpie","-static-libstdc++"]

"target":"llvm -mtriple=armv7a-linux-android -mfloat-abi=soft"

Thanks.