Hello,
I encountered a situation with a model that, upon compilation, has a fused operator called `expand_dims_transpose`, which converts a tensor of shape `8x192` to `8x1x192`. When profiled standalone (i.e., no other model running concurrently) on an ARMv7 device, it took ~120 µs, but when run alongside 3 other models (within an app) it takes ~2678 µs. Most other operators don't deteriorate nearly as much. I am trying to understand the cause of this ~20x increase in the time consumed by this op, and in that context I have some questions for the community:
- Are there any Relay passes/optimizations in TVM that could optimize this operator? Since it only converts a tensor from `8x192` to `8x1x192`, could it be done more efficiently? (See the rewrite sketch after this list.)
- Is it possible to access the intermediate source for this op at the time of compilation?
  - I did try passing `-save-temps` while exporting the lib to a `.so`, but I couldn't find any temp files.
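
To make the first question concrete, here is a minimal sketch of the kind of rewrite I have in mind, using Relay's dataflow-pattern API. The axes in the toy usage below are guesses, since I don't know the exact axes inside the fused pattern; the rewrite is only semantics-preserving when the transpose merely moves the inserted size-1 axis, in which case the whole chain is a pure layout change that a single `reshape` can express:

```python
import tvm
from tvm import relay
from tvm.relay.dataflow_pattern import DFPatternCallback, is_op, wildcard, rewrite

class ExpandDimsTransposeToReshape(DFPatternCallback):
    """Rewrite transpose(expand_dims(x)) into a single reshape."""

    def __init__(self):
        super().__init__()
        self.inp = wildcard()
        self.pattern = is_op("transpose")(is_op("expand_dims")(self.inp))

    def callback(self, pre, post, node_map):
        # Assumption: the matched chain is a pure layout change on a
        # contiguous tensor (the permutation only moves the size-1 axis),
        # so a reshape to the final shape is equivalent and nearly free.
        new_shape = [int(d) for d in pre.checked_type.shape]
        return relay.reshape(node_map[self.inp][0], new_shape)

# Toy usage: 8x192 -> expand_dims -> transpose -> 8x1x192.
x = relay.var("x", shape=(8, 192), dtype="float32")
y = relay.transpose(relay.expand_dims(x, axis=0), axes=(1, 0, 2))
mod = tvm.IRModule.from_expr(relay.Function([x], y))
mod = relay.transform.InferType()(mod)  # populate checked_type
mod["main"] = rewrite(ExpandDimsTransposeToReshape(), mod["main"])
print(mod)  # main should now contain a single reshape
```

Is something along these lines reasonable, or does an existing pass (e.g., `SimplifyExpr`) already cover this case?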
Currently, I am not explicitly taking advantage of any Relay passes (hence the first question above), and I am exporting the lib to a `.so` with the following options:
"options":["-O2", "-s", "-std=c++14", "-fPIC", "-fpie","-static-libstdc++"]
"target":"llvm -mtriple=armv7a-linux-android -mfloat-abi=soft"
Thanks.