[AutoScheduler] Huge gap between the estimated model latency and actual benchmark latency

Hi community,

I am new to TVM. I am using AutoScheduler to tune a quantized MobileNet V2 TFLite model, targeting a Snapdragon 660 on Android devices.

Here is my target string: `llvm -device=arm_cpu -mtriple=arm64-linux-android -mattr=+neon`

I am compiling with the NDK's `aarch64-linux-android-clang++`.
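
For reference, here is roughly how I build and export the model (a minimal sketch, assuming `mod` and `params` come from the TFLite frontend; the log and library file names are placeholders):

```python
import tvm
from tvm import relay, auto_scheduler
from tvm.contrib import ndk

# `mod` and `params` come from relay.frontend.from_tflite on the quantized model (not shown).
target = tvm.target.Target("llvm -device=arm_cpu -mtriple=arm64-linux-android -mattr=+neon")

# Apply the auto_scheduler tuning records and build for the Android target.
with auto_scheduler.ApplyHistoryBest("mobilenet_v2_quant_tuning.json"):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)

# TVM_NDK_CC points to the NDK's aarch64-linux-android-clang++.
lib.export_library("mobilenet_v2_quant.so", ndk.create_shared)
```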

The number of trials is set to 20000 for the 36 tasks. Here is the last line of the log:

`Estimated total latency: 39.393 ms Trials: 98 Used time : 3294 s Next ID: 21`

From the above log, the end-to-end latency should be around 40 ms; however, when I run the model on the device, it takes about 140 ms.
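
For reference, here is roughly how the tuning itself is set up (a sketch; the RPC tracker address, device key, and log file name are placeholders):

```python
from tvm import auto_scheduler

# Extract the tuning tasks from the Relay module (36 tasks for this model).
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

log_file = "mobilenet_v2_quant_tuning.json"

# Cross-compile the measurement binaries with the NDK and run them on the
# phone through an RPC tracker.
builder = auto_scheduler.LocalBuilder(build_func="ndk")
runner = auto_scheduler.RPCRunner(
    key="android", host="127.0.0.1", port=9190, repeat=3, min_repeat_ms=200
)

tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=20000,  # total budget shared across all 36 tasks
    builder=builder,
    runner=runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)

auto_scheduler.TaskScheduler(tasks, task_weights).tune(tune_option)
```

After tuning, the model is rebuilt with the records applied, as in the build sketch above, and that is the binary I measure on the device.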

Does anyone have any clues about what is going wrong?

Thanks a lot!

OK, I found the root cause. It seems like my tuning setup does not handle the quantized model very well. For the float model, the actual runtime is much faster than the estimate.

Does anyone have experience auto-scheduling a quantized model?

The runtime estimation in auto_scheduler only sums the individual operator runtimes measured on the device, and it does not flush the CPU caches by default when running the operators.

There are two points you could look at:

  1. Set `enable_cpu_cache_flush` to `True` in the CPU runner (see the runner sketch below).
  2. To look at individual runtimes, you can use the `profile` method of `GraphModuleDebug`: https://github.com/apache/tvm/blob/61b66cd112960ff27972008b5b62d68822719966/python/tvm/contrib/debugger/debug_executor.py#L284 (see the profiling sketch below). For small operators, I often notice a huge amount of time spent in the runtime and not in the actual operators.
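
For the first point, the flag is just an extra keyword argument on the runner. A minimal sketch, assuming an RPC tracker setup like the one in the question (key, host, and port are placeholders):

```python
from tvm import auto_scheduler

# Flushing the CPU caches before each measurement makes the per-operator
# timings more pessimistic, but usually much closer to end-to-end behaviour.
runner = auto_scheduler.RPCRunner(
    key="android",
    host="127.0.0.1",
    port=9190,
    repeat=3,
    min_repeat_ms=200,
    enable_cpu_cache_flush=True,
)
```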
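
For the second point, a rough sketch of getting the per-operator breakdown over RPC, assuming the compiled library was exported as `mobilenet_v2_quant.so` and `lib` is the host-side factory module from `relay.build` (tracker address and device key are again placeholders):

```python
from tvm import rpc
from tvm.contrib.debugger import debug_executor

# Connect to the phone through the RPC tracker and load the cross-compiled library.
tracker = rpc.connect_tracker("127.0.0.1", 9190)
remote = tracker.request("android")
remote.upload("mobilenet_v2_quant.so")
rlib = remote.load_module("mobilenet_v2_quant.so")

# Create the debug executor on the remote CPU.
gmod = debug_executor.create(lib.get_graph_json(), rlib, remote.cpu(0))
# Feed a real input with gmod.set_input(...) before profiling for meaningful numbers.
print(gmod.profile())
```

The report breaks the total time down per operator, so you can see whether the gap comes from a few slow kernels or from runtime overhead.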