OK, I found the root cause. It seems my setup does not accommodate quantized models very well. For a float model, the actual runtime is much faster than the estimate.
Does anyone have experience auto-scheduling a quantized model?
The runtime estimation in auto_scheduler only sums the individual operator runtimes measured on device, and by default it does not flush CPU caches when running the operators.
There are two points you could look at:
Setting enable_cpu_cache_flush to True in the CPU runner
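For reference, here is a minimal sketch of how that flag is passed in, assuming a recent TVM where tvm.auto_scheduler.LocalRunner accepts enable_cpu_cache_flush (the trial counts and the log file name are just illustrative):

```python
from tvm import auto_scheduler

# Flush CPU caches between measurements so each operator is timed with a
# cold cache; this usually brings the per-operator estimates closer to
# end-to-end runtime. The flag is off by default.
runner = auto_scheduler.LocalRunner(
    repeat=10,
    min_repeat_ms=300,
    enable_cpu_cache_flush=True,
)

tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,  # illustrative value
    runner=runner,
    measure_callbacks=[auto_scheduler.RecordToFile("tuning.json")],
)
```

Note that flushing the cache makes each measurement slower, so tuning takes longer overall.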