TVM int8 quantization slower than float on arm?

Platform: Pixel4 CPU

Network: ResNet18 on ImageNet from torchvision

Tuning: AutoTVM with XGBTuner for 1500 iterations

ResNet18 latency: 100ms

ResNet18 int8 quantized: 180ms

I posted the code here:

```
python -m baseline.tuning_main --quantize --target arm
```
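
For reference, the quantization step follows TVM's Relay post-training quantization, roughly along these lines (a simplified sketch; the `calibrate_mode` and `global_scale` values here are illustrative, see `tuning_main.py` for the exact settings):

```python
# Minimal sketch of TVM's post-training int8 quantization.
# calibrate_mode and global_scale are assumptions here;
# baseline/tuning_main.py may configure this differently.
import tvm
from tvm import relay
from tvm.relay import testing

# Stand-in workload; the real script presumably converts ResNet18
# from torchvision via relay.frontend.from_pytorch.
mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)

with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    mod = relay.quantize.quantize(mod, params=params)
```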

I also notice similar behavior on a Raspberry Pi (Arm chip). Have you tried AutoTVM instead of AutoScheduler?

Is this the target string you used? https://github.com/tigert1998/tvm-models-baseline/blob/main/baseline/tuning_main.py#L49

I'd try something like

```python
target = "llvm -device=arm_cpu -mtriple=aarch64-linux-android -mattr=+v8.2a,+dotprod"
```

and yes, use AutoTVM instead of AutoScheduler :slight_smile:
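
For concreteness, tuning with that target would look roughly like this (a sketch; the device key, RPC tracker address, and log file name are placeholders for your setup):

```python
# Rough sketch of AutoTVM tuning with the dotprod-enabled target.
# The device key ("pixel4"), tracker host/port, and log file name
# are placeholders; adjust them for your RPC setup.
import tvm
from tvm import autotvm, relay
from tvm.relay import testing

# Stand-in workload; substitute the quantized ResNet18 module.
mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)

target = tvm.target.Target(
    "llvm -device=arm_cpu -mtriple=aarch64-linux-android -mattr=+v8.2a,+dotprod"
)

tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.RPCRunner("pixel4", host="127.0.0.1", port=9190),
)

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=1500,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("resnet18_int8.log")],
    )
```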

I would also explore changing the default layout. I have found that for some models, NHWC yields better results for int8 than NCHW.
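
A minimal sketch of what that would look like with Relay's ConvertLayout pass (the pass list and opt level are just one reasonable choice):

```python
# Sketch: convert conv2d data layout to NHWC before quantizing/tuning.
# "default" lets TVM pick the matching kernel layout; whether NHWC wins
# depends on the int8 schedules available for your target.
import tvm
from tvm import relay
from tvm.relay import testing

mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)

seq = tvm.transform.Sequential([
    relay.transform.RemoveUnusedFunctions(),
    relay.transform.ConvertLayout({"nn.conv2d": ["NHWC", "default"]}),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
```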

Thank you so much! You are right!