TVM INT8 quantization does not provide 4x speedup over float on ARM CPU?

tigertang · October 22, 2022, 10:55am

I found that INT8 quantization does not provide a 4x speedup over float inference with TVM. This is strange, because INT8 inference should theoretically have 4x as many SIMD lanes as float inference, and hence 4x the inference speed. My experiment settings are as follows:

Tuner: AutoTVM
Hardware: Pixel6 CPU
Number of tuning trials: 1500 trials per operator
Optimization level: 3
Network: ResNet18 for ImageNet
Skipped operators: the first convolution and the last linear layer are skipped according to the default configuration of TVM quantization.
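
For readers who want to reproduce the "skipped operators" setting, here is a minimal sketch of how TVM's built-in quantization is typically invoked. It assumes `mod` and `params` come from a Relay frontend such as `relay.frontend.from_pytorch`. The two options shown are, as far as I recall, the defaults of `relay.quantize.qconfig`, so treat the values as an assumption rather than the exact configuration used in this post. The helper name `quantize_relay_module` is hypothetical.

```python
# Options believed to match the default TVM quantization config:
# skip the first convolution and the final dense (linear) layer.
QCONFIG_OPTIONS = {
    "skip_conv_layers": [0],   # index 0 = first conv2d in the network
    "skip_dense_layer": True,  # leave dense layers in float
}


def quantize_relay_module(mod, params):
    """Quantize a float Relay module with TVM's own quantizer (sketch)."""
    # Imported lazily so the options above can be inspected without TVM.
    from tvm import relay

    with relay.quantize.qconfig(**QCONFIG_OPTIONS):
        return relay.quantize.quantize(mod, params)
```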

The results are listed below:

Model	TVM float (ms)	TVM INT8 (ms)
ResNet18 for ImageNet	50.5511	42.2194

I posted the code here:

The tuning command I use:

python -m baseline.tuning_main \
    --model resnet18 --quantize --tuning-records $TUNING_RECORDS \
    --target arm --key $KEY

The profiling command I use:

python -m baseline.profiling_main \
    --model resnet18 --quantize --opt-level 3 \
    --tuning-records $TUNING_RECORDS \
    --target arm --key $KEY

elenkalda-arm · October 24, 2022, 10:55am

Hi @tigertang, slower-than-expected performance can have many causes, so I don’t have a definitive answer, but one consideration is the data layout: we only get a meaningful benefit from SIMD vectors if the data that goes into them is consecutive in memory. QNN operations get legalized to various non-QNN and reduction operators, so it is highly unlikely that the data needed for these operations is always consecutive in memory. It looks like the model is currently operating on data in NCHW format? You can try experimenting with NHWC format, since that would make a lot of channelwise operations “SIMDable”. But again, it is unlikely you’d see a speedup as massive as 4x.
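
To make the layout point concrete, here is a small pure-Python sketch that computes where consecutive channels of one pixel sit in a row-major buffer. The tensor shape (1, 64, 56, 56) is an illustrative assumption, not taken from the model in this thread.

```python
def flat_offset(index, shape):
    """Row-major (C-order) element offset of a multi-dimensional index."""
    offset = 0
    for i, dim in zip(index, shape):
        offset = offset * dim + i
    return offset


# Hypothetical activation shape: batch 1, 64 channels, 56x56 spatial.
N, C, H, W = 1, 64, 56, 56


def channel_offsets_nchw(h, w, count=4):
    """Offsets of the first `count` channels of pixel (h, w) in NCHW."""
    return [flat_offset((0, c, h, w), (N, C, H, W)) for c in range(count)]


def channel_offsets_nhwc(h, w, count=4):
    """Offsets of the first `count` channels of pixel (h, w) in NHWC."""
    return [flat_offset((0, h, w, c), (N, H, W, C)) for c in range(count)]


print(channel_offsets_nchw(0, 0))  # channels are H*W elements apart
print(channel_offsets_nhwc(0, 0))  # channels are adjacent
```

In NCHW, neighbouring channels of the same pixel are H*W elements apart, so a channelwise reduction needs strided loads. In NHWC they are adjacent and can be loaded into one vector register directly.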

tigertang · October 24, 2022, 1:01pm

I think data layout can be an excuse but not a reason. Take ONNX Runtime (ORT) as an example. ONNX Runtime also uses the NCHW data layout but gets a much larger speedup from INT8 quantization than TVM does. I don’t expect a 4x speedup, but I think at least a 3x speedup is reasonable.

Model	ORT float (ms)	ORT INT8 (ms)
ResNet18 for ImageNet	67.9256	19.642

This experiment was also conducted on the Pixel6 CPU. ONNX Runtime provides a speedup of about 3.46x.
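
As a quick check of the arithmetic in this thread, the sketch below computes the theoretical lane ratio and the measured speedups from the two result tables. The 128-bit register width is an assumption (the usual NEON vector width on this class of ARM core).

```python
REGISTER_BITS = 128  # assumption: 128-bit NEON vector registers


def simd_lanes(element_bits, register_bits=REGISTER_BITS):
    """Number of elements of the given width that fit in one register."""
    return register_bits // element_bits


def speedup(float_ms, int8_ms):
    """Speedup of INT8 over float, from latencies in milliseconds."""
    return float_ms / int8_ms


lane_ratio = simd_lanes(8) / simd_lanes(32)   # 16 int8 lanes vs 4 float32 lanes
tvm_speedup = speedup(50.5511, 42.2194)       # TVM numbers from the first table
ort_speedup = speedup(67.9256, 19.642)        # ORT numbers from the second table

print(f"theoretical lane ratio: {lane_ratio:.1f}x")
print(f"TVM INT8 speedup: {tvm_speedup:.2f}x")
print(f"ORT INT8 speedup: {ort_speedup:.2f}x")
```

The lane ratio is only an upper bound on what quantization alone can deliver. The measured TVM speedup is roughly 1.2x, far below both that bound and the ORT result.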

tigertang · October 24, 2022, 1:11pm

Also, I noticed that PyTorch does not support the NHWC data layout. Does that mean we have to choose between two options: 1. use PyTorch as the frontend; 2. get a high-performance INT8 convolutional neural network implementation?

elenkalda-arm · October 24, 2022, 4:31pm

You can have TVM change the layout for you: there is a relay.transform.ConvertLayout pass you can run on the Relay mod. This pass also accepts a mapping between operators and the desired layouts for them, so you can convert the layouts selectively. For an example, see how tvmc uses it - https://github.com/apache/tvm/blob/main/python/tvm/driver/tvmc/transform.py.
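
A minimal sketch of running the pass might look like the following. It assumes `mod` is a Relay module that has already been imported. The helper names `desired_layouts` and `convert_layout` are hypothetical, and whether `qnn.conv2d` appears in your module depends on how it was quantized.

```python
def desired_layouts(layout="NHWC"):
    """Map operator names to [data_layout, kernel_layout] for ConvertLayout.

    "default" lets TVM pick the kernel layout matching the data layout.
    """
    return {
        "nn.conv2d": [layout, "default"],
        "qnn.conv2d": [layout, "default"],
    }


def convert_layout(mod, layout="NHWC"):
    """Rewrite the convolutions in a Relay module to the given layout."""
    # Imported lazily so the layout mapping can be inspected without TVM.
    import tvm
    from tvm import relay

    seq = tvm.transform.Sequential(
        [
            relay.transform.RemoveUnusedFunctions(),
            relay.transform.ConvertLayout(desired_layouts(layout)),
        ]
    )
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)
```

Operators that are not listed in the mapping keep their layout, and TVM inserts layout transforms where converted and unconverted operators meet.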

I also just noticed that you are importing a PyTorch model in float and quantizing it in TVM? I don’t actually know what kind of Relay that produces, but if it is doing float->int quantization at runtime, it sounds to me like it could be slow… cc @masahi for wisdom on PyTorch and quantization in TVM.

masahi · October 24, 2022, 7:29pm

Using TVM’s quantization is not recommended. If you already have a model quantized by ORT, it is better to import that model.
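
For anyone following this suggestion, importing an already quantized ONNX model could look roughly like the sketch below. The function name and its arguments (`model_path`, `input_name`, `input_shape`) are hypothetical placeholders. It also assumes the quantized operators in your model are covered by TVM's ONNX frontend, which is worth verifying for your TVM version.

```python
def import_prequantized_onnx(model_path, input_name, input_shape):
    """Load a quantized ONNX model and convert it to Relay (sketch).

    Returns the (mod, params) pair produced by the ONNX frontend.
    """
    # Imported lazily; both packages must be installed to call this.
    import onnx
    from tvm import relay

    onnx_model = onnx.load(model_path)
    shape_dict = {input_name: tuple(input_shape)}
    return relay.frontend.from_onnx(onnx_model, shape=shape_dict)
```

The resulting module can then go through the usual tuning and build flow in place of a module produced by TVM's own quantizer.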

nsaleh · September 20, 2024, 2:49pm

Why is “using TVM’s quantization is not recommended”?