TVM INT8 quantization does not provide 4x speedup over float on ARM CPU?

I found that using INT8 quantization does not provide a 4x speedup over float inference with TVM. This is strange, because INT8 inference should theoretically have 4x the SIMD lanes of float inference and hence roughly 4x the throughput. Here are my experiment settings:

  • Tuner: AutoTVM
  • Hardware: Pixel 6 CPU
  • Number of tuning trials: 1500 per operator
  • Optimization level: 3
  • Network: ResNet18 for ImageNet
  • Skipped operators: the first convolution and the last linear layer are skipped, following the default configuration of TVM quantization (see the sketch after this list).
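
For reference, a minimal sketch of that default skip configuration, assuming the standard relay.quantize API (mod and params are placeholders produced by the frontend importer):

from tvm import relay

# These values spell out relay.quantize's defaults: the first convolution and
# the dense (linear) layer stay in float, everything else is quantized to INT8.
with relay.quantize.qconfig(skip_conv_layers=[0],    # skip the first convolution
                            skip_dense_layer=True):  # skip the last linear layer
    quantized_mod = relay.quantize.quantize(mod, params)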

The results are listed below:

Model                 | TVM float (ms) | TVM INT8 (ms)
ResNet18 for ImageNet | 50.5511        | 42.2194

I posted the code here:

The tuning command I use:

python -m baseline.tuning_main \
    --model resnet18 --quantize --tuning-records $TUNING_RECORDS \
    --target arm --key $KEY

The profiling command I use:

python -m baseline.profiling_main \
    --model resnet18 --quantize --opt-level 3 \
    --tuning-records $TUNING_RECORDS \
    --target arm --key $KEY

Hi @tigertang, slower-than-expected performance can have many causes, so I don’t have a definitive answer, but one consideration is the data layout: we only get a meaningful benefit from SIMD vectors if the data that goes into those vectors is consecutive in memory. QNN operations get legalized to various non-QNN and reduction operators, so it is unlikely that the data needed for these operations is always consecutive in memory. It looks like the model is currently operating on data in NCHW format? You can try experimenting with the NHWC format, since that would make a lot of channelwise operations “SIMDable”. But again, it is unlikely you’d see a speedup as large as 4x.


I think data layout can be part of the explanation, but it is not the whole reason. Take ONNX Runtime (ORT) as an example: ONNX Runtime also uses the NCHW data layout, yet it achieves a much higher INT8 speedup than TVM. I don’t expect a full 4x speedup, but at least a 3x speedup seems reasonable.

Model                 | ORT float (ms) | ORT INT8 (ms)
ResNet18 for ImageNet | 67.9256        | 19.642

This experiment was also conducted on the Pixel 6 CPU. ONNX Runtime achieves a 3.45x speedup.

Also, I noticed that PyTorch does not support the NHWC data layout. Does that mean we can only have one of the two: (1) use PyTorch as the frontend, or (2) get a high-performance INT8 convolutional neural network implementation?

You can have TVM change the layout for you: there is a relay.transform.ConvertLayout pass you can run on the Relay module. The pass accepts a mapping from operators to the desired layouts for them, so you can convert layouts selectively. You can see how tvmc uses it for an example: https://github.com/apache/tvm/blob/main/python/tvm/driver/tvmc/transform.py. A minimal sketch follows below.
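
A rough sketch of running ConvertLayout on an imported module (the operator list here is only an illustration; adjust it to the operators that actually appear in your model):

import tvm
from tvm import relay

# Map operators to the layouts we want; "default" lets TVM pick the kernel layout.
desired_layouts = {
    "nn.conv2d": ["NHWC", "default"],
    "qnn.conv2d": ["NHWC", "default"],
}

seq = tvm.transform.Sequential([
    relay.transform.RemoveUnusedFunctions(),
    relay.transform.ConvertLayout(desired_layouts),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)  # mod is the Relay module produced by the frontend importer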

I also just noticed that you are importing a PyTorch model in float and quantizing it with TVM? I don’t actually know what kind of Relay that produces, but if it is doing float->int quantization at runtime, that sounds like it could be slow… cc @masahi for wisdom on PyTorch and quantization in TVM.
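
To be explicit about the flow I think is being used here (a sketch only; the input name and shape are placeholder assumptions):

import torch
import torchvision
from tvm import relay

# Trace a float PyTorch model and import it into Relay.
model = torchvision.models.resnet18(pretrained=True).eval()
example_input = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(model, example_input)
mod, params = relay.frontend.from_pytorch(scripted, [("input0", (1, 3, 224, 224))])

# Then quantize the float Relay module with TVM's own quantization pass.
with relay.quantize.qconfig():  # defaults: skip the first conv and the dense layer
    mod = relay.quantize.quantize(mod, params)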


Using TVM’s quantization is not recommended. If you already have a model quantized by ORT, it is better to import that model.
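
For instance, importing an ORT-quantized ONNX model would look roughly like this (a sketch; the file name and input name/shape are placeholders):

import onnx
from tvm import relay

# The model was already quantized by ONNX Runtime's tooling, so the QDQ/QLinear
# operators come in directly through TVM's ONNX frontend.
onnx_model = onnx.load("resnet18_int8.onnx")
shape_dict = {"input": (1, 3, 224, 224)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)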

Why is “using TVM’s quantization” not recommended?