I found that INT8 quantization does not provide a 4x speedup over float inference with TVM. This is strange, because INT8 inference should theoretically have 4x as many SIMD lanes as float inference and hence run 4x faster. The following are my experiment settings:
Tuner: AutoTVM
Hardware: Pixel6 CPU
Number of tuning trials: 1500 trials per operator
Optimization level: 3
Network: ResNet18 for ImageNet
Skipped operators: the first convolution and the last linear layer are skipped according to the default configuration of TVM quantization.
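For reference, the 4x expectation above is just a lane count. Here is a minimal sketch of that arithmetic, assuming 128-bit SIMD registers (as in Arm NEON); the register width and the helper name `simd_lanes` are my assumptions, not something stated in the experiment settings.

```python
# Assumption: 128-bit SIMD registers (e.g. Arm NEON). Not stated in the post.
REGISTER_BITS = 128


def simd_lanes(element_bits: int, register_bits: int = REGISTER_BITS) -> int:
    """Number of elements of the given width that fit in one SIMD register."""
    return register_bits // element_bits


float32_lanes = simd_lanes(32)  # float32 elements per register
int8_lanes = simd_lanes(8)      # int8 elements per register
lane_ratio = int8_lanes // float32_lanes

print(f"float32 lanes: {float32_lanes}, int8 lanes: {int8_lanes}, ratio: {lane_ratio}x")
```

This only counts lanes; it is an upper bound on the arithmetic throughput gain, not a prediction of end-to-end latency.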
Hi @tigertang, slower-than-expected performance can have many causes, so I don’t have a definitive answer, but one consideration is the data layout - we only get a meaningful benefit from SIMD vectors if the data that goes into them is consecutive in memory. QNN operations get legalized to various non-QNN and reduction operators, so it is highly unlikely that the data needed for these operations would always be consecutive in memory. It looks like the model is currently operating on data in NCHW format? You can try experimenting with the NHWC format, since that would make a lot of channelwise operations “SIMDable”. But again, it is unlikely you’d see a speedup as massive as 4x.
I think data layout can be an excuse but not a reason. Take ONNX Runtime (ORT) as an example. ONNX Runtime also uses the NCHW data layout but has a much higher speedup than TVM in INT8 quantization.
I don’t expect a 4x speedup, but I think at least a 3x speedup is reasonable.
| Model | ORT float (ms) | ORT INT8 (ms) |
|---|---|---|
| ResNet18 for ImageNet | 67.9256 | 19.642 |
This experiment was also conducted on the Pixel 6 CPU. ONNX Runtime provides a roughly 3.46x speedup (67.9256 ms / 19.642 ms).
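The ONNX Runtime speedup can be recomputed directly from the two latencies reported above. This is only a small sketch; the variable names are mine.

```python
# Latencies reported above for ResNet18 on the Pixel 6 CPU, in milliseconds.
ort_float_ms = 67.9256
ort_int8_ms = 19.642

# Speedup is the float latency divided by the INT8 latency.
ort_speedup = ort_float_ms / ort_int8_ms

print(f"ORT INT8 speedup over float: {ort_speedup:.3f}x")
```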
Also, I noticed that PyTorch does not support the NHWC data layout. Does that mean we have to choose between two options: 1. use PyTorch as the frontend; or 2. get a high-performance INT8 convolutional neural network implementation?
You can have TVM change the layout for you: there is a relay.transform.ConvertLayout pass that you can run on the Relay mod. This pass also accepts a mapping between operators and their desired layouts, so you can convert the layouts selectively. For an example, see how tvmc uses it - https://github.com/apache/tvm/blob/main/python/tvm/driver/tvmc/transform.py.
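A minimal sketch of running that pass, assuming `mod` is a Relay module you have already imported (for example from the PyTorch frontend). The helper names `nhwc_layouts` and `convert_to_nhwc` are hypothetical; the operator-to-layout mapping, where each value is `[data_layout, kernel_layout]` and `"default"` lets TVM pick the kernel layout, follows the form shown in the TVM ConvertLayout documentation.

```python
def nhwc_layouts(ops=("nn.conv2d",)):
    """Map each operator name to [data_layout, kernel_layout].

    "default" asks TVM to choose the kernel layout for the given data layout.
    """
    return {op: ["NHWC", "default"] for op in ops}


def convert_to_nhwc(mod, ops=("nn.conv2d",)):
    """Return a copy of the Relay module with the listed ops converted to NHWC."""
    # Imported lazily so the mapping helper can be used without TVM installed.
    import tvm
    from tvm import relay

    seq = tvm.transform.Sequential(
        [
            relay.transform.RemoveUnusedFunctions(),
            relay.transform.ConvertLayout(nhwc_layouts(ops)),
        ]
    )
    # ConvertLayout is registered at opt_level 3, so the context must allow it.
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)
```

Usage would be something like `mod = convert_to_nhwc(mod)` before building. If the module contains QNN operators, you could pass `ops=("nn.conv2d", "qnn.conv2d")` to convert those as well. Whether this helps in your case is something you would have to measure.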
I also just noticed that you are importing a PyTorch model in float and quantizing it in TVM? I don’t actually know what kind of Relay that produces, but if it is doing float->int quantization at runtime, it sounds to me like it could be slow… cc @masahi for wisdom on PyTorch and quantization in TVM.