I found that INT8 quantization does not give a 4x speedup over float inference with TVM. This is surprising, because INT8 inference should in theory use 4x as many SIMD lanes as float32 inference and therefore run roughly 4x faster. My experiment settings are:
- Tuner: AutoTVM
- Hardware: Pixel 6 CPU
- Number of tuning trials: 1500 trials per operator
- Optimization level: 3
- Network: ResNet18 for ImageNet
- Skipped operators: the first convolution and the last dense (linear) layer are left in float, per the default configuration of TVM quantization (see the sketch right after this list).
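For reference, this is roughly how the quantization is set up. It is a minimal sketch using the stock `relay.quantize` flow and a `relay.testing` ResNet18 workload, not my exact script; the calibration mode and global scale are the library defaults, used here just for illustration:

```python
# Minimal sketch of the quantization step (assumes the standard relay.quantize flow).
import tvm
from tvm import relay
from tvm.relay import testing

# ResNet18 workload; the real run uses an ImageNet-pretrained model instead.
mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)

# Default qconfig: skip_conv_layers=[0] keeps the first conv in float,
# and skip_dense_layer=True keeps the (last) dense layer in float.
with relay.quantize.qconfig(calibrate_mode="global_scale",
                            global_scale=8.0,
                            skip_conv_layers=[0],
                            skip_dense_layer=True):
    mod = relay.quantize.quantize(mod, params=params)
```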
The results are listed below:
| Model | TVM float32 (ms) | TVM INT8 (ms) |
|---|---|---|
| ResNet18 for ImageNet | 50.5511 | 42.2194 |
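That works out to 50.5511 / 42.2194 ≈ 1.2x, nowhere near the theoretical 4x.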
I posted the code here:
The tuning command I use:
python -m baseline.tuning_main \
--model resnet18 --quantize --tuning-records $TUNING_RECORDS \
--target arm --key $KEY
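For context, `baseline.tuning_main` is my own wrapper; it corresponds roughly to the standard AutoTVM flow sketched below. This is only a sketch: the RPC device key, tracker address, and log path are placeholders, not the values I actually use:

```python
# Rough sketch of the AutoTVM tuning flow behind baseline.tuning_main
# (not the exact script; RPC key, tracker address, and log path are placeholders).
import tvm
from tvm import autotvm, relay
from tvm.relay import testing

target = tvm.target.Target("llvm -device=arm_cpu -mtriple=aarch64-linux-android")

# Quantized ResNet18 module, produced as in the quantization sketch above.
mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)
with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    mod = relay.quantize.quantize(mod, params=params)

# Extract the tunable conv2d/dense tasks from the Relay module.
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(build_func="ndk", timeout=15),
    runner=autotvm.RPCRunner("pixel6", host="127.0.0.1", port=9190,
                             number=10, repeat=1, timeout=30),
)

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    # 1500 trials per operator, matching the setting listed above.
    tuner.tune(n_trial=min(1500, len(task.config_space)),
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file("resnet18_int8.log")])
```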
The profiling command I use:
python -m baseline.profiling_main \
--model resnet18 --quantize --opt-level 3 \
--tuning-records $TUNING_RECORDS \
--target arm --key $KEY
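And the profiling step corresponds roughly to the sketch below: compile with the tuning records at opt_level=3, push the library to the phone over RPC, and time it there. Again, tracker address, device key, and file paths are placeholders, and `mod`/`params` are the quantized module from the tuning sketch above:

```python
# Rough sketch of the measurement behind baseline.profiling_main
# (not the exact script; tracker address, device key, and paths are placeholders).
import tvm
from tvm import autotvm, relay, rpc
from tvm.contrib import graph_executor, ndk

target = tvm.target.Target("llvm -device=arm_cpu -mtriple=aarch64-linux-android")

# Apply the tuned schedules from the log, then build at opt_level=3.
with autotvm.apply_history_best("resnet18_int8.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)

# Push the compiled library to the phone via the RPC tracker and time it there.
lib.export_library("net.so", fcompile=ndk.create_shared)
tracker = rpc.connect_tracker("127.0.0.1", 9190)
remote = tracker.request("pixel6")
remote.upload("net.so")
rlib = remote.load_module("net.so")
dev = remote.cpu()
module = graph_executor.GraphModule(rlib["default"](dev))
print(module.benchmark(dev, number=10, repeat=10))
```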