I am trying to quantize and tune some TF models on x86. However, the performance results are extremely poor compared with the non-quantized version. The numbers are as follows:
First model
TVM FP32: 35.05ms
TVM int8 quantization: 80ms
TVM int8 quantization + AutoTVM: 46.87ms
Second model
TVM FP32: 72.85ms
TVM int8 quantization: 159.33ms
TVM int8 quantization + AutoTVM: 112.39ms
What is the reason for such a bad performance? What can be done to try to improve performance?
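For reference, this is roughly the flow I am using; a minimal sketch, assuming `mod` and `params` come from the TensorFlow frontend, with a placeholder target string for my machine:

```python
import tvm
from tvm import relay

# mod, params obtained earlier via relay.frontend.from_tensorflow(...)
with relay.quantize.qconfig(calibrate_mode="global_scale",
                            global_scale=8.0):
    qmod = relay.quantize.quantize(mod, params)

# Placeholder target; as far as I know the int8 schedules on x86 only
# pay off with wide SIMD, e.g. -mcpu=skylake-avx512 or cascadelake.
target = "llvm -mcpu=core-avx2"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(qmod, target=target, params=params)
```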
I would suggest comparing the performance of conv2d layer by layer to see if we can improve the current int8 conv2d implementation. We can also check whether the fusion result (after the FuseOps pass) is optimal.
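For the fusion check, something like this (a rough sketch; `qmod` stands for your quantized module) will print the IR after fusion so you can inspect which ops ended up in the same fused function:

```python
import tvm
from tvm import relay

# Run the passes up to and including FuseOps, then print the module.
seq = tvm.transform.Sequential([
    relay.transform.InferType(),
    relay.transform.FoldConstant(),
    relay.transform.FuseOps(fuse_opt_level=2),
])
with tvm.transform.PassContext(opt_level=3):
    fused_mod = seq(qmod)
print(fused_mod)  # fused groups appear as functions marked Primitive=1
```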
Thanks for the suggestions. I will compare the performance of every conv2d with the TVM profiler. Regarding the fusion result, how can we verify that it is optimal?
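In case it helps others, this is roughly how I plan to get the per-layer numbers; a sketch assuming `lib` is the module returned by relay.build (the input name and shape are placeholders, and the module is called debug_runtime in older TVM releases):

```python
import numpy as np
import tvm
from tvm.contrib.debugger import debug_executor

# lib is the module returned by relay.build(...)
dev = tvm.cpu(0)
m = debug_executor.create(lib.get_graph_json(), lib.get_lib(), dev,
                          dump_root="/tmp/tvmdbg")
data = np.random.uniform(size=(1, 224, 224, 3)).astype("float32")  # placeholder shape
m.set_input("input", data)  # "input" is a placeholder input name
m.run()  # prints a per-node time breakdown and dumps traces under dump_root
```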
After checking with the TVM profiler and comparing the runs with and without quantization, it is clear that the fused convolutions are slower in the quantized version. In fact, they are twice as slow as in FP32.
I also see that, because of the data layout used, the added transpose operators are not quantized, which means that before and after every convolution there is a conversion from INT to FP. This of course adds a lot of overhead.
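This is the quick check I used to count those operators; a sketch, assuming `qmod` is the quantized module before build:

```python
from tvm import relay

# Walk the graph and count the layout/cast ops sitting between the
# quantized convolutions; each one implies an int <-> float round trip.
counts = {}

def visit(expr):
    if isinstance(expr, relay.expr.Call) and hasattr(expr.op, "name"):
        if expr.op.name in ("layout_transform", "transpose", "cast"):
            counts[expr.op.name] = counts.get(expr.op.name, 0) + 1

relay.analysis.post_order_visit(qmod["main"], visit)
print(counts)
```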
Do you have any suggestions or thoughts about this?