I’m beginning to explore quantization in TVM and was following this tutorial on the matter. However, I am seeing a performance degradation and am wondering if there is something obvious I’m missing.
I am aware there is also a tutorial on framework-quantized DNNs, but for now I am looking to run purely within TVM.
My issue is that I am seeing a serious performance degradation compared to normal full-precision inference (almost 7x!). My benchmark code is essentially just the tutorial code with some `timeit` calls and the usual inference time measurement. You can find it here as a gist.
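For reference, the core of my benchmark looks roughly like this (a minimal sketch along the lines of the tutorial, not the exact gist; `mod`/`params` come from the tutorial's ResNet-18 import, `calibration_dataset` is the tutorial's calibration generator, and reporting in milliseconds is my assumption):

```python
import timeit

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor


def quantize(mod, params, mode):
    if mode == "data_aware":
        # Per-layer activation scales calibrated with KL-divergence.
        # calibration_dataset() is the tutorial's generator of input batches (not shown).
        with relay.quantize.qconfig(calibrate_mode="kl_divergence", weight_scale="max"):
            return relay.quantize.quantize(mod, params, dataset=calibration_dataset())
    # One fixed global scale for all activations.
    with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
        return relay.quantize.quantize(mod, params)


def bench(mod, params, target="llvm", repeat=100):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    dev = tvm.device(target, 0)
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
    times = np.array(timeit.repeat(m.run, number=1, repeat=repeat)) * 1e3  # seconds -> ms
    return {"mean": times.mean(), "median": np.median(times), "std": times.std()}
```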
On an x86 CPU, here are my results:
global_scale {'mean': 265.5297714471817, 'median': 259.2771492898464, 'std': 16.433075202027943}
data_aware {'mean': 260.7999528199434, 'median': 258.1531709060073, 'std': 6.0100355264219845}
power2 {'mean': 257.3789160326123, 'median': 256.4086513593793, 'std': 11.723060988255977}
normal {'mean': 37.0397911965847, 'median': 37.15196903795004, 'std': 2.9463153435486023}
There is this thread talking about issues with quantization and AutoTVM, and I followed its suggestion of trying power-of-2 scales.
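Concretely, the power-of-2 variant just switches the weight-scale mode in the qconfig (assuming `weight_scale="power2"` is the knob that thread means):

```python
# Same as the global_scale setup above, but with power-of-2 weight scales.
with relay.quantize.qconfig(
    calibrate_mode="global_scale", global_scale=8.0, weight_scale="power2"
):
    mod = relay.quantize.quantize(mod, params)
```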
I am using the `graph` executor, which is different from the `vm` executor used in the tutorial. My impression is that `vm` is for testing purposes and `graph` is for production, but that is just from looking at the docstring.
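For completeness, here is roughly how I switch between the two executors (a sketch; `mod`, `params`, and `data` are the same as in the benchmark above):

```python
import tvm
from tvm import relay
from tvm.runtime.vm import VirtualMachine

target = "llvm"
dev = tvm.device(target, 0)

# Graph executor path (what I benchmarked above).
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# VM path, as used in the tutorial.
with tvm.transform.PassContext(opt_level=3):
    exe = relay.vm.compile(mod, target=target, params=params)
vm = VirtualMachine(exe, dev)
result = vm.invoke("main", data)
```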
The results are pretty much the same using `vm`:
global_scale {'mean': 256.5712309628725, 'median': 253.7820851430297, 'std': 10.615982748342928}
data_aware {'mean': 265.5166169255972, 'median': 259.52044147998095, 'std': 16.422343882896424}
power2 {'mean': 265.1222063973546, 'median': 261.5204192698002, 'std': 13.509436084401163}
normal {'mean': 33.59440628439188, 'median': 33.39773863554001, 'std': 5.698028490797699}
Any ideas on what is happening?