I’m beginning to explore quantization in TVM and was following this tutorial on the matter. However, I am seeing a performance degradation and am wondering if there is something obvious I’m missing.
I am aware there is also a tutorial on framework-quantized DNNs, but for now I am looking to run purely within TVM.
My issue is that I am seeing a serious performance degradation compared to normal full-precision inference (almost 7x!). My benchmark code is essentially just the tutorial code with some `timeit` calls and the usual inference time measurement. You can find it here as a gist.
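For reference, the core of my benchmark looks roughly like this (a minimal sketch along the lines of the tutorial, not the exact gist; `mod`/`params` come from the tutorial's ResNet-18 import, `calibration_dataset` is the tutorial's calibration generator, and reporting in milliseconds is my assumption):

```python
import timeit

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor


def quantize(mod, params, mode):
    if mode == "data_aware":
        # Per-layer activation scales calibrated with KL-divergence.
        # calibration_dataset() is the tutorial's generator of input batches (not shown).
        with relay.quantize.qconfig(calibrate_mode="kl_divergence", weight_scale="max"):
            return relay.quantize.quantize(mod, params, dataset=calibration_dataset())
    # One fixed global scale for all activations.
    with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
        return relay.quantize.quantize(mod, params)


def bench(mod, params, target="llvm", repeat=100):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    dev = tvm.device(target, 0)
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
    times = np.array(timeit.repeat(m.run, number=1, repeat=repeat)) * 1e3  # seconds -> ms
    return {"mean": times.mean(), "median": np.median(times), "std": times.std()}
```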
On an x86 CPU, here are my results:
global_scale {'mean': 265.5297714471817, 'median': 259.2771492898464, 'std': 16.433075202027943}
data_aware {'mean': 260.7999528199434, 'median': 258.1531709060073, 'std': 6.0100355264219845}
power2 {'mean': 257.3789160326123, 'median': 256.4086513593793, 'std': 11.723060988255977}
normal {'mean': 37.0397911965847, 'median': 37.15196903795004, 'std': 2.9463153435486023}
There is this thread talking about issues with quantization and AutoTVM, and I followed its suggestion of trying power-of-2 scales.
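Concretely, the power-of-2 variant just switches the weight-scale mode in the qconfig (assuming `weight_scale="power2"` is the knob that thread means):

```python
# Same as the global_scale setup above, but with power-of-2 weight scales.
with relay.quantize.qconfig(
    calibrate_mode="global_scale", global_scale=8.0, weight_scale="power2"
):
    mod = relay.quantize.quantize(mod, params)
```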
I am using the `graph` executor, which is different from the `vm` executor used in the tutorial. My impression is that `vm` is for testing purposes and `graph` is for production, but that is just from looking at the docstring.
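For completeness, here is roughly how I switch between the two executors (a sketch; `mod`, `params`, and `data` are the same as in the benchmark above):

```python
import tvm
from tvm import relay
from tvm.runtime.vm import VirtualMachine

target = "llvm"
dev = tvm.device(target, 0)

# Graph executor path (what I benchmarked above).
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# VM path, as used in the tutorial.
with tvm.transform.PassContext(opt_level=3):
    exe = relay.vm.compile(mod, target=target, params=params)
vm = VirtualMachine(exe, dev)
result = vm.invoke("main", data)
```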
The results are pretty much the same using `vm`:
global_scale {'mean': 256.5712309628725, 'median': 253.7820851430297, 'std': 10.615982748342928}
data_aware {'mean': 265.5166169255972, 'median': 259.52044147998095, 'std': 16.422343882896424}
power2 {'mean': 265.1222063973546, 'median': 261.5204192698002, 'std': 13.509436084401163}
normal {'mean': 33.59440628439188, 'median': 33.39773863554001, 'std': 5.698028490797699}
Any ideas on what is happening?