Slower Execution Times After 8 bit Quantization?

kfeng123 · August 11, 2023, 2:22am

This is currently a common problem, and the suboptimal performance can come from several distinct sources.

At the computational graph level, a quantized model may introduce many additional operators, for example cast ops.
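To make the extra operators concrete, here is a toy sketch in plain Python (not TVM code). It compares a float dot product with an int8-quantized one and records each additional step by name. The step names, the scale of 1/64, and the helper functions are all illustrative assumptions; the actual operator sequence that TVM emits depends on the quantization config and the passes that run.

```python
# Toy illustration only: the op names below are descriptive labels,
# not the exact operators that TVM inserts.

def float_dot(x, w):
    """Float path: a single multiply-accumulate over the inputs."""
    return sum(a * b for a, b in zip(x, w))


def quantize(values, scale, trace):
    """Map floats to the int8 range: divide by scale, round, clip, cast."""
    trace.extend(["divide", "round", "clip", "cast(int8)"])
    return [max(-128, min(127, round(v / scale))) for v in values]


def quantized_dot(x, qw, x_scale, w_scale, trace):
    """Quantized path: activations are quantized at runtime, weights offline."""
    qx = quantize(x, x_scale, trace)
    trace.append("cast(int32)")  # widen so the accumulator cannot overflow
    acc = sum(a * b for a, b in zip(qx, qw))
    trace.append("dot(int32)")
    trace.extend(["cast(float32)", "multiply"])  # dequantize the result
    return acc * x_scale * w_scale


x = [0.5, -1.0, 0.25]
w = [1.0, 0.5, -2.0]
scale = 1 / 64  # hypothetical scale, chosen so these values quantize exactly

# Weights are quantized ahead of time, so their steps are not counted at runtime.
qw = quantize(w, scale, trace=[])

trace = []
exact = float_dot(x, w)
approx = quantized_dot(x, qw, scale, scale, trace)
num_casts = sum(1 for op in trace if op.startswith("cast"))
print(exact, approx, trace)
```

In this toy case the single float op turns into eight runtime steps, three of which are casts. If the int8 kernel itself is not much faster than the float one on your target, this overhead alone can make the quantized model slower.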

There are indeed some optimizations that could be done (but are not yet done) for quantized models. I have recently been working on some computational-graph-level optimizations, which will hopefully be upstreamed to the main branch within a month. For now, if you are interested, you can try running your model in this version and see whether there is a significant improvement.

Another aspect is relay.quantize.quantize. Quantization in TVM may not be well optimized at present. You can find some proposals to improve relay.quantize.quantize on the forum. However, it seems that only some of these proposals were eventually implemented in the main branch. Personally, I would like to systematically improve relay.quantize.quantize in the future. For now, you could try using another tool (perhaps TFLite?) to do the quantization and then import the quantized model into TVM, which may help improve performance.
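A rough sketch of the "quantize elsewhere, then import" route, assuming you already have a quantized .tflite file and the tflite Python package installed. The function name, the default input dtype, and the default target are my own placeholders, so adjust them to your model and hardware.

```python
def build_from_quantized_tflite(model_path, input_name, input_shape,
                                input_dtype="uint8", target="llvm"):
    """Import an already-quantized TFLite model into Relay and compile it.

    model_path, input_name and input_shape are placeholders for your model.
    """
    # Imported lazily so the sketch can be loaded without TVM installed.
    import tvm
    from tvm import relay

    with open(model_path, "rb") as f:
        buf = f.read()

    # The tflite package has exposed the parser under two module layouts.
    try:
        import tflite
        model = tflite.Model.GetRootAsModel(buf, 0)
    except AttributeError:
        import tflite.Model
        model = tflite.Model.Model.GetRootAsModel(buf, 0)

    # The frontend keeps the quantization done by TFLite instead of
    # running relay.quantize.quantize on a float model.
    mod, params = relay.frontend.from_tflite(
        model,
        shape_dict={input_name: input_shape},
        dtype_dict={input_name: input_dtype},
    )

    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    return lib
```

You would call it as something like build_from_quantized_tflite("model_quant.tflite", "input", (1, 224, 224, 3)), where the file name, input name, and shape are hypothetical. Whether this ends up faster than the relay.quantize.quantize path depends on the model and the target, so it is worth timing both.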