[RFC][Quantization] Quantization in TVM

I’ve done a first pass through the code and left some comments there. I generally like the pattern-matching-based QNN op rewrite and the calibration flow, but I have a major concern about how you are approaching requantize.

First of all, regardless of the reason for introducing requantize in a later pass, I think the current implementation is too ad hoc and brittle when dequantize/quantize are not done back to back; see [WIP] [Quantization] Quantization in TVM by electriclilies · Pull Request #7474 · apache/tvm · GitHub. I’m fairly sure we will end up with more dequantize/quantize ops than necessary. For standard ImageNet models there should be only one quantize/dequantize pair (not counting weight/bias quantize), and possibly one more between the last convolution and the dense layer. Anything more than that is not acceptable for integer-only quantization, because integer-only execution is the norm in PyTorch/TFLite. In addition to accuracy and performance metrics on ImageNet models, I’d like to see the number of quantize/dequantize ops remaining after the requantize rewrite, for each ImageNet model.
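For reference, here is a minimal sketch of how that count could be gathered from the rewritten module, assuming it is an ordinary Relay IRModule; the helper name is mine, not something in the PR.

```python
from collections import Counter

import tvm
from tvm import relay


def count_qdq_ops(mod):
    """Count the qnn quantize/dequantize/requantize calls left in the module."""
    counts = Counter()

    def visit(node):
        # Only count calls whose callee is a primitive op (not a Function/GlobalVar).
        if isinstance(node, relay.Call) and isinstance(node.op, tvm.ir.Op):
            if node.op.name in ("qnn.quantize", "qnn.dequantize", "qnn.requantize"):
                counts[node.op.name] += 1

    relay.analysis.post_order_visit(mod["main"], visit)
    return counts


# e.g. print(count_qdq_ops(rewritten_mod)) for each ImageNet model
```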

Second, if I understand your explanation of why requantize is done this way, the root issue boils down to the fact that you are doing calibration on a QNN graph. Sure, if you need to instantiate a QNN graph before scales and zero points are determined, you cannot create requantize ops. But reading your code, I realized that calibration can be done on either the fp32 or the QNN graph, so if I decide to calibrate only on the fp32 graph, your argument for introducing requantize later no longer applies. Moreover, even if I decided to calibrate on the QNN graph, I don’t see why we need to go through the complicated and error-prone rewriting process to introduce requantize ops. After you determine all quantization parameters, you should be able to create a new QNN graph from the fp32 graph again, this time using requantize ops. The way I convert a quantized PyTorch model to QNN is similar: one pass to make all qparams explicit in the graph, and a second pass to instantiate each QNN node, including requantize.
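To make the suggestion concrete, here is a self-contained sketch of what that second pass would produce for a single conv layer once fp32 calibration has fixed the qparams. All scale/zero-point values below are made up for illustration; the point is that the activation is quantized once at the boundary and layer-to-layer transitions use requantize only.

```python
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")

# qparams that would come out of fp32 calibration (hypothetical values)
in_scale, in_zp = relay.const(0.05), relay.const(0, "int32")
w_scale, w_zp = relay.const(0.02), relay.const(0, "int32")
out_scale, out_zp = relay.const(0.1), relay.const(0, "int32")

# quantize only at the graph boundary ...
qdata = relay.qnn.op.quantize(data, in_scale, in_zp, out_dtype="int8")
qweight = relay.qnn.op.quantize(weight, w_scale, w_zp, out_dtype="int8")

# ... run the conv in int8 with an int32 accumulator ...
conv = relay.qnn.op.conv2d(
    qdata, qweight,
    input_zero_point=in_zp, kernel_zero_point=w_zp,
    input_scale=in_scale, kernel_scale=w_scale,
    kernel_size=(3, 3), channels=16, padding=(1, 1),
    out_dtype="int32",
)

# ... and requantize straight into the next layer's int8 domain, so no
# dequantize/quantize pair is needed between layers.
acc_scale = relay.const(0.05 * 0.02)  # accumulator scale = input_scale * kernel_scale
out = relay.qnn.op.requantize(
    conv,
    input_scale=acc_scale, input_zero_point=relay.const(0, "int32"),
    output_scale=out_scale, output_zero_point=out_zp,
    out_dtype="int8",
)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
```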

Finally, I’m not sure doing calibration on a QNN graph is a good idea. I believe the standard approach is to calibrate on an fp32 graph and then construct a quantized graph with requantize using the calculated qparams. You mentioned something about KL, but if I remember correctly, KL only needs the histogram of activations, so absolute values don’t matter and dequantize is not necessary (I could be wrong, though). Instantiating a QNN graph before qparams are chosen also introduces a nasty problem of how to decide the initial parameters. The final parameters depend on the initial ones, so the choice of initial values shouldn’t be arbitrary; your code initializes all scales to 1, but I think that is incorrect (see [WIP] [Quantization] Quantization in TVM by electriclilies · Pull Request #7474 · apache/tvm · GitHub). Anyway, I think the rationale for doing calibration on a QNN graph is questionable; it should be a niche use case at best, and I’m not convinced it justifies all the rewriting complexity introduced for requantize.
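To illustrate why the fp32 graph is enough: both a simple max-abs rule and a KL-style threshold search consume only fp32 activation statistics. A minimal sketch, assuming the fp32 activations of a layer have already been recorded as NumPy arrays during calibration runs (the collection hook itself is not shown):

```python
import numpy as np


def scale_from_maxabs(activations, num_bits=8):
    """Symmetric scale from the max absolute value of recorded fp32 activations."""
    amax = max(np.abs(a).max() for a in activations)
    return amax / (2 ** (num_bits - 1) - 1)


def histogram_for_kl(activations, bins=2048):
    """The only input a KL-divergence threshold search needs: a histogram."""
    amax = max(np.abs(a).max() for a in activations)
    hist = np.zeros(bins)
    for a in activations:
        h, _ = np.histogram(np.abs(a), bins=bins, range=(0, amax))
        hist += h
    return hist, amax
```

Either way, no QNN graph has to exist before the qparams are chosen, and no initial scale values are needed.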

Overall, if we want to deviate from the standard approach that is proven to work well, there should be a very good reason to do so. And ideally the justification should come with a working demonstration, rather than hand-waving explanations alone.

We can add a dequantize after the requantize, so this is not a problem.
