Thanks, everyone, for the insightful discussions. I also want to point everyone back to the original quantization workflow RFC: https://github.com/dmlc/tvm/issues/2259
There are a few facts that I want to highlight.
There is more than one quantization scheme, and different schemes lead to different speed-accuracy tradeoffs
“Quantization” is a generic term that has been used for many methods. Specifically, there are choices of (see the sketch after this list):
- Different bit widths and signed vs. unsigned types in different layers
- Symmetric vs. asymmetric quantization
- Using floating-point multiplication for rescaling vs. forcing integer-only multiply-and-shift
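To make these choices concrete, here is a minimal NumPy sketch (illustrative only, not part of the proposal) contrasting a symmetric and an asymmetric 8-bit scheme; the helper names are hypothetical:

```python
import numpy as np

def symmetric_quantize(x, num_bits=8):
    # Symmetric: zero point fixed at 0, only a scale is needed.
    qmax = 2 ** (num_bits - 1) - 1             # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def asymmetric_quantize(x, num_bits=8):
    # Asymmetric: a separate zero point covers [min, max] exactly,
    # at the cost of zero-point terms showing up in every quantized op.
    qmin, qmax = 0, 2 ** num_bits - 1          # [0, 255] for uint8
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

x = np.random.randn(8).astype("float32")
print(symmetric_quantize(x))
print(asymmetric_quantize(x))
```

The rescaling after an integer conv/dense is where the third choice shows up: we can multiply by the float scale directly, or approximate it with a fixed-point integer multiply followed by a right shift.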
Most frameworks, like TFLite, try to have one opinionated view of what a quantized model looks like. There is a real need to support importing these models, but we also need to keep in mind that directly adopting the implied scheme may not give the best performance.
Sometimes we might even want to use a mixed scheme within a single neural network. As many of the discussions here have mentioned, most of the trouble comes from the asymmetric scheme, since the nonzero zero points introduce extra cross terms into ops like conv2d and dense.
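To make that trouble concrete, here is the standard expansion of a quantized dot product with activation/weight zero points $z_a$, $z_w$ (a reasoning aid, not text from the RFC):

$$
\sum_{i=1}^{n} (q_{a,i} - z_a)(q_{w,i} - z_w)
= \sum_i q_{a,i}\, q_{w,i}
- z_w \sum_i q_{a,i}
- z_a \sum_i q_{w,i}
+ n\, z_a z_w
$$

With a symmetric scheme ($z_a = z_w = 0$) only the first term survives, which is just an ordinary integer conv2d/dense; the other three terms are what force either extra ops or special quantized kernels.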
We should always keep this in mind: there is no single perfect solution, and we want to build the API so that we can cover more of these schemes.
How many quantized ops do we want to introduce?
As a general rule of thumb, I would recommend that we minimize the number of new operators as well as their names. For example, both quantize and dequantize can likely be expressed in terms of subtraction, multiplication, and cast, which means we do not need to introduce them as operators, at least not at the core level. The fewer new operators we introduce, the easier it is to reuse the existing infrastructure. We can likely do a similar thing for relu (which becomes clip) and maxpool (which stays the same, assuming a single zero point). As a matter of fact, giving up asymmetry would let most of the quantization pipeline fall onto the current core operators.
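As a concrete (and hedged) illustration of that decomposition, here is a minimal sketch of expressing quantize/dequantize with existing core ops. It assumes the usual Relay op names (divide, round, clip, cast, subtract, multiply); `lower_quantize` / `lower_dequantize` are hypothetical helper names, not proposed API:

```python
from tvm import relay

def lower_quantize(data, scale, zero_point, out_dtype="int8"):
    # q = cast(clip(round(data / scale) + zero_point, qmin, qmax), int8)
    scaled = relay.divide(data, relay.const(scale, "float32"))
    shifted = relay.add(relay.round(scaled), relay.const(float(zero_point), "float32"))
    clipped = relay.clip(shifted, a_min=-128.0, a_max=127.0)
    return relay.cast(clipped, out_dtype)

def lower_dequantize(qdata, scale, zero_point):
    # r = scale * (cast(q, float32) - zero_point)
    shifted = relay.subtract(relay.cast(qdata, "float32"),
                             relay.const(float(zero_point), "float32"))
    return relay.multiply(shifted, relay.const(scale, "float32"))

x = relay.var("x", shape=(1, 16), dtype="float32")
y = lower_dequantize(lower_quantize(x, scale=0.05, zero_point=0),
                     scale=0.05, zero_point=0)
print(relay.Function([x], y))
```

With a symmetric scheme the zero-point add/subtract disappears entirely, leaving only multiply, round, clip, and cast from the existing op set.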
We can, however, introduce dialects that facilitate the conversion; in that case relay.contrib.q_conv2d is an OK name, as long as we try to lower as many of these ops as possible to the low-level core operators.