I am still trying to grasp the structure of the TVM code base so let me summarize the current state of quantization in TVM to see if I got it right:
There is an existing way of quantizing an FP32 graph, say from TensorFlow, through an existing API (relay.quantize.qconfig & relay.quantize.quantize). I tried this path for MobileNet with several data types (int8, int16, int32) and the inference results are not even close to the ones I get with the non-quantized model. Performance is also worse, but I assume that could be addressed with AutoTVM graph optimizations.
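For concreteness, the path I tried looks roughly like the sketch below (the model import is elided and the `global_scale` value is just a placeholder, not necessarily what I used):

```python
import tvm
from tvm import relay

# Assume `mod` and `params` already hold the imported FP32 MobileNet,
# e.g. obtained via relay.frontend.from_tensorflow(...); details elided.
with relay.quantize.qconfig(global_scale=8.0):
    qmod = relay.quantize.quantize(mod, params=params)

# `qmod` can then be built and run like any other Relay module.
```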
There is also an effort underway to support importing a quantized TFLite model into TVM (see the import sketch at the end of this post). It seems that MXNet will also be supported, but I am not sure if that effort is underway. In both cases the QNN dialect is used to transform the input graph into a suitable Relay input.
In both cases the input graphs are either quantized (case 1) or manipulated (case 2) to generate a Relay int8 graph. My question is about the format of this Relay int8 graph. Assuming I have a quantized TensorFlow model, I was thinking of converting the graph into the format that Relay needs, but it seems that the QNN dialect was built for that purpose, see PR #3900. From that discussion, though, it seems that the operations were built to support TFLite. Is this the case? Is someone else working on importing quantized TensorFlow models? If so, let me know so that I can join that effort.
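For reference, the TFLite path I mentioned (case 2) is exercised roughly like this; the file name, input tensor name, shape and dtype below are placeholders for an actual quantized MobileNet .tflite file:

```python
from tvm import relay

# Load the quantized flatbuffer produced by the TFLite converter.
tflite_model_buf = open("mobilenet_v1_quant.tflite", "rb").read()
try:
    import tflite
    tflite_model = tflite.Model.GetRootAsModel(tflite_model_buf, 0)
except ImportError:
    import tflite.Model
    tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)

# The frontend emits QNN ops (qnn.conv2d, qnn.requantize, ...) which are
# then lowered to a plain int8/uint8 Relay graph.
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "uint8"},
)
```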
A quantized TF model has complex logic we would need to handle, including some special ops like FakeQuant. I think we could support it in the future, but currently TFLite handles this for us and we only need to parse the quantized TFLite model. TF, TOCO, TFLite is one complete path for supporting TF quantization-aware training.
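Roughly, that TF → TOCO → TFLite conversion for a quantization-aware-trained model looks like the sketch below; the file names, tensor names and input stats are placeholders, and the exact converter flags depend on the TensorFlow version:

```python
import tensorflow as tf

# Frozen graph from quantization-aware training: it still contains FakeQuant
# nodes, which the converter folds into real quantization parameters.
converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    "mobilenet_v1_qat_frozen.pb",
    input_arrays=["input"],
    output_arrays=["MobilenetV1/Predictions/Reshape_1"],
)
converter.inference_type = tf.uint8
# (mean, std) used to quantize the float input; values here are placeholders.
converter.quantized_input_stats = {"input": (128.0, 127.0)}

tflite_model = converter.convert()
open("mobilenet_v1_qat.tflite", "wb").write(tflite_model)
```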
Basically, there are two ways to do quantization in TVM. One way is to transform a graph that has been quantized by another framework into Relay format, and the other way is to explore our own quantization with TVM's abilities. I have done some work on the second way, see: https://github.com/dmlc/tvm/pull/3828. But I don't have enough time recently to continue working on that. It would be great if anyone could pick it up.
Sorry for the late reply, I have been busy with work. Ideally I would follow the first path you mentioned, because we already have a quantization framework in place, but let me start looking in detail at how quantization works in TVM to see how I can contribute.
Yeah, you could run it now. But be careful about one thing: though we have the same accuracy as TFLite, we cannot compare the results with TFLite elementwise, as mentioned in this PR's comments (https://github.com/dmlc/tvm/pull/3900). Personally I think we should do it as my comment said. I wish I could help to finish it in the near future.
Thanks for the clarification. Although model accuracy should be the target of an evaluation, I also agree that when evaluating the accuracy of a quantized TFLite model we need to be careful, since users might expect to get the same accuracy as with the TFLite implementation. For that reason it would be a good idea to have this "TFLite rounding" in place, so that it can be used by the TFLite frontend and we can fully compare accuracy with the TFLite implementation.
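As a toy illustration of why elementwise results can diverge even when overall accuracy matches (assuming the difference comes down to how ties are rounded during requantization):

```python
import numpy as np

# A scaled accumulator value that lands exactly halfway between two integers.
x = 2.5

# Round-half-to-even (numpy's default rounding): 2.5 -> 2
print(np.round(x))        # 2.0

# Round-half-away-from-zero (the convention TFLite's fixed-point kernels
# effectively use): 2.5 -> 3
print(np.floor(x + 0.5))  # 3.0

# An off-by-one int8 value like this rarely changes the argmax, so top-1
# accuracy can match even though individual tensors do not.
```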