I am still trying to grasp the structure of the TVM codebase, so let me summarize the current state of quantization in TVM to see if I have it right:
- There is an existing way of quantizing an FP32 graph, say one imported from TensorFlow, through an existing API (relay.quantize.qconfig & relay.quantize.quantize). I tried this path for Mobilenet with several data types (int8, int16, int32), and the inference results are not even close to the ones I get with the non-quantized model (a rough sketch of what I ran is in the first snippet after this list). Performance is also worse, but I assume that could be addressed with AutoTVM graph optimizations.
- There is also an effort underway to support importing a quantized TFLite model into TVM (second snippet below). It seems that MXNet will also be supported, but I am not sure whether that effort is underway. In both cases the QNN dialect is used to transform the input graph into a suitable Relay input.
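For reference, here is roughly what I ran for the first path. This is only a sketch: the frozen-graph path, the input name/shape, and the qconfig values are placeholders from my experiments and may need adjusting.

```python
import tensorflow as tf
import tvm
from tvm import relay

# Load a frozen FP32 Mobilenet graph (hypothetical path).
with tf.io.gfile.GFile("mobilenet_v1_frozen.pb", "rb") as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

# Import into Relay; input name/shape assumed for Mobilenet.
mod, params = relay.frontend.from_tensorflow(
    graph_def, layout="NCHW", shape={"input": (1, 224, 224, 3)})

# Quantize the FP32 Relay graph with the existing automatic pass.
# nbit_input/nbit_weight control the data type (I also tried 16 and 32).
with relay.quantize.qconfig(nbit_input=8, nbit_weight=8, global_scale=8.0):
    qmod = relay.quantize.quantize(mod, params)
```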
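And this is my understanding of the second path (again a sketch; the model path and input name are placeholders, and I am assuming the flatbuffers-generated tflite package, whose entry point differs slightly between versions):

```python
import tflite
import tvm
from tvm import relay

# Load an already-quantized TFLite Mobilenet (hypothetical path).
# In older tflite packages this is tflite.Model.Model.GetRootAsModel.
with open("mobilenet_v1_quant.tflite", "rb") as f:
    tflite_model = tflite.Model.GetRootAsModel(f.read(), 0)

# The TFLite frontend lowers the quantized ops through the QNN dialect,
# producing a Relay graph containing qnn.* operators.
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "uint8"})
```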
In both cases the input graph is either quantized (case 1) or transformed (case 2) to produce a Relay int8 graph. My question is about the format of this Relay int8 graph. Assuming I have an already-quantized TensorFlow model, I was thinking of converting that graph into the format Relay needs. It seems that the QNN dialect was built for exactly that purpose (see issue #3900), but from that discussion it also seems the operations were built to support TFLite. Is this the case? Is someone already working on importing quantized TensorFlow models? If so, let me know so that I can join that effort.
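To make the question concrete: is the expectation that an imported quantized TensorFlow graph would be expressed with QNN ops roughly like this? A sketch only, using the op names from #3900 as I understand them; the scale/zero-point constants are made up, and the module API may differ by TVM version.

```python
import tvm
from tvm import relay

# Toy example: an fp32 input quantized to int8 and dequantized back,
# standing in for what a full quantized model would look like in Relay.
data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
scale = relay.const(0.0784, "float32")  # made-up quantization scale
zero_point = relay.const(0, "int32")    # made-up zero point

q = relay.qnn.op.quantize(data, scale, zero_point, out_dtype="int8")
dq = relay.qnn.op.dequantize(q, scale, zero_point)

mod = tvm.IRModule.from_expr(relay.Function([data], dq))
print(mod)  # shows qnn.quantize / qnn.dequantize in the Relay text format
```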
Thanks!