Supporting bit-exact TFLite QNN inference

I’ve been implementing the TFLite PostProcess Detection operator, which is used in SSD Mobilenet (see https://github.com/apache/incubator-tvm/pull/4543/). However, there is a difference between the results of tflite and tvm for qnn graphs that I think is due to a difference in rounding scheme (and potentially operator lowering).

For most operators this effect is not too significant, as we can write tests with a +/- 1 tolerance on the outputs. However, part of this custom op sorts the detected objects by confidence and then takes only the top ‘n’ results, so even a small difference in the scores can change the order of the detections and therefore significantly change the output tensor. This is particularly noticeable when it causes a different detection to get clipped, as the tvm and tflite output tensors then contain different information.
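To illustrate why a +/- 1 tolerance doesn’t help here, here is a minimal numpy sketch with made-up confidence scores (not taken from the real model): every element agrees within 1, yet a different detection survives the top-n clip.

```python
import numpy as np

# Made-up uint8 confidence scores for five detections; every element differs
# by at most 1 between the two arrays, yet the top-2 selection differs.
scores_tflite = np.array([181, 180, 179, 52, 40], dtype=np.uint8)
scores_tvm    = np.array([181, 179, 180, 52, 40], dtype=np.uint8)

top_n = 2
top_tflite = np.argsort(scores_tflite)[::-1][:top_n]
top_tvm    = np.argsort(scores_tvm)[::-1][:top_n]

print(top_tflite)  # [0 1]
print(top_tvm)     # [0 2] -> a different detection survives the clip
```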

Writing end-to-end tests for this case is therefore quite difficult, and it would be preferable if we could run tvm in a ‘tflite’ mode where it uses an identical rounding scheme (and op implementations if necessary). I note that @FrozenGene is looking into this and am just posting this as an example of where bit-exact computations would be valuable. Do we have an idea of what would be required to support this behaviour?

@janimesh

My colleague @yunjing_lh is following this issue. I think he can share more detail.

@mbaret I’ve done a layer-by-layer comparison of tvm/qnn results against tflite results for a resnet, so I’m glad to see the value of bit-exact computation confirmed. I’m currently investigating the problem in my spare time. The first step is to provide a tflite-like rounding scheme in tvm, but I haven’t ruled out other potential causes of the problem, such as a discrepancy in the computation paradigm. Will let you know if any progress is made. @FrozenGene

@mbaret First of all, it seems you are working on something very tedious and painful, so just hang in there :slight_smile:

In my opinion, the major difference between TFLite and TVM comes from the FixedPointMultiply (part of the requantize operator). Whenever there is a fixed-point multiplication, we need to choose a rounding mechanism, and currently the Relay rounding is different from the TFLite rounding. As @yunjing_lh points out, there might be other factors as well, like the sequence of operators, that can cause this mismatch.
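To make the rounding difference concrete, here is a rough Python sketch (not the actual TFLite or TVM implementation, just my approximation of the two schemes) of where an ‘upward’ rounding shift and a gemmlowp-style ‘round half away from zero’ shift disagree:

```python
# A rough illustration of two ways to round the final right shift in a
# fixed point multiply. They agree on positive ties but disagree on
# negative ties, which is enough to cause +/- 1 differences after requantize.

def shift_round_upward(acc, shift):
    # Add +0.5 in fixed point, then shift: ties always round toward +infinity.
    return (acc + (1 << (shift - 1))) >> shift

def shift_round_away_from_zero(acc, shift):
    # gemmlowp-style RoundingDivideByPOT: ties round away from zero.
    mask = (1 << shift) - 1
    remainder = acc & mask
    threshold = (mask >> 1) + (1 if acc < 0 else 0)
    return (acc >> shift) + (1 if remainder > threshold else 0)

acc, shift = -20, 3          # -20 / 8 = -2.5, an exact tie (values made up)
print(shift_round_upward(acc, shift))          # -2
print(shift_round_away_from_zero(acc, shift))  # -3
```

Since the two only diverge on negative intermediate values at the halfway point, this would show up as occasional +/- 1 differences rather than a wholesale mismatch, which seems consistent with what you are observing.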

So, for your use case, maybe we can try to figure out where the differences exist. Some questions to ask:

1) Do you call requantize/FixedPointMultiply? FixedPointMultiply will also be called internally in the Add/Mul/Concatenate operators.
2) For any of the computations, is there a rounding happening?
3) For debugging, we can always dequantize and keep the computations in FP32 (a rough sketch is below).
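For (3), a rough sketch of what the FP32 comparison could look like (the scale, zero point and tensors below are placeholders, not taken from a real model):

```python
import numpy as np

def dequantize(q, scale, zero_point):
    # Map quantized values back to FP32 using the layer's quantization params.
    return scale * (q.astype(np.float32) - zero_point)

# Placeholder quantization parameters and outputs, not from a real model.
scale, zero_point = 0.05, 128
q_tflite = np.array([130, 200, 64], dtype=np.uint8)
q_tvm    = np.array([131, 200, 63], dtype=np.uint8)

diff = np.abs(dequantize(q_tflite, scale, zero_point) -
              dequantize(q_tvm, scale, zero_point))
print(diff.max())   # ~0.05: a +/- 1 quantized mismatch is one 'scale' in FP32
```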