TFLite Rounding

Moving the discussion here from PR https://github.com/apache/incubator-tvm/pull/4828

Please read last 5-6 comments to get the context

@ramana-arm We can certainly use TVM intrinsics (I will try this later today). But the first thing we need to settle is: do we need to follow TFLite's exact rounding? As discussed in the PR, TFLite worked backwards from the ARM instructions they wanted to use, leading to two roundings. Directly using the reference implementation would result in a large number of Relay operations, potentially hurting performance. It would also be very hard to later collect all those Relay operations and replace them with the vqrdmulh intrinsic in TVM. And even if we somehow make that work for ARM, what about other platforms?
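For context, the two roundings in question come from gemmlowp's fixed-point requantization path, which TFLite uses: one rounding inside the doubling high multiply and a second in the rounding right shift. A minimal Python sketch of that behavior (my own re-implementation following the names in gemmlowp's fixedpoint.h, not the actual reference code):

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

def saturating_rounding_doubling_high_mul(a, b):
    """First rounding: high 32 bits of 2*a*b, rounded to nearest."""
    if a == b == INT32_MIN:          # the only case where 2*a*b overflows int32
        return INT32_MAX
    ab = a * b
    nudge = (1 << 30) if ab >= 0 else 1 - (1 << 30)
    # C++ int64 division truncates toward zero, unlike Python's flooring //,
    # so truncate explicitly.
    num = ab + nudge
    return abs(num) // (1 << 31) * (1 if num >= 0 else -1)

def rounding_divide_by_pot(x, exponent):
    """Second rounding: arithmetic right shift, rounding half away from zero."""
    mask = (1 << exponent) - 1
    remainder = x & mask             # Python's & on negatives matches two's complement
    threshold = (mask >> 1) + (1 if x < 0 else 0)
    return (x >> exponent) + (1 if remainder > threshold else 0)

def tflite_requantize(x, multiplier, shift):
    # The two roundings compose; a single round-to-nearest of the exact
    # product x * multiplier / 2**(31 + shift) can differ by one ULP.
    return rounding_divide_by_pot(
        saturating_rounding_doubling_high_mul(x, multiplier), shift)
```

The point being: mimicking this bit-exactly in Relay means expressing both rounding steps (nudge, saturation check, masked shift) as separate ops, which is where the op blow-up comes from.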

So, if we are going to use tensorize for ARM anyway to emit the vqrdmulh instruction, maybe we should use the intrinsic directly rather than trying to mimic everything exactly at the Relay level.

I think this is a somewhat nasty problem, and we will have to bite one of two bullets: performance or exact-tensor match. Given that TFLite also made this choice by using two roundings, we should think about what the right thing to do is in the TVM landscape. Happy to hear others' thoughts.

How about we provide a compile option so the user can make the trade-off, like -ffast-math in clang/llvm? I feel it might be beneficial to make vqrdmulh the default behavior if it does not hurt accuracy on real datasets, since performance is in most cases what people care about when using TVM.