[RFC] Using arm intrinsics to implement fixed point multiplication in TVM

@tqchen The problem arises because LLVM codegen is not able to select the suitable instructions. A fixed point multiply at the Relay level has to upcast the input tensors to int64, whereas the ARM instructions that @giuseros shared take int32 tensors and perform the upcasting internally in hardware (please correct me if I am wrong - @giuseros). Therefore, QNN/Relay graphs today do not use the best possible ARM instructions.
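To make the difference concrete, here is a minimal C sketch (my own illustration, not code from TVM or the PR): the scalar helper shows the int64 upcast a generic lowering needs, while the NEON helper uses `vqrdmulhq_s32` (SQRDMULH), which keeps the operands in int32 and widens internally. The helper names are hypothetical, and it assumes an AArch64 target with `arm_neon.h` available.

```c
#include <stdint.h>
#include <arm_neon.h>

// Generic path: widen to int64, multiply, round, take the high 32 bits.
// Roughly what has to happen when no suitable ARM instruction is selected.
// (The saturation of the INT32_MIN * INT32_MIN corner case is omitted
// to keep the sketch short.)
static inline int32_t fixed_point_mul_scalar(int32_t x, int32_t multiplier) {
    int64_t prod  = (int64_t)x * (int64_t)multiplier;  // 32x32 -> 64-bit product
    int64_t nudge = (int64_t)1 << 30;                  // rounding term
    return (int32_t)((prod + nudge) >> 31);            // back to a Q0.31 value
}

// ARM path: SQRDMULH computes, per lane,
// saturate((2 * a * b + (1 << 31)) >> 32) entirely in hardware,
// so no explicit int64 intermediate tensor is needed.
static inline int32x4_t fixed_point_mul_neon(int32x4_t x, int32x4_t multiplier) {
    return vqrdmulhq_s32(x, multiplier);
}
```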

At the same time, I have similar concerns that this might be overkill. I missed this earlier, but introducing a new op prevents operator fusion, which reduces the speedup from 3% to 1.5%.