[RFC] Using arm intrinsics to implement fixed point multiplication in TVM

It makes sense now, thanks a lot @kparzysz !

fpmq would be my intrinsic, and it does the fixed point multiplication:

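def fixed_point_multiply(x, y, n):
    # Widen to 64 bits so the 32-bit x 32-bit product cannot overflow.
    x = cast(x, int64) * y
    # Add half of 2^n so the right shift rounds to nearest instead of truncating.
    pos_rounding_value = 1 << (n - 1)
    x = x + pos_rounding_value
    x = x >> n
    return cast(x, int32)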

I call it from the TOPI operator, which I can overload for the arm target so that it uses arm intrinsics.
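For anyone who wants to check the arithmetic outside of TVM, here is a minimal plain-Python sketch of the same computation. It is only an illustration, not the actual intrinsic: Python's unbounded integers stand in for the int64 intermediate, and the final range check is my own stand-in for the cast back to int32.

```python
def fixed_point_multiply(x: int, y: int, n: int) -> int:
    """Return x * y / 2**n rounded to nearest, using integer arithmetic only."""
    assert n >= 1, "the rounding term 1 << (n - 1) needs n >= 1"
    # Python ints are unbounded, so this stands in for the int64 product.
    product = x * y
    # Add half of 2**n so the shift rounds to nearest (ties go toward +infinity).
    product += 1 << (n - 1)
    # Python's >> on ints is an arithmetic shift (floors toward -infinity).
    result = product >> n
    # Stand-in for the cast back to int32 (assumption: overflow is an error here).
    assert -(1 << 31) <= result < (1 << 31), "result does not fit in int32"
    return result


# 0.5 expressed as a Q31 multiplier is 1 << 30.
half_q31 = 1 << 30
print(fixed_point_multiply(100, half_q31, 31))  # 100 * 0.5 -> 50
print(fixed_point_multiply(3, half_q31, 31))    # 1.5 rounds up -> 2
```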

However, I am slightly worried about performance: in the default non-arm case, I would do two shifts (by n and by s) instead of combining everything into a single shift by n+s, which is called total_right_shift in the original code.
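To make the two variants concrete, here is a small plain-Python sketch. The helper names (rounding_right_shift, single_shift, two_shifts) are made up for this post, and I am assuming that the second shift by s also rounds to nearest. Under that assumption the two variants usually agree, but rounding twice can be off by one compared to rounding once, so this may matter for accuracy as well as for performance.

```python
def rounding_right_shift(value: int, shift: int) -> int:
    """Shift right by `shift` bits, rounding to nearest (ties toward +infinity)."""
    return (value + (1 << (shift - 1))) >> shift


def single_shift(x: int, y: int, n: int, s: int) -> int:
    """One combined rounding shift by n + s (the total_right_shift approach)."""
    return rounding_right_shift(x * y, n + s)


def two_shifts(x: int, y: int, n: int, s: int) -> int:
    """Rounding shift by n, then a second rounding shift by s (assumed to round)."""
    return rounding_right_shift(rounding_right_shift(x * y, n), s)


# Typical case: both variants agree.
print(single_shift(100, 1 << 30, 31, 1), two_shifts(100, 1 << 30, 31, 1))

# Small case where rounding twice differs: 3 / 8 = 0.375.
print(single_shift(3, 1, 2, 1), two_shifts(3, 1, 2, 1))
```

With n = 2 and s = 1, the product 3 rounds to 0 with a single shift, while the two-step version first rounds 0.75 up to 1 and then rounds 0.5 up to 1 again.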

What do you think?