`fmpq` intrinsic sounds good; it would be good to also look up precedent for the intrinsic name and discuss a few naming alternatives.
Unfortunately I wasn't able to find many quantized intrinsics around (most of them are for floating point).
Anyway, for the intrinsic I was thinking of something like:
- `mulq`
- `mul_q`
- `qmul`
- `q_mul`
I would say to give the TOPI operator and the Relay operator the same name, which could be picked from:
- `fixed_point_multiply` (as before)
- `fixed_point_multiply_and_scale`
- `fixed_point_rescale`
I am not extremely good with names, so if you come up with better ideas, I am all ears.
cc: @tqchen, @kparzysz, @anijain2305
Hi all,
I had a go at it, but there is a small issue with accuracy.
The way it was before, everything was done in `int64`, shifting and casting at the very end. Now the `qmul` intrinsic introduces a shift (losing some precision) and a cast to `int32`, and then I shift further in order to scale (plus round up). In floating point, this is the equivalent of doing `round(round(x*y)/scale)` instead of `round(x*y/scale)`.
For some edge cases, e.g. 23/16, this produces the wrong value:
- `round(round(23*0.5)/8) = 2`
- `round(23*0.5/8) = 1`
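To make the edge case concrete, here is a toy illustration in plain Python (not the actual TIR lowering; Python's rounding happens to agree with round-to-nearest here):

```python
# Single rounding (old all-int64 path) vs double rounding (qmul + extra shift),
# for x = 23, multiplier = 0.5, scale = 8.
x, y, scale = 23, 0.5, 8

single = round(x * y / scale)          # round(1.4375) -> 1
double = round(round(x * y) / scale)   # round(round(11.5) / 8) = round(1.5) -> 2

print(single, double)  # 1 2
```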
This is exactly the same rounding error I would introduce through the Arm intrinsics, and it should match TFLite. So I think it should be fine, although this is something we have to decide (because I would need to add a ±1 tolerance to the tests).
The alternatives to this are:
- going back to the previous design (where everything was done within the intrinsic)
- moving everything into TOPI (without adding the intrinsic).
What do you think? @tqchen @anijain2305 @kparzysz
I think it is fine.
As long as the behavior is well documented, I agree that it is fine.
Sorry for the late reply. I think it is fine too with some minor suggestions.
I would encourage ensuring that things work well on non-ARM platforms as well. We want to ensure that a Relay operator produces the same output when compiled and executed for an ARM and a non-ARM machine. Secondly, we want to ensure that this does not severely degrade performance on a non-ARM machine.
The reason I ask is that we are doing two roundings here, and we are thinking backwards from the ARM instructions' standpoint while writing the TVM compute (whereas the traditional way is to first write the TVM compute and then find suitable instructions). A non-ARM machine might not have those instructions, and two roundings can degrade performance (hopefully minimally).
Therefore, I ask that we ensure good performance on non-ARM machines while maintaining accuracy. Similar to TFLite, there has been some x86 work to match MKLDNN performance, and we should ensure that the rough numbers remain the same.
So, I did a bit of investigation, and @anijain2305, I think you are right.
In my (general) TOPI operator, I do as follows:
```python
val = x(*indices)
# apply the optional left shift first
val = tvm.tir.if_then_else(left_shift > 0, val << left_shift, val)
# multiply by the Q31 fixed-point multiplier
mulq = tvm.tir.fixed_point_multiply(val, multiplier, 31)
# scale back down, rounding to nearest via the nudge constant
return (mulq + nudge) >> right_shift
```
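For context, this is roughly how it sits inside the TOPI compute (a sketch only: `fixed_point_multiply_compute` is a placeholder name, and `tvm.tir.fixed_point_multiply` is the work-in-progress intrinsic from this RFC):

```python
import tvm
from tvm import te

def fixed_point_multiply_compute(x, multiplier, left_shift, right_shift):
    """Sketch of the TOPI compute built on top of the proposed intrinsic."""
    nudge = (1 << (right_shift - 1)) if right_shift > 0 else 0

    def _fpm(*indices):
        val = x(*indices)
        val = tvm.tir.if_then_else(left_shift > 0, val << left_shift, val)
        mulq = tvm.tir.fixed_point_multiply(val, multiplier, 31)
        return (mulq + nudge) >> right_shift

    return te.compute(x.shape, _fpm, name="fixed_point_multiply")
```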
My hope (as @kparzysz was suggesting) was that the last shift, being a constant, would have been folded with the shift in `fixed_point_multiply`. However, this doesn't seem to be the case. Indeed, when I look at the generated assembly (without using the version overloaded with the Arm intrinsic), I see an additional `sshr vX.4s, vX.4s, #const` instruction, which is exactly the last round-up I am doing.
If I remove the last shift and only return `mulq`, the `sshr` instruction disappears. I think LLVM can fold two consecutive shifts, but not a sequence of shift - cast - add - shift. Even if I remove the cast, there is always an `add` in the middle (which I don't think I can remove) that makes the shift fusion impossible.
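To make it concrete, here is a toy scalar example (plain Python, not the actual vectorized code) of why the two shifts cannot legally be merged once the rounding add sits between them:

```python
x, a, b = 28, 1, 3
nudge = 1 << (b - 1)

folded   = x >> (a + b)               # what shift folding would produce: 1
unfolded = ((x >> a) + nudge) >> b    # what the compute actually asks for: 2

print(folded, unfolded)  # the values differ, so LLVM must keep both shifts
```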
At this point, I have some doubts that this is the right approach. Overall, we would end up with something slightly less accurate and slightly slower (in the general case).
While I understand that the point of an intrinsic is to be as general as possible, in this case I think specialization should be favored for the sake of performance and accuracy.
What I mean is that instead of introducing the general `q_mul` intrinsic, we could stick with a more specific intrinsic that does the multiplication and the shift together. This would be similar to the initial proposal, but I can introduce the q-ness as a further input to the intrinsic, in order to generalize it to all q-numbers.
The other alternative can still be the full implementation in TOPI. The reasons I would prefer the intrinsic are:
- We can invoke the intrinsic within a `compute` function (as @kparzysz was mentioning)
- It is much easier to overload an intrinsic. To overload in TOPI I would need to add a compute strategy specialization for the TOPI operator, which seems a bit too much
- I think multiplying an integer `x` by `m*2^-n` (where `m` is a q-number) is still general enough to deserve its own TIR instruction
@tqchen, @anijain2305, @kparzysz what do you think?
Could you show where `nudge` is coming from?
Hi @kparzysz,
`nudge` is defined as a constant: `nudge = (1 << (right_shift - 1))`, which I use to round to nearest (`right_shift` is `-s` in `fpm(x,m,s)`). Am I missing something? I wasn't able to get the tests to pass without it.
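Just to illustrate what the nudge buys, a toy scalar example in plain Python (not the TIR):

```python
right_shift = 3
nudge = 1 << (right_shift - 1)   # 4, i.e. "0.5" in units of 2**right_shift

x = 12                                    # 12 / 8 = 1.5
truncated = x >> right_shift              # plain shift truncates -> 1
rounded   = (x + nudge) >> right_shift    # nudge makes it round to nearest -> 2

print(truncated, rounded)  # 1 2
```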
A possible solution would be to define an intrinsic like (naming TBD): `fpmq(x, y, n, s)`. In this way we can achieve `q_mul` by setting `s` to zero, but we wouldn't incur any of the previous pitfalls.
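If it helps, here is a rough Python reference model of the semantics I have in mind for `fpmq` (just my reading; `fpmq_ref` is a made-up name and the exact rounding/saturation rules would still need to be pinned down in the RFC):

```python
def fpmq_ref(x, y, n, s):
    """Hypothetical reference: multiply the integer x by the Q(n) fixed-point
    number y and apply a further shift s (negative s meaning a right shift,
    as in fpm(x, m, s) above), rounding only once at the very end."""
    right_shift = n - s  # total right shift applied to the raw product
    if right_shift <= 0:
        return (x * y) << -right_shift
    nudge = 1 << (right_shift - 1)
    return (x * y + nudge) >> right_shift

# s = 0 recovers the plain q_mul, while s < 0 folds in the rescaling, e.g.:
print(fpmq_ref(23, 1 << 30, 31, -3))  # ~ round(23 * 0.5 / 8) = 1
```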
I think that would make sense.
Ok, I amended the RFC. Please, let me know what you guys think.
A few things to mention:
- `fixed_point_multiply` is still the name of the Relay (and TOPI) op (which uses the intrinsic under the hood)
- `qmuls` is the name of the new intrinsic

Please let me also know if the naming is fine or you would prefer different names.