[RFC] Using arm intrinsics to implement fixed point multiplication in TVM

Let me thank you all for the interesting discussion.

The main reason I implemented fpm as an intrinsic was that I considered it similar to other operations like mul, add, sub, etc. (and I thought that different vendors might plug in their own intrinsics to implement fpm).

But @kparzysz, you are absolutely right. If I implement this as a TOPI operator, it won’t be usable within a compute node. I didn’t think about that.

About writing this as fpm(x,y,s), I am not sure I can do that. The aim of the fixed point multiply is to multiply an int32 number x by a floating point number expressed as int32(round(2^30*M))*2^s, where M and s are the output of [M,s] = frexp(f), f being a float32 value. In fpm(x,m,s) I expect m = round(2^30*M) and s to be the shift that comes from frexp. In other words, I expect m and s together to represent a floating point number in a given fixed point representation.
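For concreteness, here is how m and s can be derived from a float in plain Python (a sketch of the decomposition described above; the `decompose` helper name is mine, not TVM API):

```python
import math

def decompose(f):
    # f == M * 2**s with 0.5 <= |M| < 1, per frexp;
    # m = round(2**30 * M), so f is approximately m * 2**(s - 30)
    M, s = math.frexp(f)
    m = round((1 << 30) * M)
    return m, s

m, s = decompose(0.75)   # m = 805306368, s = 0
```

Reconstructing `m * 2.0 ** (s - 30)` recovers the original float up to the rounding of the Q-style mantissa.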

@tqchen, thanks for the explanation. Now everything is clear. However, we can implement the first optimization directly in Relay (if is_power_2(scale) then shift else fpm), and I guess we might implement the second optimization directly in TIR or in TOPI.

PS: I just uploaded the PR, please have a look.

What you’re describing is a multiplication of Q numbers.

If we introduce a new intrinsic, we should express it in terms of relatively well-known concepts, and Q numbers are a standard concept in fixed-point arithmetic. Even your mathematical expression shows that x and m are interchangeable, so the description should not suggest any asymmetry.

Thanks for the link @kparzysz , I didn’t know about Q numbers.

I agree on expressing the intrinsic in terms of Q numbers, but I don’t follow when you say that x and m are interchangeable.

The point is that fpm(x,m,s) multiplies x (which is not a Q number) by a Q1.30 number (I think) described by m and s.

How can I exchange m with x?

Let’s say that fmpq(x,y,n) is defined as a fixed point multiplication of two Qk.n numbers x and y. Assume that k+n is 32, so that both x and y can be represented as int32 values.

Then, the fpm(x,m,s) that you want to implement is fmpq(2*x,m,31) * 2^s. As a matter of fact, the first factor in this product is exactly sqrdmulh(x,m).

You can then introduce a new TOPI operator, call it “fixed_point_multiply_and_scale” (or some better name), and implement it using the fmpq intrinsic plus the scaling by 2^s.

Finally, you can “realize” that the original goal of multiplying an integer by a normalized floating point value is equivalent to the “fixed_point_multiply_and_scale”. This solves the original problem, and introduces a general TIR intrinsic for Q-number multiplication.
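A quick numeric sketch of the identity above, with fmpq modelled in plain Python as a rounding right shift of the full product (the doubling of x and the Q encoding of m follow the description; everything here is illustrative, not TVM API):

```python
def fmpq(x, y, n):
    # fixed-point multiply of two Q numbers with n fractional bits:
    # full-width product, round to nearest, arithmetic shift right by n
    return (x * y + (1 << (n - 1))) >> n

# Multiply x = 1000 by f = 0.75, i.e. M = 0.75, s = 0, m = round(2**30 * M)
x = 1000
m = round(0.75 * (1 << 30))        # 805306368
result = fmpq(2 * x, m, 31)        # the 2^s scaling is a no-op for s = 0
# result == 750, i.e. 1000 * 0.75
```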

It makes sense now, thanks a lot @kparzysz !

fpmq would be my intrinsic, and it does the fixed point multiplication:

def fixed_point_multiply(x, y, n):
    x = cast(x, 'int64') * y              # widen to avoid overflow
    pos_rounding_value = 1 << (n - 1)     # half of 2**n, to round to nearest
    x = x + pos_rounding_value
    x = x >> n
    return cast(x, 'int32')

I call it from the TOPI operator, which I can overload for the arm target to use the arm intrinsics.

However, I am slightly worried about performance, because in the default non-arm case I would do two shifts (by n and by s) instead of combining everything into a single shift by n+s - which is called total_right_shift in the original code.

What do you think?

The two shifts most likely get folded into one in the final code. LLVM will do that for sure. Also, keeping these shifts separate initially could help other targets that have instructions for fixed point multiplication.
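For pure shifts the folding is indeed an identity; a quick illustrative check in Python (whose `>>` is an arithmetic shift, like sshr):

```python
# Two consecutive arithmetic right shifts equal one combined shift,
# as long as nothing (cast, rounding add) sits between them.
for v in (1234567, -1234567, 0, 1):
    for n in range(1, 8):
        for s in range(0, 8):
            assert (v >> n) >> s == v >> (n + s)
```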

Right, because those are all constants. Thanks @kparzysz , I like this design!

@tqchen, @anijain2305 what do you think? I would like to agree on a given direction and then I can go on and apply those changes

fmpq intrinsic sounds good. It would also be good to look up precedents for the intrinsic name and discuss a few naming alternatives.

Unfortunately I wasn’t able to find many quantized intrinsics around (most existing ones are for floating point).

Anyway, for the intrinsic I was thinking of something like:

  • mulq
  • mul_q
  • qmul
  • q_mul

I would say to give the TOPI operator and the Relay operator the same name, which could be picked from:

  • fixed_point_multiply (as before)
  • fixed_point_multiply_and_scale
  • fixed_point_rescale

I am not extremely good with names, so if you come up with better ideas, I am all ears :slight_smile:

cc: @tqchen, @kparzysz, @anijain2305

Hi all,

I had a go at it, but there is a small issue with accuracy.

The way it was before, everything was done in int64, shifting and casting at the very end.

Now the qmul intrinsic introduces a shift (losing some precision) and a cast to int32, and then I shift further in order to scale (plus round up).

In floating point, this is the equivalent of doing round(round(x*y)/scale) instead of round(x*y/scale).

For some edge cases, e.g. 23/16, this produces the wrong value:

  • round(round(23*0.5)/8) = 2
  • round(23*0.5/8) = 1

This is the exact same rounding error I would introduce through the arm intrinsics, and it should match TFLite. So I think it should be fine, although this is something we have to decide (because I would need to add a +1/-1 tolerance to the tests).
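The edge case above can be checked in plain Python. I model the "+ nudge then shift" scheme as round-half-up rather than using Python's `round` (which rounds halves to even); the helper is only illustrative:

```python
import math

def round_half_up(v):
    # mimics the (x + (1 << (n - 1))) >> n rounding: halves go up
    return math.floor(v + 0.5)

x, y, scale = 23, 0.5, 8
twice_rounded = round_half_up(round_half_up(x * y) / scale)  # rounds 11.5 -> 12, then 1.5 -> 2
once_rounded = round_half_up(x * y / scale)                  # rounds 1.4375 -> 1
```

The two results differ by one, which is exactly the +1/-1 tolerance mentioned above.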

The alternatives to this are:

  • going back to the previous design (where everything was done within the intrinsic)
  • moving everything in TOPI (without adding the intrinsic).

What do you think? @tqchen @anijain2305 @kparzysz

Hi @tqchen, @kparzysz, @anijain2305 Any more thoughts on this?

Thanks

I think it is fine.


As long as the behavior is well documented, I agree that it is fine.


Sorry for the late reply. I think it is fine too with some minor suggestions.

I would encourage ensuring that things work well on non-ARM platforms as well. We want a Relay operator to produce the same output when compiled/executed for ARM and non-ARM machines. Secondly, we want to ensure that this does not severely degrade performance on a non-ARM machine.

The reason I ask is that we are doing two roundings here, and we are thinking backwards from the ARM instruction standpoint when writing the TVM compute (while the traditional way is to first write the TVM compute and then find suitable instructions). A non-ARM machine might not have those instructions, and two roundings can degrade performance (hopefully minimally).

Therefore, I ask that we ensure good performance on non-ARM machines while maintaining accuracy. Similar to TFLite, there has been some x86 work to match MKLDNN performance, and we should ensure that the rough numbers remain the same.

So, I did a bit of investigation, and @anijain2305 I think you are right.

In my (general) TOPI operator, I do the following:

val = x(*indices)
val = tvm.tir.if_then_else(left_shift > 0, val << left_shift, val)
mulq = tvm.tir.fixed_point_multiply(val, multiplier, 31)
return (mulq + nudge) >> right_shift

My hope (as @kparzysz was suggesting) was that the last shift, being a constant, would have been folded with the shift in fixed_point_multiply. However, this doesn’t seem to be the case.

Indeed, when I look at the generated assembly (without using the version overloaded with the arm intrinsic), I see an additional sshr vX.4s, vX.4s, #const instruction, which is exactly the last rounding shift I am doing.

If I remove the last shift and only return mulq, the sshr instruction disappears. I think LLVM can fold two consecutive shifts, but not a shift-cast-add-shift sequence. Even if I remove the cast, there is always an add in the middle (which I don't think I can remove) that makes the shift fusion impossible.

At this point, I have some doubts that this is the right approach. Overall, we would end up with something slightly less accurate and slightly slower (in the general case).

While I understand that an intrinsic should be as general as possible, in this case I think specialization should be favored for the sake of performance and accuracy.

What I mean is that instead of introducing the general q_mul intrinsic, we could stick with a more specific intrinsic that does the multiplication and the shift together.

This would be similar to the initial proposal, but I can introduce the Q-ness as a further input to the intrinsic, in order to generalize it to all Q numbers.

The other alternative is still the full implementation in TOPI. The reasons I would prefer the intrinsic are:

  1. We can invoke the intrinsic within a compute function (as @kparzysz was mentioning)
  2. It is way easier to overload an intrinsic. To overload in TOPI I would need to add a compute strategy specialization for the TOPI operator, which seems a bit too much
  3. I think multiplying an integer x by m*2^-n (where m is a Q number) is still general enough to deserve its own TIR intrinsic.

@tqchen, @anijain2305, @kparzysz what do you think?

Could you show where nudge is coming from?

Hi @kparzysz,

nudge is defined as a constant:

nudge = (1 << (right_shift-1))

which I use to round to nearest (right_shift is -s in fpm(x,m,s)). Am I missing something? I wasn’t able to get the tests to pass without it.
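In other words, the nudge turns a truncating shift into a round-to-nearest; a minimal sketch:

```python
def rounding_right_shift(x, right_shift):
    # adding half of 2**right_shift before shifting rounds to nearest
    # (halves rounded up) instead of truncating toward negative infinity
    nudge = 1 << (right_shift - 1)
    return (x + nudge) >> right_shift

# 11 / 4 = 2.75: the plain shift 11 >> 2 gives 2, the nudged one gives 3
```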

A possible solution would be to define an intrinsic like (naming tbd):

fpmq(x,y,n,s)

In this way we can achieve q_mul by setting s to zero, but we wouldn’t incur any of the previous pitfalls.
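For illustration, a plain-Python model of the proposed combined intrinsic, assuming s <= 0 so that the 2^s scale is a right shift (the name, and the Q0.31 encoding of the multiplier in the example, are assumptions for this sketch only):

```python
def fpmq(x, y, n, s):
    # combined multiply-and-scale: one wide product, a single rounding
    # add, and one shift by (n - s); assumes s <= 0
    total_shift = n - s
    return (x * y + (1 << (total_shift - 1))) >> total_shift

# s = 0 recovers the plain q_mul behaviour:
#   fpmq(x, y, n, 0) == (x * y + (1 << (n - 1))) >> n
```

With y = 1 << 30 (0.5 in a Q0.31 encoding), fpmq(23, y, 31, -3) computes 23 * 0.5 / 8 with a single rounding and yields 1, matching the single-rounding result from the earlier edge case, while fpmq(23, y, 31, 0) yields 12.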

I think that would make sense.

Ok, I amended the RFC. Please, let me know what you guys think.

A few things to mention:

  • fixed_point_multiply is still the name of the relay (and TOPI) op (which uses the intrinsic under the hood)
  • qmuls is the name of the new intrinsic

Please also let me know whether the naming is fine or you would prefer different names.