How to get LLVM to emit `VQMOVN.s16`

The clamp op (topi.clip) does not make LLVM generate `VQMOVN.s16`.

    import tvm
    from tvm import te, topi

    # ARM NEON target (same target string as used later in this thread)
    target = 'llvm -device=arm_cpu -model=bcm2835 -mtriple=armv7a-linux-gnueabihf -mattr=+neon'

    dtype = 'int8'

    M = 224
    N = 224
    # Algorithm
    A = te.placeholder((M, M), name="A", dtype=dtype)
    B = te.placeholder((M, N), name="B", dtype=dtype)

    # widen to int16, multiply, clamp back into the int8 range, and narrow
    C = te.compute((M, N), lambda x, y: (A[x, y].astype('int16') * B[x, y].astype('int16')), name="C")
    C = topi.clip(C, tvm.tir.min_value(dtype).value, tvm.tir.max_value(dtype).value).astype(dtype)

    s = te.create_schedule(C.op)
    print(tvm.lower(s, [A, B], simple_mode=True))
    func = tvm.build(s, [A, B, C], target=target, name="scale_one")
    print(func.get_source('asm'))

Generated assembly code:

	vld1.64	{d20, d21}, [r1:128]
	vmin.s16	q10, q10, q8
	vmax.s16	q10, q10, q9
	vst1.64	{d20, d21}, [r1:128]

Because clip's definition is

    def _compute(*indices):
        value = x(*indices)
        const_min = tvm.tir.const(a_min, value.dtype)
        const_max = tvm.tir.const(a_max, value.dtype)
        return tvm.te.max(tvm.te.min(value, const_max), const_min)

so we generate vmax and vmin from the max and min ops.
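
As a side note, here is a minimal sketch (hypothetical shape and names) that prints the expression `topi.clip` builds for one element, showing the `max(min(...))` pattern that later lowers to `vmax`/`vmin`:

    import tvm
    from tvm import te, topi

    x = te.placeholder((8,), name="x", dtype="int16")
    y = topi.clip(x, -128, 127)
    # the compute body is the clamped expression, e.g. max(min(x[i0], 127), -128)
    print(y.op.body[0])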

However, the latest LLVM can optimize it. I mean LLVM 11. You can add the flag -mattr=+mve. I tested it, and the VQMOVN instruction is generated correctly, but you should make sure your ARM CPU supports this instruction.
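
For reference, the flag is simply appended to the -mattr list of the TVM target string (a sketch; the triple is a placeholder for your own target):

    # sketch: pass the MVE feature through the TVM target's -mattr option
    target = 'llvm -device=arm_cpu -mtriple=<your-triple> -mattr=+mve'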


With the flag -mattr=+mve, LLVM generates VQMOVNB.S16 instead of VQMOVN.S16.

VQMOVNB.S16 is not supported by my target device.

My compile target definition:

    target_t = 'llvm -device=arm_cpu -model=bcm2835 -mtriple=armv7a-linux-gnueabihf -mattr=+neon,+mve'

--------------------------- split line -----------------------

My main goal is to execute quantized models efficiently.

The output of `print(tvm.lower(s, [A, B], simple_mode=True))`:

    primfn(A_1: handle, B_1: handle) -> ()
      attr = {"global_symbol": "main", "tir.noalias": True}
      buffers = {B: Buffer(B_2: Pointer(int8), int8, [224, 224], []),
                 A: Buffer(A_2: Pointer(int8), int8, [224, 224], [])}
      buffer_map = {A_1: A, B_1: B} {
      attr [C: Pointer(int16)] "storage_scope" = "global";
      allocate(C, int16, [50176]);
      attr [compute: Pointer(int8)] "storage_scope" = "global";
      allocate(compute, int8, [50176]) {
        for (x: int32, 0, 224) {
          for (y: int32, 0, 224) {
            C[((x*224) + y)] = (cast(int16, (int8*)A_2[((x*224) + y)])*cast(int16, (int8*)B_2[((x*224) + y)]))
          }
        }
        for (i0: int32, 0, 224) {
          for (i1: int32, 0, 224) {
            C[((i0*224) + i1)] = max(min((int16*)C[((i0*224) + i1)], 127i16), -128i16)
          }
        }
        for (i0_1: int32, 0, 224) {
          for (i1_1: int32, 0, 224) {
            compute[((i0_1*224) + i1_1)] = cast(int8, (int16*)C[((i0_1*224) + i1_1)])
          }
        }
      }
    }

The generated assembly code (now with VQMOVNB.S16) still has three loops, and it allocates a large buffer for temporary data.

How can I merge the three loops into one?

Assembly code I want:

	vld1.8			{d16}, [r9:64], r1
	vld1.8			{d18}, [r4:64], r1
	vmull.s8		q1, d18, d16
	vqmovn.s16		d18, q1
	vst1.8			{d18}, [r5:64], lr

I will point out that MVE and Neon will never be present on the same CPU implementation; using them together in the same output object file, and indeed in the same function, is not correct.

MVE belongs to the M-profile variant of the AArch32 ISA, i.e. micro-controllers, while Neon (Advanced SIMD) belongs to the A-profile variant of the AArch32 ISA.
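
In practice this means choosing only one of the two features in the TVM target string; for an A-profile core, keep +neon and drop +mve (a sketch that mirrors the target string used later in this thread):

    # A-profile (Cortex-A) target: keep +neon and drop +mve
    target_t = 'llvm -device=arm_cpu -model=bcm2835 -mtriple=armv7a-linux-gnueabihf -mattr=+neon'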

Ramana

Thanks for correcting me.

Where can I find full documentation of the LLVM neon and mve attrs?
I have searched countless times and still have no clue.

I finally figured out a working solution:
modify LLVM to optimize truncate(smax(smin(a, b), c)) ---> vqmovns (the signed saturating narrow).

TVM Python script:

    import tvm
    from tvm import te, tir, topi

    dtype = 'uint8'
    ddtype = 'uint16'

    M = 224
    N = 224
    # Algorithm
    A = te.placeholder((M, M), name="A", dtype=dtype)
    B = te.placeholder((M, N), name="B", dtype=dtype)
    min_v = tir.min_value(dtype).value
    max_v = tir.max_value(dtype).value
    # widen, multiply, clamp back into range, and narrow
    C = te.compute((M, N), lambda x, y: A[x, y].astype(ddtype) * B[x, y].astype(ddtype), name="C")

    C_clip = topi.clip(C, min_v, max_v)
    C_int8 = C_clip.astype(dtype)

    s = te.create_schedule(C_int8.op)
    # inline the intermediate stages so everything fuses into a single loop nest
    s[C].compute_inline()
    s[C_clip].compute_inline()

    # vectorize the inner axis with 8 lanes so LLVM can use the NEON d registers
    yo, yi = s[C_int8].split(C_int8.op.axis[1], 8)
    s[C_int8].vectorize(yi)

    print(tvm.lower(s, [A, B, C_int8], simple_mode=True))
    target_t = 'llvm -device=arm_cpu -model=bcm2835 -mtriple=armv7a-linux-gnueabihf -mattr=+neon'

    with tvm.transform.PassContext(opt_level=3):
        func = tvm.build(s, [A, B, C_int8], target=target_t, name="scale_one")

    print(func.get_source('asm'))
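
As a quick sanity check (a sketch; it assumes the patched LLVM described above), one can confirm from the build output that a saturating narrow instruction was actually emitted:

    # sketch: look for a saturating narrow (vqmovn / vqmovun, depending on
    # signedness) in the generated assembly
    asm = func.get_source('asm')
    print('vqmov' in asm.lower())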