With the flag -mattr=+mve, the compiler generates VQMOVNB.S16 instead of VQMOVN.S16.
VQMOVNB.S16 is not supported by my target device.
My compile target definition:
target_t = 'llvm -device=arm_cpu -model=bcm2835 -mtriple=armv7a-linux-gnueabihf -mattr=+neon,+mve'
--------------------------- split line -----------------------
My main goal is to execute a quantized model efficiently.
Output of print(tvm.lower(s, [A, B], simple_mode=True)):
primfn(A_1: handle, B_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(int8), int8, [224, 224], []),
             A: Buffer(A_2: Pointer(int8), int8, [224, 224], [])}
  buffer_map = {A_1: A, B_1: B} {
  attr [C: Pointer(int16)] "storage_scope" = "global";
  allocate(C, int16, [50176]);
  attr [compute: Pointer(int8)] "storage_scope" = "global";
  allocate(compute, int8, [50176]) {
    for (x: int32, 0, 224) {
      for (y: int32, 0, 224) {
        C[((x*224) + y)] = (cast(int16, (int8*)A_2[((x*224) + y)])*cast(int16, (int8*)B_2[((x*224) + y)]))
      }
    }
    for (i0: int32, 0, 224) {
      for (i1: int32, 0, 224) {
        C[((i0*224) + i1)] = max(min((int16*)C[((i0*224) + i1)], 127i16), -128i16)
      }
    }
    for (i0_1: int32, 0, 224) {
      for (i1_1: int32, 0, 224) {
        compute[((i0_1*224) + i1_1)] = cast(int8, (int16*)C[((i0_1*224) + i1_1)])
      }
    }
  }
}
The generated assembly code with VQMOVNB.S16 still has three loops, and it allocates a large buffer for temporary data.
How can I merge the three loops into one?
The assembly code I want:
	vld1.8			{d16}, [r9:64], r1
	vld1.8			{d18}, [r4:64], r1
	vmull.s8		q1, d18, d16
	vqmovn.s16		d18, q1
	vst1.8			{d18}, [r5:64], lr
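
To make the intended result concrete, here is a minimal plain-Python sketch of the per-element arithmetic. It is only a reference model, not a TVM schedule: the helper names (saturate_int8, three_pass, fused) and the sample values are made up for illustration. three_pass mirrors the three loops in the lowered IR above (widening multiply, clip, narrowing cast), and fused does the same work in one pass without a temporary buffer, which is what the vmull.s8 + vqmovn.s16 sequence does per lane.

```python
def saturate_int8(value):
    """Clamp an integer to the int8 range, like the min/max stage in the IR."""
    return max(min(value, 127), -128)


def three_pass(a, b):
    """Mirror the lowered IR: three full passes over a temporary buffer."""
    c = [x * y for x, y in zip(a, b)]      # int8 * int8, widened to int16
    c = [saturate_int8(v) for v in c]      # clip to [-128, 127]
    return [int(v) for v in c]             # cast to int8 (lossless after the clip)


def fused(a, b):
    """Single pass, no temporary buffer: multiply and saturate per element."""
    return [saturate_int8(x * y) for x, y in zip(a, b)]


# Hypothetical sample inputs covering both saturation directions.
a = [-128, -3, 0, 7, 127, 100]
b = [-128, 50, 9, 7, 127, -2]

assert three_pass(a, b) == fused(a, b)
print(fused(a, b))  # [127, -128, 0, 49, 127, -128]
```

Both versions give the same output, so fusing the loops should not change the results; it only removes the intermediate int16 buffer.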