With the flag -mattr=+mve, the generated code uses VQMOVNB.S16 instead of VQMOVN.S16.
VQMOVNB.S16 is not supported by my target device.
My compile target definition:
target_t = 'llvm -device=arm_cpu -model=bcm2835 -mtriple=armv7a-linux-gnueabihf -mattr=+neon,+mve'
--------------------------- split line -----------------------
My main goal is to execute a quantized model efficiently.
print(tvm.lower(s, [A, B], simple_mode=True))
output:
primfn(A_1: handle, B_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(int8), int8, [224, 224], []),
             A: Buffer(A_2: Pointer(int8), int8, [224, 224], [])}
  buffer_map = {A_1: A, B_1: B} {
  attr [C: Pointer(int16)] "storage_scope" = "global";
  allocate(C, int16, [50176]);
  attr [compute: Pointer(int8)] "storage_scope" = "global";
  allocate(compute, int8, [50176]) {
    for (x: int32, 0, 224) {
      for (y: int32, 0, 224) {
        C[((x*224) + y)] = (cast(int16, (int8*)A_2[((x*224) + y)])*cast(int16, (int8*)B_2[((x*224) + y)]))
      }
    }
    for (i0: int32, 0, 224) {
      for (i1: int32, 0, 224) {
        C[((i0*224) + i1)] = max(min((int16*)C[((i0*224) + i1)], 127i16), -128i16)
      }
    }
    for (i0_1: int32, 0, 224) {
      for (i1_1: int32, 0, 224) {
        compute[((i0_1*224) + i1_1)] = cast(int8, (int16*)C[((i0_1*224) + i1_1)])
      }
    }
  }
}
The generated assembly code with VQMOVNB.S16 still has three loops, and it allocates a large buffer for temporary data.
How can I merge the three loops into one?
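To make the requested fusion concrete, here is a plain-Python sketch of the same arithmetic on flat lists. It is only a reference model, not TVM code: the names saturate_int8, three_pass, and fused are made up for illustration. It shows that a single loop gives the same result as the three loops in the lowered output, without the full-size int16 temporary.

```python
def saturate_int8(v):
    """Clamp an integer to the int8 range [-128, 127]."""
    return max(min(v, 127), -128)


def three_pass(a, b):
    """Mirror the lowered output: multiply, clamp, and narrow in separate loops."""
    n = len(a)
    c = [0] * n                  # full-size temporary, plays the role of C (int16)
    for i in range(n):
        c[i] = a[i] * b[i]       # an int8 * int8 product always fits in int16
    for i in range(n):
        c[i] = saturate_int8(c[i])
    out = [0] * n
    for i in range(n):
        out[i] = c[i]            # the cast to int8 is lossless after the clamp
    return out


def fused(a, b):
    """Single loop: multiply, clamp, and narrow per element, no temporary buffer."""
    return [saturate_int8(x * y) for x, y in zip(a, b)]


a = [-128, -1, 0, 5, 127, 127]
b = [-128, 127, 9, 20, 127, -128]
assert three_pass(a, b) == fused(a, b)
```

Because each output element depends only on the input elements at the same index, the intermediate stages need not be materialized as whole buffers.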
Assembly code I want:
vld1.8 {d16}, [r9:64], r1
vld1.8 {d18}, [r4:64], r1
vmull.s8 q1, d18, d16
vqmovn.s16 d18, q1
vst1.8 {d18}, [r5:64], lr
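For readers less familiar with NEON, the following is a rough Python model of what one iteration of the listing above computes, assuming eight int8 lanes per D register. The Python functions vmull_s8 and vqmovn_s16 are illustrative stand-ins named after the instructions, not real intrinsics or TVM APIs, and multiply_saturate is a hypothetical driver.

```python
LANES = 8  # a 64-bit D register holds eight int8 lanes


def vmull_s8(d_a, d_b):
    """Model of vmull.s8: widening multiply, int8 lanes in, int16 lanes out."""
    return [x * y for x, y in zip(d_a, d_b)]


def vqmovn_s16(q):
    """Model of vqmovn.s16: saturating narrow from int16 lanes to int8 lanes."""
    return [max(min(v, 127), -128) for v in q]


def multiply_saturate(a, b):
    """Process the inputs eight elements at a time, one vector per iteration."""
    assert len(a) == len(b) and len(a) % LANES == 0
    out = []
    for i in range(0, len(a), LANES):
        d_a = a[i:i + LANES]            # vld1.8
        d_b = b[i:i + LANES]            # vld1.8
        q = vmull_s8(d_a, d_b)          # vmull.s8
        out.extend(vqmovn_s16(q))       # vqmovn.s16, then vst1.8
    return out


row_a = [127, -128, 3, -4, 0, 100, -100, 1]
row_b = [127, 127, 5, 6, 9, 2, 2, -1]
result = multiply_saturate(row_a, row_b)
```

The saturating narrow does the clamp and the int16-to-int8 conversion in one step, which is why the desired listing needs neither separate min/max instructions nor a temporary buffer.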