Error: Cannot select intrinsic

I’m trying to use LLVM intrinsics on a Raspberry Pi 4, but I get the error LLVM ERROR: Cannot select: intrinsic %llvm.arm.neon.vpadals when trying to build. Here is a test file:

import tvm
import tvm.auto_scheduler
from tvm.tir.ir_builder import create
import numpy as np

def intrin():
    def f(out):
        ib = create()
        r = tvm.tir.call_llvm_pure_intrin("int32x4", "llvm.arm.neon.vpadals.v4i32.v8i16", tvm.tir.const(2, "uint32"), tvm.tir.const(1, "int32x4"), tvm.tir.const(0, "int16x8"))
        ib.emit(out.vstore(0, r))
        return ib.get()
    out = tvm.tir.decl_buffer((4,), name="out", dtype="int32")
    return tvm.te.extern([out.shape], [], lambda ins, outs: f(outs[0]), out_buffers=[out], name="intrin", dtype="int32")

if __name__ == "__main__":
    target = tvm.target.Target("llvm -device=arm_cpu -model=bcm2711 -mtriple=aarch64-linux-gnu -mattr=+neon -mcpu=cortex-a72")
    out = intrin()
    s = tvm.te.create_schedule(out.op)
    f = tvm.build(s, [out], target=target)
    # The output buffer is declared as int32, so pass an int32 array.
    f(tvm.nd.array(np.zeros((4,), dtype="int32")))

I believe the vpadals instruction is available because I can compile and run a file using it from C:

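#include <arm_neon.h>
#include <stdio.h>

int main(void) {
	int16x4_t a = {0, 1, 2, 3};
	int8x8_t b = {4, 5, 6, 7};  /* remaining lanes are zero-initialized */
	int16x4_t r = vpadal_s8(a, b);  /* r[i] = a[i] + b[2i] + b[2i+1] */
	/* printf cannot print a vector directly, so print each lane */
	printf("%d %d %d %d\n", vget_lane_s16(r, 0), vget_lane_s16(r, 1),
	       vget_lane_s16(r, 2), vget_lane_s16(r, 3));
	return 0;
}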

Any ideas why this error occurs?

@kparzysz it seems like you have experience with ARM and LLVM, maybe you can provide some insight?

If this is really running AArch64 on the Raspberry Pi 4 (i.e. uname shows aarch64-linux on the command line), see if using llvm.aarch64.neon.saddlp.v4i16.v8i8 helps. I’m not an LLVM expert, but I think the intrinsic you are using is only suitable for AArch32: the AArch32 and AArch64 ISAs are handled by different backends.

Further, I think the return type for vpadal_s8 would be int16x4_t, so mapping it to int32x4 seems a bit off on first reading.

Ramana

saddlp works, thanks Ramana! It seems saddlp is a slightly different instruction from vpadals. I wonder why vpadals is not available even though it is listed as available for A64 in Arm’s documentation.
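To make the difference concrete, here is a minimal plain-Python sketch of the lane arithmetic, not TVM or LLVM code. It assumes the usual semantics: saddlp sums adjacent lanes of a single operand, while vpadal also adds the result onto an accumulator operand. The function names are just for illustration, and wraparound at the lane width is ignored.

```python
def saddlp(b):
    """Signed add long pairwise: sum adjacent lanes, no accumulator."""
    if len(b) % 2 != 0:
        raise ValueError("expected an even number of lanes")
    return [b[i] + b[i + 1] for i in range(0, len(b), 2)]


def vpadal(a, b):
    """Pairwise add long and accumulate: a[i] + b[2i] + b[2i+1]."""
    pairs = saddlp(b)
    if len(a) != len(pairs):
        raise ValueError("accumulator must have half as many lanes as b")
    return [acc + p for acc, p in zip(a, pairs)]


a = [0, 1, 2, 3]
b = [4, 5, 6, 7, 0, 0, 0, 0]
print(saddlp(b))     # [9, 13, 0, 0]
print(vpadal(a, b))  # [9, 14, 2, 3]
```

So an accumulating pairwise add can be expressed as saddlp followed by an ordinary vector add of the accumulator.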

vpadals is not an AArch64 intrinsic (at least not in LLVM). For AArch64 LLVM has <4 x i32> @llvm.aarch64.neon.saddlp.v4i32.v8i16(<8 x i16>).
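The two type suffixes in the name appear to follow from the operand: the result has half as many lanes, each twice as wide, and the result type comes first. A small sketch of that naming pattern, based only on the two names mentioned in this thread (saddlp_intrinsic_name is a hypothetical helper, not part of TVM or LLVM):

```python
def saddlp_intrinsic_name(in_lanes, in_bits):
    """Build the saddlp intrinsic name for an operand of in_lanes x i<in_bits>."""
    if in_lanes % 2 != 0:
        raise ValueError("pairwise add needs an even number of lanes")
    # Pairwise add long: half as many lanes, each twice as wide.
    out_lanes, out_bits = in_lanes // 2, in_bits * 2
    return f"llvm.aarch64.neon.saddlp.v{out_lanes}i{out_bits}.v{in_lanes}i{in_bits}"


print(saddlp_intrinsic_name(8, 16))  # llvm.aarch64.neon.saddlp.v4i32.v8i16
print(saddlp_intrinsic_name(8, 8))   # llvm.aarch64.neon.saddlp.v4i16.v8i8
```

Note that the declaration above takes a single <8 x i16> operand and has no accumulator argument.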

Sorry for bumping this topic, but I ran into the same issue described in the original post and seem to not quite understand how to implement what @ramana-arm and @kparzysz suggested earlier.

As the previous answers suggested, and since @tkonolige reported that it worked for them, I tested the code from the original post with the llvm.aarch64.neon.saddlp.v4i32.v8i16 intrinsic in place of the one used originally:

import tvm
from tvm.tir.ir_builder import create
import numpy as np

def intrin():
    def f(out):
        ib = create()
        r = tvm.tir.call_llvm_pure_intrin("int32x4", "llvm.aarch64.neon.saddlp.v4i32.v8i16", tvm.tir.const(2, "uint32"), tvm.tir.const(1, "int32x4"), tvm.tir.const(0, "int16x8"))
        ib.emit(out.vstore(0, r))
        return ib.get()
    out = tvm.tir.decl_buffer((4,), name="out", dtype="int32")
    return tvm.te.extern([out.shape], [], lambda ins, outs: f(outs[0]), out_buffers=[out], name="intrin", dtype="int32")

if __name__ == "__main__":
    target = tvm.target.arm_cpu("rasp4b64")
    out = intrin()
    s = tvm.te.create_schedule(out.op)
    f = tvm.build(s, [out], target=target)
    # The output buffer is declared as int32, so pass an int32 array.
    f(tvm.nd.array(np.zeros((4,), dtype="int32")))

But when trying to run this script, I am now running into the following error while executing the tvm.build call:

Check failed: (f) is false: Cannot find intrinsic declaration, possible type mismatch: llvm.aarch64.neon.saddlp

I am still pretty new to TVM, and especially to using intrinsics, so it would be great if someone could help me out here and briefly explain what I am doing wrong!