Does tensorize support dynamic shapes?
I have a DSA-architecture NPU backend that supports vector addition of arbitrary length.
My current implementation performs loop transformations in a pass: it splits the innermost loop so that the new inner loop has a fixed extent of 32, then applies tensorize to that loop. During codegen, the tensorized block is replaced with a call to a kernel function that performs vector addition with a vector length of 32. (A sketch of this scheduling step follows the intrinsic code below.)
However, this approach has a problem: the innermost loop can only be tensorized when its extent is exactly 32. If the input shape is, for example, 1x33, the extent is not divisible by 32; the split then produces a predicated tail block that tensorize cannot match.
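To make the failure concrete, here is a minimal sketch. The workload `vector_add_33` is made up for illustration, and I flatten the 1x33 case to a length-33 vector:

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def vector_add_33(
    A: T.Buffer((33,), "float32"),
    B: T.Buffer((33,), "float32"),
    C: T.Buffer((33,), "float32"),
) -> None:
    for i in T.serial(33):
        with T.block("C"):
            vi = T.axis.remap("S", [i])
            C[vi] = A[vi] + B[vi]

sch = tvm.tir.Schedule(vector_add_33)
(i,) = sch.get_loops(sch.get_block("C"))
# 33 is not divisible by 32, so split guards the inner block with a
# predicate, T.where(io * 32 + ii < 33).
io, ii = sch.split(i, factors=[None, 32])
sch.mod.show()
# The predicated block no longer matches the intrinsic description,
# so sch.tensorize(ii, "add_mm") fails with a ScheduleError.
```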
Are there alternative approaches to solving this problem?
```python
from tvm.script import tir as T
from tvm.tir import TensorIntrin

MAX_SIZE = 32


# Description: the loop-nest pattern that tensorize matches against.
@T.prim_func
def add(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (MAX_SIZE,), offset_factor=1)
    B = T.match_buffer(b, (MAX_SIZE,), offset_factor=1)
    C = T.match_buffer(c, (MAX_SIZE,), offset_factor=1)
    with T.block("root"):
        T.reads(A[0:MAX_SIZE], B[0:MAX_SIZE])
        T.writes(C[0:MAX_SIZE])
        for i in T.serial(MAX_SIZE):
            with T.block("add"):
                vi = T.axis.remap("S", [i])
                C[vi] = A[vi] + B[vi]


# Implementation: the matched region is replaced by this packed call,
# which codegen lowers to the NPU vector-add kernel.
@T.prim_func
def add_impl(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (MAX_SIZE,), offset_factor=1)
    B = T.match_buffer(b, (MAX_SIZE,), offset_factor=1)
    C = T.match_buffer(c, (MAX_SIZE,), offset_factor=1)
    with T.block("root"):
        T.reads(A[0:MAX_SIZE], B[0:MAX_SIZE])
        T.writes(C[0:MAX_SIZE])
        T.evaluate(
            T.call_packed(
                "add_mm",
                C.data, C.elem_offset,
                B.data, B.elem_offset,
                A.data, A.elem_offset,
                MAX_SIZE,
            )
        )


ADD_MM = "add_mm"
TensorIntrin.register(ADD_MM, add, add_impl)
```
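For context, this is roughly how my pass applies the intrinsic once it is registered. The workload `vector_add` and its length-64 shape are stand-ins; the real split and tensorize calls happen inside the pass:

```python
import tvm

@T.prim_func
def vector_add(
    A: T.Buffer((64,), "float32"),
    B: T.Buffer((64,), "float32"),
    C: T.Buffer((64,), "float32"),
) -> None:
    for i in T.serial(64):
        with T.block("C"):
            vi = T.axis.remap("S", [i])
            C[vi] = A[vi] + B[vi]

sch = tvm.tir.Schedule(vector_add)
(i,) = sch.get_loops(sch.get_block("C"))
# 64 is divisible by 32, so the new inner loop has a fixed extent of 32 ...
io, ii = sch.split(i, factors=[None, MAX_SIZE])
# ... and can be replaced by the registered intrinsic; codegen later turns
# the packed call into the NPU vector-add kernel.
sch.tensorize(ii, ADD_MM)
```

This works whenever the innermost extent is a multiple of 32; the 1x33 case above is exactly where it breaks down.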