Does tensorize support dynamic shapes?
I have a DSA-architecture NPU backend that supports vector addition of arbitrary length.
My current implementation performs loop transformations in a pass: it splits the innermost loop so that the new inner loop has a fixed extent of 32, then applies tensorize to that loop. During codegen, the tensorized block is replaced with a call to a kernel function that performs vector addition with a vector length of 32. (A sketch of this scheduling step follows the intrinsic code below.)
However, this approach has a problem: the innermost loop can only be tensorized when its extent is exactly 32. If the input shape is, for example, 1x33, the extent is not divisible by 32; the split then produces a predicated tail block that tensorize cannot match.
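To make the failure concrete, here is a minimal sketch. The workload `vector_add_33` is made up for illustration, and I flatten the 1x33 case to a length-33 vector:

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def vector_add_33(
    A: T.Buffer((33,), "float32"),
    B: T.Buffer((33,), "float32"),
    C: T.Buffer((33,), "float32"),
) -> None:
    for i in T.serial(33):
        with T.block("C"):
            vi = T.axis.remap("S", [i])
            C[vi] = A[vi] + B[vi]

sch = tvm.tir.Schedule(vector_add_33)
(i,) = sch.get_loops(sch.get_block("C"))
# 33 is not divisible by 32, so split guards the inner block with a
# predicate, T.where(io * 32 + ii < 33).
io, ii = sch.split(i, factors=[None, 32])
sch.mod.show()
# The predicated block no longer matches the intrinsic description,
# so sch.tensorize(ii, "add_mm") fails with a ScheduleError.
```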
Are there alternative approaches to solving this problem?
```python
from tvm.script import tir as T
from tvm.tir import TensorIntrin

MAX_SIZE = 32


# Description: the loop-nest pattern that tensorize matches against.
@T.prim_func
def add(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (MAX_SIZE,), offset_factor=1)
    B = T.match_buffer(b, (MAX_SIZE,), offset_factor=1)
    C = T.match_buffer(c, (MAX_SIZE,), offset_factor=1)
    with T.block("root"):
        T.reads(A[0:MAX_SIZE], B[0:MAX_SIZE])
        T.writes(C[0:MAX_SIZE])
        for i in T.serial(MAX_SIZE):
            with T.block("add"):
                vi = T.axis.remap("S", [i])
                C[vi] = A[vi] + B[vi]


# Implementation: the matched region is replaced by this packed call,
# which codegen lowers to the NPU vector-add kernel.
@T.prim_func
def add_impl(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (MAX_SIZE,), offset_factor=1)
    B = T.match_buffer(b, (MAX_SIZE,), offset_factor=1)
    C = T.match_buffer(c, (MAX_SIZE,), offset_factor=1)
    with T.block("root"):
        T.reads(A[0:MAX_SIZE], B[0:MAX_SIZE])
        T.writes(C[0:MAX_SIZE])
        T.evaluate(
            T.call_packed(
                "add_mm",
                C.data, C.elem_offset,
                B.data, B.elem_offset,
                A.data, A.elem_offset,
                MAX_SIZE,
            )
        )


ADD_MM = "add_mm"
TensorIntrin.register(ADD_MM, add, add_impl)
```
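For context, this is roughly how my pass applies the intrinsic once it is registered. The workload `vector_add` and its length-64 shape are stand-ins; the real split and tensorize calls happen inside the pass:

```python
import tvm

@T.prim_func
def vector_add(
    A: T.Buffer((64,), "float32"),
    B: T.Buffer((64,), "float32"),
    C: T.Buffer((64,), "float32"),
) -> None:
    for i in T.serial(64):
        with T.block("C"):
            vi = T.axis.remap("S", [i])
            C[vi] = A[vi] + B[vi]

sch = tvm.tir.Schedule(vector_add)
(i,) = sch.get_loops(sch.get_block("C"))
# 64 is divisible by 32, so the new inner loop has a fixed extent of 32 ...
io, ii = sch.split(i, factors=[None, MAX_SIZE])
# ... and can be replaced by the registered intrinsic; codegen later turns
# the packed call into the NPU vector-add kernel.
sch.tensorize(ii, ADD_MM)
```

This works whenever the innermost extent is a multiple of 32; the 1x33 case above is exactly where it breaks down.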