See the case below. We want to do a vector add in the UB buffer, so we first copy the data into UB, do the vector add there, and then copy the result back to global memory.
import tvm

def vec_add():
    n = 99991
    A = tvm.placeholder((n,), name='A')
    B = tvm.placeholder((n,), name='B')
    T = tvm.compute((n,), lambda i: A[i] + B[i], name="T")
    Ab = tvm.decl_buffer(A.shape, A.dtype, name='A')
    Bb = tvm.decl_buffer(B.shape, B.dtype, name='B')
    Tb = tvm.decl_buffer(T.shape, T.dtype, name='T')
    s = tvm.create_schedule(T.op)
    # Stage the inputs into UB and compute the result in UB before writing it back
    A1 = s.cache_read(A, "local.UB", [T])
    B1 = s.cache_read(B, "local.UB", [T])
    T1 = s.cache_write(T, "local.UB")
    # Split by 4; n = 99991 is not divisible by 4
    xo, xi = s[T].split(T.op.axis[0], 4)
    bounds = tvm.schedule.InferBound(s)
    stmt = tvm.schedule.ScheduleOps(s, bounds)
The problem is this: when A and B are large arrays, bigger than the UB, we have to split the schedule. That works fine when the length divides evenly by the split factor, e.g. 1024, but for a prime length like 99991 we get `if` statements in the IR, and those cause instruction translation to fail for things like tensorize and DMA copy.
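To make the issue concrete, here is a minimal plain-Python sketch of what the split does to the loop nest. It is not TVM code and does not reproduce the actual IR. The helper names `split_tiles` and `guarded_iterations` are made up for illustration, and the guard `xo * factor + xi < n` is my assumption about the shape of the boundary check that shows up as the `if`.

```python
def split_tiles(n, factor):
    """Return (full_tiles, tail) when an axis of length n is split by factor."""
    return divmod(n, factor)


def guarded_iterations(n, factor):
    """Emulate the split loop nest, counting iterations the guard lets through."""
    outer = (n + factor - 1) // factor  # ceil(n / factor) outer iterations
    visited = 0
    skipped = 0
    for xo in range(outer):
        for xi in range(factor):
            # Assumed form of the boundary check for a non-divisible length
            if xo * factor + xi < n:
                visited += 1
            else:
                skipped += 1
    return outer, visited, skipped


if __name__ == "__main__":
    print(split_tiles(1024, 4))          # no tail, so no guard is needed
    print(split_tiles(99991, 4))         # non-zero tail, so the last tile is partial
    print(guarded_iterations(99991, 4))
```

With n = 1024 the tail is zero, so every tile is full and no check is needed. With n = 99991 the last tile is only partly filled, and that partial tile is the one that forces the check.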
So what’s the best solution for this case?
Thanks,