How to handle partial tile

xqdan · June 30, 2018, 4:02am

Consider the case below. We want to do a vector add in the UB buffer, so we first copy the data into UB, do the vector add there, and then copy the result back to global memory.
import tvm

def vec_add():
    n = 99991
    A = tvm.placeholder((n, ), name='A')
    B = tvm.placeholder((n, ), name='B')
    T = tvm.compute((n, ), lambda i: A[i] + B[i], name="T")
    Ab = tvm.decl_buffer(A.shape, A.dtype, name='A')
    Bb = tvm.decl_buffer(B.shape, B.dtype, name='B')
    Tb = tvm.decl_buffer(T.shape, T.dtype, name='T')
    s = tvm.create_schedule(T.op)
    # Stage the inputs and the output through the on-chip UB scope.
    A1 = s.cache_read(A, "local.UB", [T])
    B1 = s.cache_read(B, "local.UB", [T])
    T1 = s.cache_write(T, "local.UB")
    # 99991 is not a multiple of 4, so the last tile is partial.
    xo, xi = s[T].split(T.op.axis[0], 4)
    bounds = tvm.schedule.InferBound(s)
    stmt = tvm.schedule.ScheduleOps(s, bounds)
    return stmt

The question is what to do when A and B are larger than the UB, so that we have to split the axis in the schedule. That works fine when the length is a multiple of the split factor, such as 1024, but for a length like the prime 99991 we get if statements in the IR, and these make instruction translation, such as tensorize and DMA copy, fail.
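To make the tail concrete, here is a minimal plain-Python sketch of the arithmetic only, not of anything TVM does internally. The helper name tile_extents is hypothetical; the factor 4 and the sizes 1024 and 99991 come from the post above.

```python
def tile_extents(n, factor):
    """Return the extent of every tile when n elements are split by factor."""
    full, tail = divmod(n, factor)
    # Every full tile has `factor` elements; a non-zero remainder adds one
    # shorter tile at the end, which is the case that needs a guard.
    return [factor] * full + ([tail] if tail else [])

even = tile_extents(1024, 4)
odd = tile_extents(99991, 4)

print(len(even), set(even))  # 256 {4}: all tiles are full
print(len(odd), odd[-1])     # 24998 3: the last tile has only 3 elements
```

Only that last 3-element tile needs the bounds check, but because the check sits inside the tiled loop body, it guards every iteration.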

So what’s the best solution for this case?

Thanks,

tqchen · June 30, 2018, 2:36pm

The best approach really depends on which instructions the target hardware provides. We should list proposals in this issue and discuss how to map the high-level primitives to solve this case. Here are a few candidates:

Predication: some processors support predication of a region, which executes the lanes in parallel while restricting the compute region with a dynamic condition.
Padding: make the schedule automatically pad up the tails.
Loop split: works well for targets such as CPU and GPU, where a scalar unit is available.
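The first two candidates can be sketched in plain Python. This is only an illustration of the idea, not TVM code and not any particular hardware's behaviour: TILE, vec_add_tile, add_padded and add_predicated are hypothetical names, and vec_add_tile stands in for a fixed-width vector instruction that accepts only full tiles.

```python
TILE = 4  # hypothetical vector width, matching the split factor in the post

def vec_add_tile(a, b):
    """Stand-in for a fixed-width instruction: accepts exactly TILE elements."""
    assert len(a) == TILE and len(b) == TILE
    return [x + y for x, y in zip(a, b)]

def add_padded(a, b):
    """Padding: round the length up to a multiple of TILE, then drop the extras."""
    n = len(a)
    padded = -(-n // TILE) * TILE          # ceiling to the next multiple of TILE
    a_pad = a + [0] * (padded - n)
    b_pad = b + [0] * (padded - n)
    out = []
    for start in range(0, padded, TILE):   # every tile is full, no guard needed
        out.extend(vec_add_tile(a_pad[start:start + TILE],
                                b_pad[start:start + TILE]))
    return out[:n]                         # discard the redundant lanes

def add_predicated(a, b):
    """Predication: every tile runs, but a per-lane mask disables the tail lanes."""
    n = len(a)
    out = [0] * n
    for start in range(0, n, TILE):
        active = min(TILE, n - start)      # dynamic condition for this tile
        for lane in range(TILE):
            if lane < active:              # lane predicate, not control flow around the op
                out[start + lane] = a[start + lane] + b[start + lane]
    return out
```

With 11 elements, add_padded computes 12 lanes and throws one away, while add_predicated issues three tiles and masks off the last lane of the third.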

xqdan · July 2, 2018, 11:20am

Conditional execution instructions are common on DSPs and CPUs, but we don’t see them on NPUs. If we had them, we would need to enhance TVM to support codegen for if statements.
Padding looks like the best way; it costs only a little redundant computation. Do you have any detailed ideas on how to do padding in TVM?
For loop split, do you mean loop peeling? We can peel off a tail for the irregular shape, so we get two stages: one that can be split further, and one that no longer needs splitting.
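The two-stage peeling described above might look like the following plain-Python sketch. It assumes nothing about how TVM would express it; add_peeled and its tile parameter are hypothetical names chosen for illustration.

```python
def add_peeled(a, b, tile=4):
    """Loop peeling: an unguarded main loop over full tiles, plus a separate tail."""
    n = len(a)
    main = n - n % tile                    # largest multiple of tile that fits in n
    out = [0] * n

    # Stage 1: full tiles only. There is no bounds check in the body, so this
    # stage could be split further or mapped to a fixed-width instruction.
    for start in range(0, main, tile):
        for lane in range(tile):
            out[start + lane] = a[start + lane] + b[start + lane]

    # Stage 2: the peeled tail, fewer than `tile` elements, handled on its own.
    for i in range(main, n):
        out[i] = a[i] + b[i]
    return out
```

For n = 99991 and tile = 4, stage 1 covers the first 99988 elements and stage 2 handles the remaining 3.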