Question about TensorExpression

I have a model that uses the “take” op in TVM. The codegen is OK, but the kernel launch fails with the following error:

Check failed: ret == 0 (-1 vs. 0) : TVMError: CUDALaunch Error: CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES grid=(98,1,1), block=(1024,1,1)

It seems each CUDA thread uses too many registers, so there are not enough registers for all the threads.

I’m trying to solve the problem by launching fewer threads and making each thread do more work. I notice that an op can be scheduled like this on GPU:

bx, tx = s[C].split(C.op.axis[0], factor=64)   # split the axis into chunks of 64
s[C].bind(bx, te.thread_axis("blockIdx.x"))    # outer part -> CUDA blocks
s[C].bind(tx, te.thread_axis("threadIdx.x"))   # inner part -> CUDA threads

This generates code where the number of threads equals the number of elements to compute, and each thread computes only one element.

How can I implement the compute with fewer threads? For example, the schedule above generates CUDA code with grid=(98,1,1), block=(1024,1,1), where each thread computes only one element.

I want a kernel with grid=(49,1,1) and block=(512,1,1), where each thread computes 4 elements.

It seems the split method can only split an axis into two parts, but I need to split it into three: the first part as a for loop inside each thread, the second bound to blockIdx.x, and the third bound to threadIdx.x.

Is that supported in TE? Or do you have any advice?

What about calling split twice?
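
Something along these lines should work (a minimal sketch with an identity compute standing in for your “take” op; the shape is hypothetical, chosen so the launch matches grid=(49,1,1), block=(512,1,1) with 4 elements per thread):

import tvm
from tvm import te

n = 49 * 512 * 4                       # hypothetical flattened size
A = te.placeholder((n,), name="A")
C = te.compute((n,), lambda i: A[i], name="C")

s = te.create_schedule(C.op)

# First split: peel off an inner serial loop of 4 elements per thread.
outer, inner = s[C].split(C.op.axis[0], factor=4)
# Second split: divide the remaining iterations into blocks of 512 threads.
bx, tx = s[C].split(outer, factor=512)

s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))
# `inner` stays unbound, so it becomes a for loop inside each thread.

print(tvm.lower(s, [A, C], simple_mode=True))

Each call to split divides one axis into two, so calling it twice gives you the three parts you described.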


Great! It works. I actually wasn’t familiar with split’s arguments. Thank you!