In many CUDA kernels, the conventional pattern for thread iteration looks like this:
    for (int i = thread_idx; i < numel; i += num_threads)
        out[i] = 0;
However, in TileLang we currently have to write:
    for i in T.serial(0, T.ceildiv(numel - thread_idx, num_threads)):
        j = thread_idx + i * num_threads
        out[j] = -1
This is cumbersome, since the trip count has to be computed by hand and every access needs an explicit index transformation. It can also increase register usage and make the index computation less efficient than a simple strided induction variable.
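To make the index arithmetic concrete, here is a minimal plain-Python sketch (no TileLang involved) that checks that the trip-count form visits exactly the same indices as the strided loop. The helper names `ceildiv`, `strided_indices`, and `serial_indices` are made up for illustration and are not part of any library.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division for a positive divisor b."""
    return -(-a // b)


def strided_indices(thread_idx: int, numel: int, num_threads: int) -> list[int]:
    """Indices visited by the CUDA-style loop: i = thread_idx; i < numel; i += num_threads."""
    return list(range(thread_idx, numel, num_threads))


def serial_indices(thread_idx: int, numel: int, num_threads: int) -> list[int]:
    """Indices visited by the workaround: a unit-step loop plus a manual index transform."""
    trip_count = ceildiv(numel - thread_idx, num_threads)
    # range() is empty for a non-positive trip count, i.e. when thread_idx >= numel.
    return [thread_idx + i * num_threads for i in range(trip_count)]


if __name__ == "__main__":
    numel, num_threads = 10, 4
    for tid in range(num_threads):
        print(tid, strided_indices(tid, numel, num_threads), serial_indices(tid, numel, num_threads))
```

Both forms produce the same index sequence; the difference is that the second one forces the user to derive the trip count and recompute `j` from `i` on every iteration, which is the bookkeeping a native step would remove.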
Introducing a step attribute to ForNode could simplify such patterns and improve both readability and performance, though I expect there are quite a few challenges in implementing it.