Do we have plans to introduce a step attribute to ForNode?

In many CUDA kernels, the conventional pattern for thread iteration looks like this:

for (int i = thread_idx; i < numel; i += num_threads)
    out[i] = 0;

However, in TileLang we currently have to write:

for i in T.serial(0, T.ceildiv(numel - thread_idx, num_threads)):
    j = thread_idx + i * num_threads
    out[j] = 0

This is cumbersome, since it requires manually computing the trip count and transforming the index, and it also introduces additional register usage and makes index computation less efficient.
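To make the rewrite concrete, here is a minimal plain-Python sketch (not TileLang code) checking that the strided loop and the unit-step workaround visit the same indices. The helper names `ceildiv`, `strided_indices`, and `normalized_indices` are made up for illustration, and `ceildiv` is only assumed to match `T.ceildiv` on integers with a positive divisor.

```python
def ceildiv(a, b):
    # Ceiling division for a positive divisor b
    # (assumed to match T.ceildiv on plain integers).
    return -(-a // b)


def strided_indices(thread_idx, numel, num_threads):
    # CUDA-style loop: for (i = thread_idx; i < numel; i += num_threads)
    return list(range(thread_idx, numel, num_threads))


def normalized_indices(thread_idx, numel, num_threads):
    # Unit-step workaround: loop over the trip count, then rebuild the index.
    trip_count = max(0, ceildiv(numel - thread_idx, num_threads))
    return [thread_idx + i * num_threads for i in range(trip_count)]


if __name__ == "__main__":
    # Compare both forms for every thread of a small hypothetical launch.
    numel, num_threads = 10, 4
    for thread_idx in range(num_threads):
        a = strided_indices(thread_idx, numel, num_threads)
        b = normalized_indices(thread_idx, numel, num_threads)
        print(thread_idx, a, a == b)
```

The extra work in the second form is the trip-count computation and the per-iteration `thread_idx + i * num_threads`, which a step attribute on the loop would express directly.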

Introducing a step attribute to ForNode could simplify such patterns and improve both readability and performance, but I guess there are a lot of challenges in this area.

If there is demand, I think it is not a bad thing to have.