Do we have plan to introduce step attribute to ForNode?

In many CUDA kernels, the conventional pattern for thread iteration looks like this:

for (int i = thread_idx; i < numel; i += num_threads)
    out[i] = 0;

However, in TileLang we currently have to write:

for i in T.serial(0, T.ceildiv(numel - thread_idx, num_threads)):
    j = thread_idx + i * num_threads
    out[j] = -1

This is not only cumbersome β€” since it requires manually computing the range and performing index transformations β€” but it also introduces additional register usage and reduces index computation efficiency.

Introducing a step attribute to ForNode could simplify such patterns and improve both readability and performance but I guess there’s a lot of challenges about this part.

if there is demand, i thinkit is not a bad thing to have

I test cuda code like below and indeed get different inst sequence & register use counts. It is a surprise since backend compiler do not optimize them to the same binary codes :joy:.

__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = tid; i < n; i += stride) {
        C[i] = A[i] + B[i];
    }
}

__global__ void vecAdd2(const float *A, const float *B, float *C, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int j = 0; j < (n + stride - 1) / stride; ++j) {
        int i = tid + j * stride;
        C[i] = A[i] + B[i];
    }
}

So it seems to be good to support steped loop node. Is there already any (pre)rfcs about this thread? cc @LeiWang1999 @tqchen

1 Like

This is a draft for the issue.