In many CUDA kernels, the conventional pattern for thread iteration looks like this:
for (int i = thread_idx; i < numel; i += num_threads)
    out[i] = 0;
However, in TileLang we currently have to write:
for i in T.serial(0, T.ceildiv(numel - thread_idx, num_threads)):
    j = thread_idx + i * num_threads
    out[j] = 0
This is not only cumbersome, since it requires manually computing the trip count and performing the index transformation by hand, but it also introduces additional register usage and makes index computation less efficient.
Introducing a step attribute to ForNode could simplify such patterns and improve both readability and performance, though I expect there are quite a few challenges involved.
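As a rough sketch of what the frontend could look like, assuming T.serial (or a new stepped-loop primitive) accepted a range-style start/stop/step signature; the step argument below is purely hypothetical and does not exist in the current API:

for i in T.serial(thread_idx, numel, step=num_threads):  # hypothetical step argument
    out[i] = 0

With such a loop node, the lowering could emit the CUDA-style strided loop directly, without the manual ceildiv trip count or the extra index variable.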
I tested the CUDA code below and indeed got different instruction sequences and register use counts. This was a surprise to me, since the backend compiler does not optimize the two into the same binary code.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = tid; i < n; i += stride) {
        C[i] = A[i] + B[i];
    }
}
__global__ void vecAdd2(const float *A, const float *B, float *C, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // per-thread trip count (matches the TileLang ceildiv formulation and avoids
    // out-of-bounds accesses when n is not a multiple of stride)
    for (int j = 0; j < (n - tid + stride - 1) / stride; ++j) {
        int i = tid + j * stride;
        C[i] = A[i] + B[i];
    }
}
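(For anyone who wants to reproduce this: per-kernel register usage can be printed by compiling with nvcc -Xptxas -v, and the generated SASS can be compared with cuobjdump --dump-sass.)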
So it seems worthwhile to support a stepped loop node. Are there already any (pre-)RFCs about this topic? cc @LeiWang1999 @tqchen