In many CUDA kernels, the conventional pattern for thread iteration looks like this:
for (int i = thread_idx; i < numel; i += num_threads)
    out[i] = 0;
However, in TileLang we currently have to write:
for i in T.serial(0, T.ceildiv(numel - thread_idx, num_threads)):
    j = thread_idx + i * num_threads
    out[j] = 0
This is not only cumbersome, since it requires manually computing the trip count and performing the index transformation by hand, but it also introduces additional register usage and makes index computation less efficient.
Introducing a step attribute to ForNode could simplify such patterns and improve both readability and performance, though I expect there are quite a few challenges involved.
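As a rough sketch of what the frontend could look like, assuming T.serial (or a new stepped-loop primitive) accepted a range-style start/stop/step signature; the step argument below is purely hypothetical and does not exist in the current API:

for i in T.serial(thread_idx, numel, step=num_threads):  # hypothetical step argument
    out[i] = 0

With such a loop node, the lowering could emit the CUDA-style strided loop directly, without the manual ceildiv trip count or the extra index variable.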
I tested the CUDA code below and indeed got different instruction sequences and register use counts. This was a surprise to me, since the backend compiler does not optimize the two into the same binary code.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = tid; i < n; i += stride) {
        C[i] = A[i] + B[i];
    }
}
__global__ void vecAdd2(const float *A, const float *B, float *C, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // per-thread trip count (matches the TileLang ceildiv formulation and avoids
    // out-of-bounds accesses when n is not a multiple of stride)
    for (int j = 0; j < (n - tid + stride - 1) / stride; ++j) {
        int i = tid + j * stride;
        C[i] = A[i] + B[i];
    }
}
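(For anyone who wants to reproduce this: per-kernel register usage can be printed by compiling with nvcc -Xptxas -v, and the generated SASS can be compared with cuobjdump --dump-sass.)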
So it seems worthwhile to support a stepped loop node. Are there already any (pre-)RFCs about this topic? cc @LeiWang1999 @tqchen