I’d suggest enhancing For to support this pattern:
__global__
void saxpy(int n, float a, float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        y[i] = a * x[i] + y[i];
    }
}
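For context, the appeal of this pattern is that the launch configuration is decoupled from the problem size: each thread strides through the whole array, so any grid size covers all n elements. A minimal launch sketch (the block and grid sizes below, and the device pointers d_x/d_y, are arbitrary choices for illustration):

int blockSize = 256;
int numBlocks = 32;   // can be tuned independently of n
saxpy<<<numBlocks, blockSize>>>(n, 2.0f, d_x, d_y);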
@tqchen Do you think it is a good idea to support this kind of for expression in HalideIR? It would be helpful when writing kernels with the low-level API.
Is this really the case? codegen_c is only one of the many “backends” of TVM.
If I remember correctly, For loops are normalized in tvm.schedule.normalize(), and (AFAIK) this is a simplification that makes InferBound easier.
Most likely we can use the same normalized loop to represent the same program, and a low-level program optimizer would detect such a loop and rewrite it to the strided version:
for (int i = 0; i < extent; i++) {
    y[i * stride + min] = a * x[i * stride + min] + y[i * stride + min];
}
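To make the connection concrete, here is a sketch (plain CUDA, not TVM IR) of the normalized form of the same saxpy kernel: choosing min and stride per thread and deriving the extent from n recovers the behaviour of the grid-stride loop above. The kernel name and the extent formula are mine, for illustration only.

__global__
void saxpy_normalized(int n, float a, float *x, float *y)
{
    // Per-thread choice of min/stride; extent is the number of iterations
    // this thread performs: ceil((n - min) / stride), or 0 if min >= n.
    int min    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    int extent = (min < n) ? (n - min + stride - 1) / stride : 0;

    for (int i = 0; i < extent; i++) {
        y[i * stride + min] = a * x[i * stride + min] + y[i * stride + min];
    }
}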