[TIR] Any Infra to support Block Rasterization?

Block rasterization is an important technique for improving L2 cache locality on GPUs. Implementing it requires remapping blockIdx at the start of the kernel, before any computation.

Say we have the following rasterization device function:

__device__ dim3 rasterization2DColumn(const int panel_width) {
    // Linearized index of this block in the original 2D grid.
    const auto baseBlockIdx = blockIdx.x + gridDim.x * blockIdx.y;
    const auto totalPanel = (gridDim.x * gridDim.y + panel_width * gridDim.x - 1) / (panel_width * gridDim.x);
    const auto totalBlock = gridDim.x * gridDim.y;
    const auto panelIdx = baseBlockIdx / (panel_width * gridDim.x);
    // The last panel may hold fewer than panel_width rows.
    const auto strideLd = panelIdx + 1 < totalPanel ? panel_width : (totalBlock - panelIdx * (panel_width * gridDim.x)) / gridDim.x;
    // Odd-numbered panels walk x in reverse order.
    const auto bx = (panelIdx & 1) ? gridDim.x - (baseBlockIdx - panelIdx * panel_width * gridDim.x) / strideLd - 1 : (baseBlockIdx - panelIdx * panel_width * gridDim.x) / strideLd;
    const auto by = (baseBlockIdx - panelIdx * panel_width * gridDim.x) % strideLd + panelIdx * panel_width;
    const auto bz = blockIdx.z;

    dim3 blockIdx(bx, by, bz);
    return blockIdx;
}
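To make the remapping easier to reason about, here is a minimal host-side sketch that mirrors the index arithmetic of rasterization2DColumn in plain Python. The names rasterization_2d_column and remap_grid are made up for this illustration, blockIdx.z is ignored, and the grid size is passed in explicitly instead of being read from gridDim.

```python
def rasterization_2d_column(bx, by, grid_x, grid_y, panel_width):
    """Python model of rasterization2DColumn for a single block (z ignored)."""
    base = bx + grid_x * by                      # linearized block index
    panel_size = panel_width * grid_x            # blocks per full panel
    total_block = grid_x * grid_y
    total_panel = (total_block + panel_size - 1) // panel_size
    panel_idx = base // panel_size
    # The last panel may hold fewer than panel_width rows.
    if panel_idx + 1 < total_panel:
        stride = panel_width
    else:
        stride = (total_block - panel_idx * panel_size) // grid_x
    offset = base - panel_idx * panel_size       # position inside the panel
    col = offset // stride
    # Odd-numbered panels walk x in reverse order.
    new_bx = grid_x - col - 1 if panel_idx & 1 else col
    new_by = offset % stride + panel_idx * panel_width
    return new_bx, new_by


def remap_grid(grid_x, grid_y, panel_width):
    """Map every original (bx, by) to its rasterized (bx, by)."""
    return {
        (bx, by): rasterization_2d_column(bx, by, grid_x, grid_y, panel_width)
        for by in range(grid_y)
        for bx in range(grid_x)
    }
```

Calling remap_grid(4, 5, 2), for example, gives the new coordinates for a 4x5 grid with a panel width of 2. Consecutive linear block indices land on neighbouring rows of the same column first, which is what keeps the working set of concurrently running blocks small.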

and a code fragment that invokes it:

"const dim3 blockIdx(rasterization2DColumn({}));".format(self.panel_width_)

How can we inject this code into the generated C code?

For the rasterization device function, we can use sch.annotate(main_block, ann_key="pragma_import_c", ann_val=device_func) to inject the definition. However, I didn't find any infra to inject the invocation:

main_kernel_0() {
    // shuffle blockIdx to improve L2 cache locality
    const dim3 blockIdx(rasterization2DColumn({}));
    // original code follows
}
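As a stopgap, the same result can be approximated by patching the generated source as a string. The sketch below is only an illustration of that idea: inject_after_kernel_entry is a hypothetical helper, not a TVM API, and it assumes the kernel signature contains no nested parentheses in its argument list. I believe TVM's CUDA build can call a registered post-processing function named tvm_callback_cuda_postproc on the generated source, which might be a place to run such a helper, but treat that as an assumption to verify against your TVM version.

```python
import re


def inject_after_kernel_entry(source, kernel_name, statement):
    """Insert `statement` as the first line of the body of `kernel_name`.

    Hypothetical helper: works on the generated source text only and
    knows nothing about the TIR AST.
    """
    # Match the kernel name, its argument list, and the opening brace.
    pattern = re.compile(
        r"(\b" + re.escape(kernel_name) + r"\s*\([^)]*\)\s*\{)"
    )
    # A function is used as the replacement so that backslashes in
    # `statement` are never interpreted as regex escapes.
    patched, count = pattern.subn(
        lambda m: m.group(1) + "\n  " + statement, source, count=1
    )
    if count == 0:
        raise ValueError("kernel %r not found in source" % kernel_name)
    return patched
```

With the fragment from above, usage would look like inject_after_kernel_entry(code, "main_kernel_0", "const dim3 blockIdx(rasterization2DColumn({}));".format(panel_width)). This is fragile compared with a proper annotation, since it depends on the exact text the code generator emits.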

Any suggestions on how to handle this? Maybe we could provide an annotation/pragma to inject a string (or a call expression) at a given point in the AST.