[TIR] Any Infra to support Block Rasterization?

LeiWang1999 · January 17, 2024, 5:20am

Block rasterization is an important technique for improving L2 cache locality on GPUs. Implementing it requires remapping blockIdx before the computation starts.

Say we currently have a rasterization device function:

__device__ dim3 rasterization2DColumn(const int panel_width) {
    const auto baseBlockIdx = blockIdx.x + gridDim.x * blockIdx.y;
    const auto totalPanel = (gridDim.x * gridDim.y + panel_width * gridDim.x - 1) / (panel_width * gridDim.x);
    const auto totalBlock = gridDim.x * gridDim.y;
    const auto panelIdx = baseBlockIdx / (panel_width * gridDim.x);
    const auto strideLd = panelIdx + 1 < totalPanel ? panel_width : (totalBlock - panelIdx * (panel_width * gridDim.x)) / gridDim.x;
    const auto bx = (panelIdx & 1) ? gridDim.x - (baseBlockIdx - panelIdx * panel_width * gridDim.x) / strideLd - 1 : (baseBlockIdx - panelIdx * panel_width * gridDim.x) / strideLd;
    const auto by = (baseBlockIdx - panelIdx * panel_width * gridDim.x) % strideLd + panelIdx * panel_width;
    const auto bz = blockIdx.z;

    dim3 blockIdx(bx, by, bz);
    return blockIdx;
}
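To make the remapping easier to reason about, here is a minimal host-side sketch in Python that mirrors the index arithmetic of rasterization2DColumn for the x/y dimensions. The function names and the grid sizes are illustrative assumptions, not part of TVM; the sketch only checks that every (bx, by) pair is visited exactly once and shows that consecutive linear block IDs walk down a column of height panel_width.

```python
def rasterization_2d_column(bx, by, grid_x, grid_y, panel_width):
    """Host-side mirror of the device function (x/y only, hypothetical helper)."""
    panel_size = panel_width * grid_x            # blocks per full panel
    total_block = grid_x * grid_y
    total_panel = (total_block + panel_size - 1) // panel_size
    base = bx + grid_x * by                      # linear block ID
    panel_idx = base // panel_size
    if panel_idx + 1 < total_panel:
        stride = panel_width                     # full panel
    else:
        # last panel may contain fewer rows than panel_width
        stride = (total_block - panel_idx * panel_size) // grid_x
    offset = base - panel_idx * panel_size       # position inside the panel
    if panel_idx & 1:
        new_bx = grid_x - offset // stride - 1   # odd panels run right to left
    else:
        new_bx = offset // stride
    new_by = offset % stride + panel_idx * panel_width
    return new_bx, new_by


def is_permutation(grid_x, grid_y, panel_width):
    """Check that the remapping visits every block exactly once."""
    mapped = {
        rasterization_2d_column(bx, by, grid_x, grid_y, panel_width)
        for by in range(grid_y)
        for bx in range(grid_x)
    }
    expected = {(x, y) for y in range(grid_y) for x in range(grid_x)}
    return mapped == expected


if __name__ == "__main__":
    # Print the first few remapped blocks of an assumed 4x4 grid.
    for base in range(4):
        print(base, rasterization_2d_column(base % 4, base // 4, 4, 4, 2))
    print(is_permutation(4, 5, 2))
```

For a 4x4 grid with panel_width 2, linear IDs 0, 1, 2, 3 map to (0, 0), (0, 1), (1, 0), (1, 1), so neighbouring blocks stay inside a narrow panel instead of sweeping a whole row.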

and a code fragment to invoke it:

"const dim3 blockIdx(rasterization2DColumn({}));".format(self.panel_width_)

How can we inject this code into the generated C source?

For the rasterization device function, we can use sch.annotate(main_block, ann_key="pragma_import_c", ann_val=device_func) to inject the definition. However, I didn't find any infra to inject the invocation:

main_kernel_0() {
    // shuffle blockIdx to improve L2 cache locality
    const dim3 blockIdx(rasterization2DColumn({}));
    // original code follows
}

Any suggestions on how to handle this? Maybe we could provide an annotation/pragma to inject a string or call expression at a given point in the AST.
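Until such an annotation exists, one possible workaround is to post-process the generated source text. The sketch below is plain Python string manipulation, not a TVM API: inject_after_kernel_entry is a hypothetical helper that inserts a statement right after the opening brace of every __global__ kernel. It assumes the kernel signature contains no '{' or ';' before the body. If the TVM version in use honours a CUDA post-processing callback (I believe a global function named tvm_callback_cuda_postproc is looked up during the CUDA build, but please verify this against your version), the helper could be called from there.

```python
import re

# Matches from "__global__" up to the first "{" of the kernel body.
# Forward declarations end in ";" before any "{", so they are skipped.
_KERNEL_ENTRY = re.compile(r"__global__[^{;]*\{")


def inject_after_kernel_entry(source: str, statement: str) -> str:
    """Insert `statement` as the first line of each __global__ kernel body.

    Hypothetical helper for illustration; it does not parse C, it only
    pattern-matches the kernel header.
    """
    return _KERNEL_ENTRY.sub(
        lambda m: m.group(0) + "\n  " + statement, source
    )


if __name__ == "__main__":
    generated = (
        'extern "C" __global__ void main_kernel0(float* A);\n'
        'extern "C" __global__ void main_kernel0(float* A) {\n'
        "  A[0] = 1.0f;\n"
        "}\n"
    )
    panel_width = 4  # assumed value for the example
    call = "const dim3 blockIdx(rasterization2DColumn({}));".format(panel_width)
    print(inject_after_kernel_entry(generated, call))
```

This keeps the device function injection on the pragma_import_c path and only handles the invocation textually, which is fragile compared with a proper annotation but needs no changes to the code generator.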