Block rasterization is an important technique to improve L2 cache locality in GPUs. The implementation requires shuffling blockIdx
before computation.
Say we currently have the following rasterization device function:
__device__ dim3 rasterization2DColumn(const int panel_width) {
  const auto baseBlockIdx = blockIdx.x + gridDim.x * blockIdx.y;
  const auto totalPanel = (gridDim.x * gridDim.y + panel_width * gridDim.x - 1) / (panel_width * gridDim.x);
  const auto totalBlock = gridDim.x * gridDim.y;
  const auto panelIdx = baseBlockIdx / (panel_width * gridDim.x);
  const auto strideLd = panelIdx + 1 < totalPanel ? panel_width : (totalBlock - panelIdx * (panel_width * gridDim.x)) / gridDim.x;
  const auto bx = (panelIdx & 1) ? gridDim.x - (baseBlockIdx - panelIdx * panel_width * gridDim.x) / strideLd - 1 : (baseBlockIdx - panelIdx * panel_width * gridDim.x) / strideLd;
  const auto by = (baseBlockIdx - panelIdx * panel_width * gridDim.x) % strideLd + panelIdx * panel_width;
  const auto bz = blockIdx.z;
  dim3 blockIdx(bx, by, bz);
  return blockIdx;
}
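To make the remapping easier to inspect, here is a minimal host-side sketch that mirrors the arithmetic of rasterization2DColumn in plain Python. The names rasterization_2d_column and remap_grid are hypothetical helpers introduced only for illustration, and the sketch assumes blocks are enumerated in linear order (blockIdx.x fastest), which real hardware does not strictly guarantee.

```python
def rasterization_2d_column(block_x, block_y, grid_x, grid_y, panel_width):
    """Host-side mirror of rasterization2DColumn for one block (hypothetical helper)."""
    base = block_x + grid_x * block_y                  # baseBlockIdx
    panel_size = panel_width * grid_x                  # blocks per full panel
    total_panel = (grid_x * grid_y + panel_size - 1) // panel_size
    total_block = grid_x * grid_y
    panel_idx = base // panel_size
    if panel_idx + 1 < total_panel:
        stride = panel_width                           # full panel
    else:
        # the last panel may have fewer rows than panel_width
        stride = (total_block - panel_idx * panel_size) // grid_x
    local = base - panel_idx * panel_size              # offset inside the panel
    if panel_idx & 1:
        bx = grid_x - local // stride - 1              # odd panels run right to left
    else:
        bx = local // stride                           # even panels run left to right
    by = local % stride + panel_idx * panel_width
    return bx, by


def remap_grid(grid_x, grid_y, panel_width):
    """Remapped (bx, by) for every block, listed in linear launch order."""
    return [rasterization_2d_column(x, y, grid_x, grid_y, panel_width)
            for y in range(grid_y) for x in range(grid_x)]


if __name__ == "__main__":
    for linear_id, coord in enumerate(remap_grid(4, 4, 2)):
        print(linear_id, coord)
```

For a 4x4 grid with panel_width 2, the first four linear block ids map to (0, 0), (0, 1), (1, 0), (1, 1), so consecutive blocks walk down a panel of two rows before moving to the next column, and the second panel is traversed in the opposite x direction.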
We also have a code fragment to invoke it:
"const dim3 blockIdx(rasterization2DColumn({}));".format(self.panel_width_)
How can we inject this code into the generated C source?
For the rasterization device function, we can use sch.annotate(main_block, ann_key="pragma_import_c", ann_val=device_func)
to inject the definition. However, I didn’t find any infrastructure to inject the invocation:
main_kernel_0() {
  // shuffle blockIdx to improve L2 cache locality
  const dim3 blockIdx(rasterization2DColumn({}));
  // original code follows
}
Any suggestions on how to handle this? Maybe we could provide an annotation/pragma that injects a string (or a call expression) at a given point of the AST.
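As a stopgap, the invocation could be spliced into the generated source as plain text. Below is a minimal sketch of that idea, not an existing TVM facility: inject_at_kernel_entry is a hypothetical helper, and it assumes the kernel signature contains no nested parentheses in its parameter list. TVM's CUDA build path looks up a global function named tvm_callback_cuda_postproc when one is registered; assuming that hook is available in the version in use, a helper like this could be called from it.

```python
import re


def inject_at_kernel_entry(code: str, kernel_name: str, stmt: str) -> str:
    """Insert `stmt` right after the opening brace of `kernel_name` (hypothetical helper).

    Assumes the parameter list has no nested parentheses. A forward
    declaration ending in ';' is skipped because the pattern requires '{'.
    """
    pattern = re.compile(r"\b" + re.escape(kernel_name) + r"\s*\([^)]*\)\s*\{")
    match = pattern.search(code)
    if match is None:
        raise ValueError("kernel definition not found: " + kernel_name)
    pos = match.end()  # position just past the opening brace
    return code[:pos] + "\n  " + stmt + code[pos:]


# Illustrative input only; real generated kernels are longer.
panel_width = 4
stmt = "const dim3 blockIdx(rasterization2DColumn({}));".format(panel_width)
generated = (
    'extern "C" __global__ void main_kernel_0(float* A) {\n'
    "  A[blockIdx.x] = 0.0f;\n"
    "}\n"
)
patched = inject_at_kernel_entry(generated, "main_kernel_0", stmt)

if __name__ == "__main__":
    print(patched)
```

Text-level patching like this is fragile compared with a proper annotation, since it depends on how the code generator formats the kernel signature, which is why an AST-level injection point would still be preferable.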