I want the auto-scheduler to generate kernels whose grid size is at most 40. My naive idea is to add the following code to cuda_module.cc:

    if (wl.grid_dim(0) > 40) {
      // Deliberately invalid index, so any launch with grid_dim(0) > 40 crashes.
      result = cuLaunchKernel(fcache_[-999],
                              wl.grid_dim(0), wl.grid_dim(1), wl.grid_dim(2),
                              wl.block_dim(0), wl.block_dim(1), wl.block_dim(2),
                              wl.dyn_shmem_size, strm, void_args, nullptr);
    } else {
      result = cuLaunchKernel(fcache_[device_id],
                              wl.grid_dim(0), wl.grid_dim(1), wl.grid_dim(2),
                              wl.block_dim(0), wl.block_dim(1), wl.block_dim(2),
                              wl.dyn_shmem_size, strm, void_args, nullptr);
    }

With this change, whenever a schedule generated by the auto-scheduler has a grid size greater than 40, the kernel launch causes a segmentation fault. I therefore expected such schedules to fail measurement, so that the best schedule found after tuning would have a grid size of at most 40. However, the grid size after apply_best() is 96. Do you know where the problem is? If I want to realize my idea, how can I do it? Thanks!
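
Edit: to make the intent concrete, here is a minimal standalone sketch of the check I have in mind, written outside of TVM. The names LaunchDims, kMaxGridDimX, GridWithinLimit and CheckGridOrThrow are hypothetical and are not part of TVM or the CUDA driver API. It raises a defined error instead of indexing fcache_ out of bounds. Whether the auto-scheduler would count such an error as a failed measurement is an assumption on my side.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the launch dimensions (grid_dim / block_dim)
// that the real launch code reads from its work-load object.
struct LaunchDims {
  std::size_t grid[3];
  std::size_t block[3];
};

// Hypothetical limit taken from the question: grid_dim(0) must not exceed 40.
constexpr std::size_t kMaxGridDimX = 40;

// Returns true when the x-dimension of the grid is within the limit.
bool GridWithinLimit(const LaunchDims& d, std::size_t limit = kMaxGridDimX) {
  return d.grid[0] <= limit;
}

// Rejects oversized grids with an exception rather than an out-of-bounds
// access, so the failure is well defined and carries a readable message.
void CheckGridOrThrow(const LaunchDims& d) {
  if (!GridWithinLimit(d)) {
    throw std::runtime_error("grid_dim(0)=" + std::to_string(d.grid[0]) +
                             " exceeds limit " + std::to_string(kMaxGridDimX));
  }
}
```

In the real launch path, the idea would be to call something like CheckGridOrThrow before cuLaunchKernel, so a grid size of 96 is rejected while 40 is still launched.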