Is there any way to make the autoscheduler generate a kernel with a specified CUDA grid size?

I want the autoscheduler to generate a kernel with a grid size of at most 40. My naive idea is to add the following code to cuda_module.cc:

if (wl.grid_dim(0) > 40) {
  result = cuLaunchKernel(fcache_[-999], wl.grid_dim(0), wl.grid_dim(1),
                          wl.grid_dim(2), wl.block_dim(0), wl.block_dim(1),
                          wl.block_dim(2), wl.dyn_shmem_size, strm, void_args, nullptr);
} else {
  result = cuLaunchKernel(fcache_[device_id], wl.grid_dim(0), wl.grid_dim(1),
                          wl.grid_dim(2), wl.block_dim(0), wl.block_dim(1),
                          wl.block_dim(2), wl.dyn_shmem_size, strm, void_args, nullptr);
}
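To make the intended condition explicit, here is a minimal standalone sketch of the check, separated from the launch call. `GridDims`, `kMaxGridX`, `ExceedsGridLimit`, and `SelfCheck` are hypothetical names used only for illustration and are not part of TVM; in cuda_module.cc the values would come from wl.grid_dim(0), wl.grid_dim(1), and wl.grid_dim(2). Like the snippet above, it only looks at the x dimension.

```cpp
#include <cassert>

// Hypothetical stand-in for the launch dimensions; not a TVM type.
struct GridDims {
  unsigned x, y, z;
};

// Assumed limit taken from the question: at most 40 blocks along x.
constexpr unsigned kMaxGridX = 40;

// Returns true when a launch with these dimensions should be rejected.
// Only the x dimension is checked, matching the snippet above.
constexpr bool ExceedsGridLimit(const GridDims& g) {
  return g.x > kMaxGridX;
}

// Small self-check of the boundary cases.
inline void SelfCheck() {
  assert(!ExceedsGridLimit({40, 1, 1}));  // exactly at the limit: allowed
  assert(ExceedsGridLimit({96, 1, 1}));   // the size seen after apply_best()
}
```

With a predicate like this, the rejected branch could presumably report an error instead of indexing fcache_ out of bounds, but I have not verified how the autoscheduler's measurement treats either kind of failure.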

With the change to cuda_module.cc, the kernel launch causes a segmentation fault whenever a schedule generated by AutoScheduler has a grid size greater than 40. I therefore expected that, once the autoscheduler has finished tuning, the best schedule would have a grid size of at most 40, but the grid size after apply_best() is 96. Do you know where the problem is? If I want to realize my idea, how can I do it? Thanks!