CUDA_ERROR_INVALID_PTX when trying to run single conv2d layer after compilation

I am trying to run a model consisting of a single conv2d layer on a Jetson TX2. The layer is as follows:

input data shape = (1, 512, 16, 20)
num output channels = 256
kernel size = 5
stride = 1
padding = 2
bias = False
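
For reference, the output and weight sizes implied by these parameters can be worked out with standard convolution arithmetic. This is a minimal sketch that assumes NCHW layout (so 512 is the input channel count); the helper name conv2d_out_size is made up for illustration and is not a TVM function.

```python
def conv2d_out_size(size, kernel, stride, padding):
    """Output extent along one spatial axis for a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


# Layer parameters from the post; NCHW layout is an assumption.
n, c_in, h, w = 1, 512, 16, 20
c_out, k, s, p = 256, 5, 1, 2

out_shape = (n, c_out, conv2d_out_size(h, k, s, p), conv2d_out_size(w, k, s, p))
weight_elems = c_out * c_in * k * k  # no bias term

print(out_shape)
print(weight_elems)
```

With padding 2 and a 5x5 kernel at stride 1, the spatial size is preserved, so only the channel count changes between input and output.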

I am cross-compiling with target="cuda" and target_host="llvm -target=aarch64-linux-gnu". Both the host machine I compile on and the TX2 run cuda-8.0 and llvm-4.0. Running the compiled layer fails with the following error:

tvm._ffi.base.TVMError: Except caught from RPC call: [20:21:10] /home/nvidia/tvm/src/runtime/module_util.cc:52: Check failed: ret == 0 (-1 vs. 0) [20:21:10] /home/nvidia/tvm/src/runtime/cuda/cuda_module.cc:91: CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_INVALID_PTX

If I reduce num input channels to 256 and num output channels to 128, the error does not appear and I am able to run the layer on the TX2 successfully.

I replicated this problem on my host machine running CUDA, so I do not suspect this to be a TX2 problem. Does anyone have suggestions on how I could further debug this?

CUDA_ERROR_INVALID_PTX here means the generated kernel uses too much shared/local memory.
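
To see how a schedule can run out of shared memory, here is a rough back-of-the-envelope sketch. It assumes a simple tiling scheme in which each thread block caches one input tile and one weight tile in shared memory. The tile sizes, the function names, and the 48 KB per-block limit are illustrative assumptions; this is not TVM's actual schedule, and the real limit should be queried from the device.

```python
# Assumed per-block shared memory limit; check your device's actual value.
SHARED_LIMIT_BYTES = 48 * 1024


def shared_bytes(tile_oc, tile_ic, tile_h, tile_w, kernel, dtype_bytes=4):
    """Shared memory needed if a block caches one input tile and one weight tile.

    Hypothetical tiling for a stride-1 conv2d, not TVM's real schedule.
    """
    # The input tile needs a halo of (kernel - 1) pixels in each spatial axis.
    in_tile = tile_ic * (tile_h + kernel - 1) * (tile_w + kernel - 1)
    w_tile = tile_oc * tile_ic * kernel * kernel
    return (in_tile + w_tile) * dtype_bytes


def fits(**tiles):
    """True if the tiling stays within the assumed per-block limit."""
    return shared_bytes(**tiles) <= SHARED_LIMIT_BYTES


small = dict(tile_oc=32, tile_ic=8, tile_h=4, tile_w=4, kernel=5)
large = dict(tile_oc=64, tile_ic=16, tile_h=4, tile_w=4, kernel=5)

print(shared_bytes(**small), fits(**small))
print(shared_bytes(**large), fits(**large))
```

The weight tile grows with the product of the two channel tile sizes, which is consistent with the error disappearing when both channel counts are halved, assuming the schedule's tile sizes scale with the channel counts.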

The old schedules in current topi are stale. Do not try them.

Please refer to this tutorial for single-layer performance: https://docs.tvm.ai/tutorials/autotvm/tune_conv2d_cuda.html

For the TX2 it is a little different: you should use an RPC tracker and an RPC server. This tutorial shows how to register your device with the tracker: https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_arm.html#start-rpc-tracker

This happened to me because my NVIDIA driver and CUDA toolkit were incompatible. Make sure you reboot after installing CUDA.

Adding a possible workaround for this issue. My code runs successfully with (sm_72 + cuda11.1 + RTX3090) and (sm_72 + cuda11.1 + RTX A5000), but it fails with (sm_72 + cuda11.1 + V100) and throws the error above. Switching to (sm_70 + cuda11.1 + V100) fixes the problem. I am not sure why sm_72 is not backward compatible with the V100 in this case.
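
A likely explanation for the V100 result: PTX built for a given architecture can be JIT-compiled only on devices whose compute capability is equal or higher. The V100 has compute capability 7.0, which is lower than 7.2, while the RTX 3090 and RTX A5000 are both 8.6. The sketch below encodes that rule. It assumes the module was shipped as PTX rather than as a prebuilt cubin, and the helper names are made up for illustration.

```python
# Compute capabilities as published by NVIDIA for these GPUs.
DEVICE_CC = {
    "V100": (7, 0),
    "RTX 3090": (8, 6),
    "RTX A5000": (8, 6),
}


def parse_arch(arch):
    """Turn 'sm_72' into (7, 2). Assumes a single-digit minor version."""
    digits = arch.split("_", 1)[1]
    return int(digits[:-1]), int(digits[-1])


def ptx_loadable(arch, device):
    """PTX rule: the device capability must be >= the capability targeted."""
    return parse_arch(arch) <= DEVICE_CC[device]


for arch, device in [("sm_72", "RTX 3090"), ("sm_72", "RTX A5000"),
                     ("sm_72", "V100"), ("sm_70", "V100")]:
    print(arch, device, ptx_loadable(arch, device))
```

Under this rule, the only failing combination is sm_72 on the V100, which matches the results reported above. Targeting the lowest compute capability among the GPUs you need to support avoids the problem.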