CUDA_ERROR_NO_BINARY_FOR_GPU when trying to load a module on a different GPU

So I’ve compiled a model with CUDA as the target, but it seems that the compiled model only works on GPUs with the same architecture as the machine I built it on. On other machines I get a CUDA_ERROR_NO_BINARY_FOR_GPU error. I thought that, since CUDA is backwards compatible, building for the latest GPU would make the module work on all GPUs. Is that not the case? Is there a way to do this without using OpenCL or Vulkan?

You are right: we can only load a compiled module on a GPU with the same architecture it was built for.

However, the assumption that building for the latest GPU will make it work on all GPUs does not hold, since:

  1. New architectures bring new features (e.g. Tensor Cores with different shapes, async memory copy, etc.) that are not supported by older cards;
  2. To fully utilize the hardware resources and get good performance, we enable features according to the arch when you build/tune the model;
  3. Even if we do not use arch-specific features and only build with naive CUDA code, the best-tuned result on arch A is usually not the best on arch B, because the architecture details differ.

In this case, I recommend you tune on your target environment.
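
As a quick sanity check when you hit this error, you can compare the compute capability the runtime sees with the architecture the module was built for. A minimal sketch, assuming a recent TVM with the CUDA runtime enabled (older versions use tvm.gpu(0) instead of tvm.cuda(0)):

import tvm

# Query the compute capability of the local GPU,
# e.g. "7.5" for an RTX 2080 Ti or "8.6" for an RTX 3080.
dev = tvm.cuda(0)
print("device present:", dev.exist)
print("compute capability:", dev.compute_version)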

Hey, thank you for the response.

The goal is for customers to be able to run this model on their own hardware, so we have to target all NVIDIA architectures. I believe CUDA can handle this through JIT compilation of embedded PTX, which you enable with the -gencode option of nvcc, see: Matching CUDA arch and CUDA gencode for various NVIDIA architectures - Arnon Shimoni. Is it possible to set these compilation flags in TVM?

For performance, it is okay if the model is not perfectly optimised for every GPU, as long as the execution time stays within the same order of magnitude.

I managed to compile a model for multiple architectures by overriding the arch with

arch = [
        "-arch=sm_52",
        "-gencode=arch=compute_52,code=sm_52",
        "-gencode=arch=compute_60,code=sm_60",
        "-gencode=arch=compute_61,code=sm_61",
        "-gencode=arch=compute_70,code=sm_70",
        "-gencode=arch=compute_75,code=sm_75",
        "-gencode=arch=compute_80,code=sm_80",
        "-gencode=arch=compute_86,code=sm_86",
        "-gencode=arch=compute_86,code=compute_86",
    ]

in compile_cuda. Is there an option I can set to do this without modifying the code?

I was also wondering whether I could optimise for several different GPUs at the same time, to make sure the model performs well on each of them.

I’m not sure about the JIT mode of nvcc. It would be great if you could try it and share some performance comparisons between JIT mode and the default mode.

The principle is that backward compatibility is obviously a good thing, as long as it does not hurt the peak performance on the current arch.

I just tried this with the hack I explained above. The performance on the target GPU (an RTX 2080 Ti) is similar.

However, it was about 50% slower on the RTX 3080. I believe this is not because of the backwards compatibility but because the model was overtuned towards one specific GPU. If I could tune with multiple different GPUs, I might be able to create a compiled model that performs well on a range of GPUs.

Thanks for the information. I’m still a bit concerned, since I see the following statement in your link:

Sample flags for generation on CUDA 11.4 for best performance with RTX 3080 cards:

-arch=sm_80 \ 
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_86,code=compute_86

So I think we may need rigorous experiments to disprove this conclusion.

cc @vinx13 @spectrometerHBH @junrushao if you are interested

You can register and override this function to pass custom options to compile_cuda: https://github.com/apache/tvm/blob/697fdb2cb7dc7ad07ed826f908390b88106cc98f/python/tvm/contrib/nvcc.py#L187

The best config for one GPU might not be optimal for another, because they have different hardware characteristics. It is necessary to do some experiments to analyze the cause. For example, instead of loading the module on a different GPU, you can also try compiling on one GPU with the tuning logs from another GPU, as in the sketch below.
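
A minimal sketch of that second experiment with the auto-scheduler; mod and params are assumed to be your Relay module and parameters, and the log file name is just a placeholder:

import tvm
from tvm import auto_scheduler, relay

# Hypothetical log file produced by tuning on the other GPU (e.g. the RTX 2080 Ti).
log_file = "tuning_logs_2080ti.json"

# Apply the tuning records from the other GPU while building for this one.
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target="cuda", params=params)

lib.export_library("model.so")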

Thank you for the info. I’ve overridden the nvcc compilation with:

import tvm
from tvm.contrib.nvcc import compile_cuda


def tvm_callback_cuda_compile(code):
    """Use nvcc to build a fatbin with SASS for several architectures plus PTX,
    so that newer GPUs can JIT-compile the kernels."""
    arch = [
        "-arch=sm_52",
        "-gencode=arch=compute_52,code=sm_52",
        "-gencode=arch=compute_60,code=sm_60",
        "-gencode=arch=compute_61,code=sm_61",
        "-gencode=arch=compute_70,code=sm_70",
        "-gencode=arch=compute_75,code=sm_75",
        "-gencode=arch=compute_80,code=sm_80",
        "-gencode=arch=compute_86,code=sm_86",
        "-gencode=arch=compute_86,code=compute_86",  # embed PTX for JIT on future archs
    ]
    fatbin = compile_cuda(code, target_format="fatbin", arch=arch)
    return fatbin


tvm._ffi.register_func(
    "tvm_callback_cuda_compile", tvm_callback_cuda_compile, override=True
)

This targets GTX 7XX to RTX 3XXX cards. I’ve tuned my model for the RTX 2080 Ti with and without overriding tvm_callback_cuda_compile and didn’t measure a meaningful difference (the multi-architecture build was even 2% faster on average). The performance of this tuned model is about 10% slower than the PyTorch equivalent, but I only tuned it with num_measure_trials=2000 to save some time, so it is possible that it would improve further with more trials.
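
For reference, a sketch of roughly how such a tuning run looks with the auto-scheduler (assuming that is what is used here); the task extraction and log file name are placeholders:

from tvm import auto_scheduler

# tasks and task_weights are assumed to come from
# auto_scheduler.extract_tasks(mod["main"], params, target="cuda").
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=2000,  # the trial budget mentioned above
    measure_callbacks=[auto_scheduler.RecordToFile("tuning_logs_2080ti.json")],
)
tuner.tune(tune_option)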

I then exported it as a library and ran it on an RTX 3080. There it ran 50% slower than on the RTX 2080 Ti and about 10x slower than the same model in PyTorch. I also tried compiling the model with TVM using the tuning logs from the RTX 2080 Ti for just the RTX 3080 (without overriding tvm_callback_cuda_compile). This was also about 10x slower than the PyTorch model. Again, I couldn’t find a meaningful difference between the model compiled only for the RTX 3080 and the model compiled for multiple GPU architectures.
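
For anyone reproducing these timings, a sketch of how the exported library can be benchmarked on whatever GPU is present; the file name, input name, and shape are placeholders:

import numpy as np
import tvm
from tvm.contrib import graph_executor

# Load the exported library on the local GPU.
lib = tvm.runtime.load_module("model.so")
dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))

# Feed a random input; "input0" and its shape stand in for the real model input.
module.set_input("input0", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))

# Measure the mean latency with the built-in time evaluator.
ftimer = module.module.time_evaluator("run", dev, number=10, repeat=3)
res_ms = np.array(ftimer().results) * 1000
print("mean inference time: %.2f ms (std %.2f ms)" % (res_ms.mean(), res_ms.std()))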

This seems to suggest that compiling for multiple GPUs doesn’t degrade performance. However, to make sure that a binary works well on different GPUs, it should also be tuned for these GPUs.

I believe one way to achieve this could be to measure the performance on multiple GPUs during the evolutionary search. It also seems likely that the compiled model would then perform reasonably well on GPUs that weren’t included in the measurements, since the resulting binary should be more GPU-agnostic.

It could also be interesting to see how other libraries like cuDNN achieve good performance on a variety of GPUs. Maybe we can introduce some new mutations inspired by their approach. I’m not very familiar with this domain but would be interested in looking into this.

Hey, I was wondering how to best optimise a model for multiple GPUs. Is it possible to measure on multiple different GPUs, take the average, and use that as the score in the evolutionary search?

There is no such mechanism. I recommend tuning on each target device and building different modules for different devices. If you still want only one module for multiple GPUs, vendor-provided libraries (cuBLAS, cuDNN) are a good choice.
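
If you go that route, a minimal sketch of pointing TVM at those libraries through the target string (this assumes TVM was built with USE_CUDNN and USE_CUBLAS enabled, and mod/params are your model):

import tvm
from tvm import relay

# Offload the supported operators to cuDNN/cuBLAS instead of tuned TVM kernels.
target = tvm.target.Target("cuda -libs=cudnn,cublas")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
lib.export_library("model_cudnn.so")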