nvidia-smi reports that my current environment (an NVIDIA T4 on AWS) has driver version 450.80.02 installed:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
nvcc reports runtime (toolkit) version 10.0.130:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
The toolkit-to-driver compatibility chart in NVIDIA’s docs indicates that runtime version 10.0.130 requires driver version >= 410.48. Since 450.80.02 >= 410.48, the system should be in order.
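As a side check, here is a minimal sketch (my own, not from TVM) that queries the driver API and runtime API versions directly with ctypes; it assumes libcuda.so.1 and libcudart.so.10.0 can be found by the dynamic loader:
import ctypes

# The driver API library ships with the NVIDIA driver; the runtime library ships with the toolkit.
libcuda = ctypes.CDLL("libcuda.so.1")
libcudart = ctypes.CDLL("libcudart.so.10.0")

ver = ctypes.c_int(0)

# cuDriverGetVersion reports the CUDA version the installed driver supports, e.g. 11000 for 11.0.
libcuda.cuDriverGetVersion(ctypes.byref(ver))
driver_api = ver.value

# cudaRuntimeGetVersion reports the toolkit/runtime version, e.g. 10000 for 10.0.
libcudart.cudaRuntimeGetVersion(ctypes.byref(ver))
runtime_api = ver.value

print(f"driver API:  {driver_api // 1000}.{(driver_api % 1000) // 10}")
print(f"runtime API: {runtime_api // 1000}.{(runtime_api % 1000) // 10}")
print("driver new enough for runtime:", driver_api >= runtime_api)
On this machine I would expect it to report 11.0 for the driver API and 10.0 for the runtime, which would make the check above pass.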
After compiling TVM from source with CUDA support enabled, I attempted to run the following debug code:
import tvm
print(tvm.gpu(0).exist)
print(tvm.gpu(0).compute_version)
This prints False and raises:
Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading == false: CUDA: CUDA driver version is insufficient for CUDA runtime version
I can’t follow the reasoning behind this error message (specifically the claim that the CUDA driver version is insufficient for the CUDA runtime version). Given the version check above, that doesn’t seem correct to me. Am I perhaps missing something?
After some more tinkering I haven’t made any headway, and I think I’ve ruled out every possibility except for a bug in TVM itself. So I’ve filed an issue in the TVM GH issue tracker here. Will update this thread with the solution if one arrives in that thread.
nvidia-smi prints the CUDA version as 11.0 but nvcc shows 10.0, which indicates some inconsistency between the two.
If you have a clean AWS T4 instance, you may follow the instructions on NVIDIA’s official website (e.g. https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=2004&target_type=debnetwork) to install CUDA. It should come with the proper driver version too, so you don’t have to install your own.
Would you mind also posting the output from TVM’s cmake logs? And could you post the result of ldd /path/to/tvm/build/libtvm.so?
Thanks a lot!
nvidia-smi prints the CUDA version as 11.0 but nvcc shows 10.0, which indicates some inconsistency between the two.
Ah. I will admit that I don’t understand the meaning of the CUDA Version output by nvidia-smi (or, more concretely, I don’t understand its relationship to the Linux driver version, e.g. 410.48 here).
Here is a gist with the full output of the install script, including the TVM cmake logs.
Here is a link to the install script.
Thanks for the info!
Would you mind running the following command?
ldd /path/to/tvm/build/libtvm.so
It will show which CUDA runtime TVM is linked to.
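For convenience, a rough sketch (the path is a placeholder) that shells out to ldd and keeps only the CUDA-related lines:
import subprocess

# Run ldd on the TVM shared library and print only entries mentioning "cuda",
# i.e. which libcudart/libnvrtc/libcuda the loader actually resolves.
out = subprocess.run(
    ["ldd", "/path/to/tvm/build/libtvm.so"],  # adjust to your build directory
    check=True, capture_output=True, text=True,
).stdout

for line in out.splitlines():
    if "cuda" in line.lower():
        print(line.strip())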
Yep (one moment please, the machine went down so I’m reinitializing it).
Yeah, every time I encounter “CUDA driver version is insufficient for CUDA runtime version” I have to reinstall CUDA; just want to double-check and confirm that’s the case here.
Here is the output of the ldd command (ldd /tmp/tvm/build/libtvm.so):
linux-vdso.so.1 (0x00007fffe32d6000)
libnvrtc.so.10.0 => /usr/local/cuda/lib64/libnvrtc.so.10.0 (0x00007ff75f21d000)
libLLVM-6.0.so.1 => /usr/lib/llvm-6.0/lib/libLLVM-6.0.so.1 (0x00007ff75b781000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff75b57d000)
libcudart.so.10.0 => /usr/local/cuda/lib64/libcudart.so.10.0 (0x00007ff75b303000)
libcuda.so.1 => /usr/local/cuda/targets/x86_64-linux/lib/stubs/libcuda.so.1 (0x00007ff75b0f7000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff75aed8000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ff75ab4f000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff75a7b1000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ff75a599000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff75a1a8000)
/lib64/ld-linux-x86-64.so.2 (0x00007ff7620f1000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ff759fa0000)
libffi.so.6 => /usr/lib/x86_64-linux-gnu/libffi.so.6 (0x00007ff759d98000)
libedit.so.2 => /usr/lib/x86_64-linux-gnu/libedit.so.2 (0x00007ff759b61000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ff759944000)
libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007ff75971a000)
Edit: cat /usr/local/cuda/version.txt says CUDA Version 10.0.130.
Thanks for the reply! It looks like they are using 10.0 everywhere while the system comes with a slightly different version… I have to admit it happens quite frequently, but I don’t know why.
Would you like to follow the instructions here (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=debnetwork) to reinstall CUDA?
It looks like they are using 10.0 everywhere while the system comes with a slightly different version…
How can you tell?
Would you like to follow the instructions here (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=debnetwork) to reinstall CUDA?
Will attempt this and post the result in this thread in a few. Thanks for your help so far! I appreciate it.
I encountered the same issue (on my own computer, not on AWS) and solved it. Maybe my experience will be helpful to you.
Could you check the RPATH of libtvm.so, either with chrpath -l libtvm.so or by inspecting build.ninja:
build libtvm.so: CXX_SHARED_LIBRARY_LINKER__tvm ...
LANGUAGE_COMPILE_FLAGS = -std=c++14 -faligned-new -O2 -Wall -fPIC
LINK_LIBRARIES = -Wl,-rpath,/usr/local/cuda-11.2/lib64:/usr/lib/llvm-10/lib:/usr/local/cuda-11.2/targets/x86_64-linux/lib/stubs: ...
OBJECT_DIR = CMakeFiles/tvm.dir
POST_BUILD = :
PRE_LINK = :
SONAME = libtvm.so
SONAME_FLAG = -Wl,-soname,
TARGET_COMPILE_PDB = CMakeFiles/tvm.dir/
TARGET_FILE = libtvm.so
TARGET_PDB = libtvm.pdb
Look at it: /usr/local/cuda-11.2/targets/x86_64-linux/lib/stubs should not live in the RPATH. After removing it from the RPATH of libtvm.so and libtvm_runtime.so, everything worked.
But when I run ldd ./libtvm.so, libcuda.so.1 is resolved correctly (even with the wrong RPATH). This led me down the wrong path, which is weird.
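For reference, a rough sketch of how the stubs entry could be stripped from an already-built library’s RPATH (it assumes patchelf is available; the library path is a placeholder). Fixing the CMake/linker configuration and rebuilding is the cleaner, more durable fix:
import subprocess

lib = "/path/to/tvm/build/libtvm.so"  # repeat for libtvm_runtime.so

# Read the current RPATH/RUNPATH of the shared library.
rpath = subprocess.run(
    ["patchelf", "--print-rpath", lib],
    check=True, capture_output=True, text=True,
).stdout.strip()

# Drop every entry that points at a CUDA stubs directory.
kept = [p for p in rpath.split(":") if p and "stubs" not in p]
new_rpath = ":".join(kept)

if new_rpath != rpath:
    subprocess.run(["patchelf", "--set-rpath", new_rpath, lib], check=True)
    print("RPATH rewritten:", new_rpath)
else:
    print("No stubs directory in RPATH; nothing to do.")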
We eventually upgraded our entire image with new versions of PyTorch and CUDA and the like, and something somewhere in that changeset resolved the problem.