nvidia-smi reports that my current environment (an NVIDIA T4 on AWS) has driver version 450.80.02 installed:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
nvcc reports runtime (toolkit) version 10.0.130:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
The toolkit-to-driver compatibility chart in NVIDIA’s docs indicates that runtime version 10.0.130 requires driver version >= 410.48. Since 450.80.02 >= 410.48, the system should be in order.
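As a side check, here is a minimal sketch (my own, not from TVM) that queries the driver API and runtime API versions directly with ctypes; it assumes libcuda.so.1 and libcudart.so.10.0 can be found by the dynamic loader:
import ctypes

# The driver API library ships with the NVIDIA driver; the runtime library ships with the toolkit.
libcuda = ctypes.CDLL("libcuda.so.1")
libcudart = ctypes.CDLL("libcudart.so.10.0")

ver = ctypes.c_int(0)

# cuDriverGetVersion reports the CUDA version the installed driver supports, e.g. 11000 for 11.0.
libcuda.cuDriverGetVersion(ctypes.byref(ver))
driver_api = ver.value

# cudaRuntimeGetVersion reports the toolkit/runtime version, e.g. 10000 for 10.0.
libcudart.cudaRuntimeGetVersion(ctypes.byref(ver))
runtime_api = ver.value

print(f"driver API:  {driver_api // 1000}.{(driver_api % 1000) // 10}")
print(f"runtime API: {runtime_api // 1000}.{(runtime_api % 1000) // 10}")
print("driver new enough for runtime:", driver_api >= runtime_api)
On this machine I would expect it to report 11.0 for the driver API and 10.0 for the runtime, which would make the check above pass.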
After compiling TVM from source with CUDA support enabled, I attempted to run the following debug code:
import tvm
print(tvm.gpu(0).exist)
print(tvm.gpu(0).compute_version)
This prints False and raises:
Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading == false: CUDA: CUDA driver version is insufficient for CUDA runtime version
I can’t follow the reasoning behind this error message (specifically the claim that the CUDA driver version is insufficient for the CUDA runtime version). Given the version check above, that doesn’t seem correct to me. Am I perhaps missing something?
After some more tinkering I haven’t made any headway, and I think I’ve ruled out every possibility except for a bug in TVM itself. So I’ve filed an issue in the TVM GH issue tracker here. Will update this thread with the solution if one arrives in that thread.
nvidia-smi prints the CUDA version as 11.0 but nvcc shows 10.0, which indicates some inconsistency between the two.
If you have a clean AWS T4 instance, you may follow the instructions on NVIDIA’s official website (e.g. https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=2004&target_type=debnetwork) to install CUDA. It should come with the proper driver version too, so you don’t have to install your own.
Would you mind also posting the output from TVM’s cmake logs? And could you post the result of ldd /path/to/tvm/build/libtvm.so?
Thanks a lot!
nvidia-smi prints the CUDA version as 11.0 but nvcc shows 10.0, which indicates some inconsistency between the two.
Ah. I will admit that I don’t understand the meaning of the CUDA Version output by nvidia-smi (or, more concretely, I don’t understand its relationship to the Linux driver version, e.g. 410.48 here).
Here is a gist with the full output of the install script, including the TVM cmake logs.
Here is a link to the install script.
Thanks for the info!
Would you mind running the following command?
ldd /path/to/tvm/build/libtvm.so
It will show which CUDA runtime TVM is linked to.
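For convenience, a rough sketch (the path is a placeholder) that shells out to ldd and keeps only the CUDA-related lines:
import subprocess

# Run ldd on the TVM shared library and print only entries mentioning "cuda",
# i.e. which libcudart/libnvrtc/libcuda the loader actually resolves.
out = subprocess.run(
    ["ldd", "/path/to/tvm/build/libtvm.so"],  # adjust to your build directory
    check=True, capture_output=True, text=True,
).stdout

for line in out.splitlines():
    if "cuda" in line.lower():
        print(line.strip())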
Yep (one moment please, the machine went down so I’m reinitializing it).
Yeah, every time I encounter “CUDA driver version is insufficient for CUDA runtime version” I have to reinstall CUDA; just want to double-check and confirm that’s the case here.
Here is the output of the ldd command (ldd /tmp/tvm/build/libtvm.so):
linux-vdso.so.1 (0x00007fffe32d6000)
libnvrtc.so.10.0 => /usr/local/cuda/lib64/libnvrtc.so.10.0 (0x00007ff75f21d000)
libLLVM-6.0.so.1 => /usr/lib/llvm-6.0/lib/libLLVM-6.0.so.1 (0x00007ff75b781000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff75b57d000)
libcudart.so.10.0 => /usr/local/cuda/lib64/libcudart.so.10.0 (0x00007ff75b303000)
libcuda.so.1 => /usr/local/cuda/targets/x86_64-linux/lib/stubs/libcuda.so.1 (0x00007ff75b0f7000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff75aed8000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ff75ab4f000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff75a7b1000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ff75a599000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff75a1a8000)
/lib64/ld-linux-x86-64.so.2 (0x00007ff7620f1000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ff759fa0000)
libffi.so.6 => /usr/lib/x86_64-linux-gnu/libffi.so.6 (0x00007ff759d98000)
libedit.so.2 => /usr/lib/x86_64-linux-gnu/libedit.so.2 (0x00007ff759b61000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ff759944000)
libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007ff75971a000)
Edit: cat /usr/local/cuda/version.txt says CUDA Version 10.0.130.
Thanks for the reply! It looks like they are using 10.0 everywhere while the system comes with a slightly different version… I have to admit it happens quite frequently, but I don’t know why.
Would you like to follow the instructions here (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=debnetwork) to reinstall CUDA?
It looks like they are using 10.0 everywhere while the system comes with a slightly different version…
How can you tell?
Would you like to follow the instructions here (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=debnetwork) to reinstall CUDA?
Will attempt this and post the result in this thread in a few. Thanks for your help so far! I appreciate it.
I encountered the same issue (on my own computer, not on AWS) and solved it. Maybe my experience will be helpful to you.
Could you check the RPATH of libtvm.so, either with chrpath -l libtvm.so or by inspecting build.ninja:
build libtvm.so: CXX_SHARED_LIBRARY_LINKER__tvm ...
LANGUAGE_COMPILE_FLAGS = -std=c++14 -faligned-new -O2 -Wall -fPIC
LINK_LIBRARIES = -Wl,-rpath,/usr/local/cuda-11.2/lib64:/usr/lib/llvm-10/lib:/usr/local/cuda-11.2/targets/x86_64-linux/lib/stubs: ...
OBJECT_DIR = CMakeFiles/tvm.dir
POST_BUILD = :
PRE_LINK = :
SONAME = libtvm.so
SONAME_FLAG = -Wl,-soname,
TARGET_COMPILE_PDB = CMakeFiles/tvm.dir/
TARGET_FILE = libtvm.so
TARGET_PDB = libtvm.pdb
Look at it: /usr/local/cuda-11.2/targets/x86_64-linux/lib/stubs should not live in the RPATH. After removing it from the RPATH of libtvm.so and libtvm_runtime.so, everything worked.
But when I run ldd ./libtvm.so, libcuda.so.1 is resolved correctly (even with the wrong RPATH). This led me down the wrong path, which is weird.
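For reference, a rough sketch of how the stubs entry could be stripped from an already-built library’s RPATH (it assumes patchelf is available; the library path is a placeholder). Fixing the CMake/linker configuration and rebuilding is the cleaner, more durable fix:
import subprocess

lib = "/path/to/tvm/build/libtvm.so"  # repeat for libtvm_runtime.so

# Read the current RPATH/RUNPATH of the shared library.
rpath = subprocess.run(
    ["patchelf", "--print-rpath", lib],
    check=True, capture_output=True, text=True,
).stdout.strip()

# Drop every entry that points at a CUDA stubs directory.
kept = [p for p in rpath.split(":") if p and "stubs" not in p]
new_rpath = ":".join(kept)

if new_rpath != rpath:
    subprocess.run(["patchelf", "--set-rpath", new_rpath, lib], check=True)
    print("RPATH rewritten:", new_rpath)
else:
    print("No stubs directory in RPATH; nothing to do.")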
We eventually upgraded our entire image with new versions of PyTorch and CUDA and the like, and something somewhere in that changeset resolved the problem.