I’m currently benchmarking a model that uses NMS with the CUDA target. After auto-tuning, 98% of inference time is spent in the NMS operator (1.8 s on an NVIDIA 1050 Ti). Do we have an optimized CUDA implementation of NMS?
Set `USE_THRUST` when building TVM. This enables CUDA Thrust, which hugely accelerates NMS.
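For anyone else trying this, the relevant flags live in `config.cmake`; a minimal sketch, assuming a standard TVM checkout with an out-of-tree `build/` directory:

```cmake
# build/config.cmake — sketch, assuming a standard TVM source checkout.
# Thrust ships with the CUDA toolkit, so the CUDA backend must be on too.
set(USE_CUDA ON)
set(USE_THRUST ON)  # enables Thrust-backed sort/scan kernels used by NMS
```

After editing, re-run `cmake` and rebuild; `nvcc` needs to be available when configuring for the Thrust path to compile.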
I didn’t know about this, I’ll give it a try, thanks!
@kevinthesun it worked, thank you! Now the same NMS call takes around 2 ms.
To make Thrust work, I had to modify a few things, though. I’ll document the process here in case someone else faces a similar issue:
- To build with `USE_THRUST`, I had to upgrade CMake to >= 3.13, and this seems to break CUDA in TVM.
- Whenever I use CMake 3.13 or newer, with or without `USE_THRUST`, I get the following error when trying to run inference on a previously compiled model:

  ```
  CUDA: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: unknown error
  ```

  To fix this error, you need access to the NVIDIA driver, which I didn’t have from within the Docker container I was using. Once the inference is run from an environment that has access to the driver (you can check whether the `nvidia-smi` command works), it runs fine.
- Or I get the following error when trying to compile a model using Relay:

  ```
  ValueError: arch(sm_xy) is not passed, and we cannot detect it from env
  ```

  To fix this error, you can refer to [SOLVED] Compile error related to autotvm.
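For the `sm_xy` error, one workaround (an assumption on my part, not necessarily the only fix) is to spell out the compute capability in the CUDA target string rather than letting TVM detect it from the environment. `cuda_target_with_arch` is just a hypothetical helper name, and `sm_61` matches a GTX 1050 Ti, so adjust it for your GPU:

```python
def cuda_target_with_arch(arch: str) -> str:
    """Build a TVM CUDA target string with an explicit GPU architecture,
    so TVM does not have to detect it from the environment."""
    return f"cuda -arch={arch}"

# sm_61 is the compute capability of a GTX 1050 Ti (assumption; recent
# drivers can report yours via `nvidia-smi --query-gpu=compute_cap --format=csv`)
target = cuda_target_with_arch("sm_61")
# then used as the target for compilation, e.g. relay.build(..., target=target)
```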
I met the same error when enabling THRUST, even though I do have access to the driver (I’m root and `nvidia-smi` works):

```
CUDA: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: unknown error
```
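In case it helps narrow this down: `nvidia-smi` working only proves the driver tool runs; it doesn’t guarantee the driver library is loadable from the process TVM runs in (e.g. inside a container). A hedged, TVM-independent check might look like this (function names are mine, not a TVM API):

```python
import ctypes
import shutil


def driver_library_loadable() -> bool:
    """True if the CUDA driver library can be dlopen'ed from this process."""
    try:
        ctypes.CDLL("libcuda.so.1")
        return True
    except OSError:
        return False


def nvidia_smi_on_path() -> bool:
    """True if the nvidia-smi binary is visible on PATH."""
    return shutil.which("nvidia-smi") is not None


print("driver library loadable:", driver_library_loadable())
print("nvidia-smi on PATH:", nvidia_smi_on_path())
```

If the library fails to load while `nvidia-smi` works on the host, the container is likely missing the driver mount.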