I’m currently benchmarking a model that uses NMS with the CUDA target. After auto-tuning, 98% of inference time is spent in the NMS operator (1.8 s on an NVIDIA 1050 Ti). Do we have an optimized CUDA implementation of NMS?
Set `USE_THRUST` when building TVM. This enables CUDA Thrust, which hugely accelerates NMS.
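For anyone else trying this, the relevant flags live in `config.cmake`; a minimal sketch, assuming a standard TVM checkout with an out-of-tree `build/` directory:

```cmake
# build/config.cmake — sketch, assuming a standard TVM source checkout.
# Thrust ships with the CUDA toolkit, so the CUDA backend must be on too.
set(USE_CUDA ON)
set(USE_THRUST ON)  # enables Thrust-backed sort/scan kernels used by NMS
```

After editing, re-run `cmake` and rebuild; `nvcc` needs to be available when configuring for the Thrust path to compile.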
I didn’t know about this, I’ll give it a try, thanks!
@kevinthesun it worked, thank you! Now the same NMS call takes around 2 ms.
To make Thrust work, I had to modify a few things, though. I’ll document the process here in case someone else faces a similar issue:
- To build with `USE_THRUST`, I had to upgrade CMake to >= 3.13, and this seems to break CUDA in TVM.
- Whenever I use CMake 3.13 or newer, with or without `USE_THRUST`, I get the following error when trying to run inference on a previously compiled model:

  ```
  CUDA: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: unknown error
  ```

  To fix this error, you need access to the NVIDIA driver, which I didn’t have from within the Docker container I was using. Once the inference is run from an environment that has access to the driver (you can check whether the `nvidia-smi` command works), it runs fine.
- Or I get the following error when trying to compile a model using Relay:

  ```
  ValueError: arch(sm_xy) is not passed, and we cannot detect it from env
  ```

  To fix this error, you can refer to [SOLVED] Compile error related to autotvm.
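For the `sm_xy` error, one workaround (an assumption on my part, not necessarily the only fix) is to spell out the compute capability in the CUDA target string rather than letting TVM detect it from the environment. `cuda_target_with_arch` is just a hypothetical helper name, and `sm_61` matches a GTX 1050 Ti, so adjust it for your GPU:

```python
def cuda_target_with_arch(arch: str) -> str:
    """Build a TVM CUDA target string with an explicit GPU architecture,
    so TVM does not have to detect it from the environment."""
    return f"cuda -arch={arch}"

# sm_61 is the compute capability of a GTX 1050 Ti (assumption; recent
# drivers can report yours via `nvidia-smi --query-gpu=compute_cap --format=csv`)
target = cuda_target_with_arch("sm_61")
# then used as the target for compilation, e.g. relay.build(..., target=target)
```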
I met the same error when enabling THRUST, even though I do have access to the driver (I’m root and `nvidia-smi` works):

```
CUDA: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: unknown error
```
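In case it helps narrow this down: `nvidia-smi` working only proves the driver tool runs; it doesn’t guarantee the driver library is loadable from the process TVM runs in (e.g. inside a container). A hedged, TVM-independent check might look like this (function names are mine, not a TVM API):

```python
import ctypes
import shutil


def driver_library_loadable() -> bool:
    """True if the CUDA driver library can be dlopen'ed from this process."""
    try:
        ctypes.CDLL("libcuda.so.1")
        return True
    except OSError:
        return False


def nvidia_smi_on_path() -> bool:
    """True if the nvidia-smi binary is visible on PATH."""
    return shutil.which("nvidia-smi") is not None


print("driver library loadable:", driver_library_loadable())
print("nvidia-smi on PATH:", nvidia_smi_on_path())
```

If the library fails to load while `nvidia-smi` works on the host, the container is likely missing the driver mount.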