AutoTVM with cuDNN fails with TVMError: Check failed: e == CUDNN_STATUS_SUCCESS (2 vs. 0) cuDNN: CUDNN_STATUS_ALLOC_FAILED

Hi, I'm auto-tuning an Inception-v3 model to compare TVM's performance on an NVIDIA GPU against TensorFlow. I'm using TF 1.4.1 in an Ubuntu 16.04 Docker image (specifically, docker pull tensorflow/tensorflow:1.4.1-devel-gpu-py3), with CUDA 8.0, cuDNN 6, GCC 5.4.0 and LLVM 6.
I downloaded the latest TVM and followed the installation guide. The scripts from the tutorials run fine except for some warning messages like "WARNING:autotvm:Cannot find config for target=cuda ...", so I tried auto-tuning with target = tvm.target.create('cuda -libs=cudnn -model=p40'), and then I got this:

[09:47:40] /usr/tvm/src/contrib/cudnn/conv_forward.cc:243: CUDNN Found 8 fwd algorithms, choosing CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM
[09:47:40] /usr/tvm/src/contrib/cudnn/conv_forward.cc:246: 0) CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM - time: 0.270336 ms, Memory: 0
[09:47:40] /usr/tvm/src/contrib/cudnn/conv_forward.cc:246: 1) CUDNN_CONVOLUTION_FWD_ALGO_GEMM - time: 0.272384 ms, Memory: 524288
[09:47:40] /usr/tvm/src/contrib/cudnn/conv_forward.cc:246: 2) CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM - time: 0.344064 ms, Memory: 392
[09:47:40] /usr/tvm/src/contrib/cudnn/conv_forward.cc:246: 3) CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING - time: 0.847872 ms, Memory: 55914912
[09:47:40] /usr/tvm/src/contrib/cudnn/conv_forward.cc:246: 4) CUDNN_CONVOLUTION_FWD_ALGO_FFT - time: 22.5925 ms, Memory: 912261120
[09:47:40] /usr/tvm/src/contrib/cudnn/conv_forward.cc:246: 5) CUDNN_CONVOLUTION_FWD_ALGO_DIRECT - time: -1 ms, Memory: 0
[09:47:40] /usr/tvm/src/contrib/cudnn/conv_forward.cc:246: 6) CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD - time: -1 ms, Memory: 0
[09:47:40] /usr/tvm/src/contrib/cudnn/conv_forward.cc:246: 7) CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED - time: -1 ms, Memory: 0
[Task 1/43] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (1/10) | 0.95 s Done.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/autotvm/tuner/xgboost_cost_model.py", line 326, in _extract_itervar_feature_log
    sch, args = inp.task.instantiate(config)
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/autotvm/task/task.py", line 65, in instantiate
    sch, arg_bufs = self.func(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/autotvm/task/topi_integration.py", line 133, in _topi_nn_conv2d
    C = topi.nn.conv2d(*args, **kwargs)
  File "", line 2, in conv2d
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/target.py", line 356, in dispatch_func
    return dispatch_dict[k](*args, **kwargs)
  File "", line 2, in config_dispatcher
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/autotvm/task/dispatcher.py", line 204, in dispatch_func
    return dispatch_dict[cfg.template_key](cfg, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/autotvm/task/topi_integration.py", line 267, in template_call
    node = f(cfg, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/topi-0.5.dev0-py3.5.egg/topi/cuda/conv2d.py", line 86, in conv2d_cuda
    algo=-1) # let CUDNN choose the best algo
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/contrib/cudnn.py", line 353, in conv2d_forward
    oshape)
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/contrib/cudnn.py", line 284, in conv2d_find_algo
    int(y_shape[3]))
  File "tvm/_ffi/_cython/./function.pxi", line 286, in tvm._ffi._cy3.core.FunctionBase.call
  File "tvm/_ffi/_cython/./function.pxi", line 231, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 151, in tvm._ffi._cy3.core.CALL
tvm._ffi.base.TVMError: [09:47:44] /usr/tvm/src/contrib/cudnn/conv_forward.cc:229: Check failed: e == CUDNN_STATUS_SUCCESS (2 vs. 0) cuDNN: CUDNN_STATUS_ALLOC_FAILED

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/libtvm.so(+0x73261d) [0x7f7029a6d61d]
[bt] (1) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/libtvm.so(+0xec1280) [0x7f702a1fc280]
[bt] (2) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/libtvm.so(+0xec1f74) [0x7f702a1fcf74]
[bt] (3) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/libtvm.so(TVMFuncCall+0x5e) [0x7f702a17da8e]
[bt] (4) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/_ffi/_cy3/core.cpython-35m-x86_64-linux-gnu.so(+0x1862d) [0x7f6fa60de62d]
[bt] (5) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/_ffi/_cy3/core.cpython-35m-x86_64-linux-gnu.so(+0x18d1b) [0x7f6fa60ded1b]
[bt] (6) python3(PyObject_Call+0x47) [0x5c1797]
[bt] (7) python3(PyEval_EvalFrameEx+0x4ec6) [0x53bba6]
[bt] (8) python3(PyEval_EvalFrameEx+0x4b04) [0x53b7e4]
[bt] (9) python3() [0x5406df]

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "tune_relay_cuda.py", line 242, in
    tune_and_evaluate(tuning_option)
  File "tune_relay_cuda.py", line 211, in tune_and_evaluate
    tune_tasks(tasks, **tuning_opt)
  File "tune_relay_cuda.py", line 184, in tune_tasks
    tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/autotvm/tuner/model_based_tuner.py", line 272, in load_history
    success = base_model.fit_log(data_set, self.plan_size)
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/autotvm/tuner/xgboost_cost_model.py", line 223, in fit_log
    res = pool.map(feature_extract_func, data)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
tvm._ffi.base.TVMError: [09:47:44] /usr/tvm/src/contrib/cudnn/conv_forward.cc:229: Check failed: e == CUDNN_STATUS_SUCCESS (2 vs. 0) cuDNN: CUDNN_STATUS_ALLOC_FAILED

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/libtvm.so(+0x73261d) [0x7f7029a6d61d]
[bt] (1) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/libtvm.so(+0xec1280) [0x7f702a1fc280]
[bt] (2) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/libtvm.so(+0xec1f74) [0x7f702a1fcf74]
[bt] (3) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/libtvm.so(TVMFuncCall+0x5e) [0x7f702a17da8e]
[bt] (4) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/_ffi/_cy3/core.cpython-35m-x86_64-linux-gnu.so(+0x1862d) [0x7f6fa60de62d]
[bt] (5) /usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/_ffi/_cy3/core.cpython-35m-x86_64-linux-gnu.so(+0x18d1b) [0x7f6fa60ded1b]
[bt] (6) python3(PyObject_Call+0x47) [0x5c1797]
[bt] (7) python3(PyEval_EvalFrameEx+0x4ec6) [0x53bba6]
[bt] (8) python3(PyEval_EvalFrameEx+0x4b04) [0x53b7e4]
[bt] (9) python3() [0x5406df]

Any advice?

My Docker environment is as follows:
CUDA_CUDNN_LIBRARY=/usr/local/cuda-8.0/lib64/libcudnn.so
CUDNN_VERSION=6.0.21
NVIDIA_REQUIRE_CUDA=cuda>=8.0
TF_NEED_CUDA=1
LIBRARY_PATH=/usr/local/cuda-8.0/lib64:/usr/local/cuda/lib64/stubs:
NVIDIA_VISIBLE_DEVICES=all
LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_DRIVER_CAPABILITIES=compute,utility
PATH=/usr/local/llvm/bin:/usr/cmake-3.8.0-Linux-x86_64/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
CUDA_PKG_VERSION=8-0=8.0.61-1
CUDA_VERSION=8.0.61
HOME=/root
BAZEL_VERSION=0.5.4
CI_BUILD_PYTHON=python3
TF_CUDA_COMPUTE_CAPABILITIES=3.0,3.5,5.2,6.0,6.1
_=/usr/bin/env

cuDNN is not meant to be used with autotvm.


Then how can I auto-tune Inception-v3 on a CUDA GPU? I thought it was cuDNN that chose the best algorithm.
Thanks.

If you are interested in using cuDNN, you don’t need to use autotvm.

If your goal is to run inception-v3 as fast as possible, you can use autotvm. You can modify this tutorial script https://github.com/dmlc/tvm/blob/master/tutorials/autotvm/tune_relay_cuda.py
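
To make the distinction concrete, here is a minimal sketch of a guard that could run before tuning. It only manipulates the target string and does not call TVM; `split_target_libs` is a hypothetical helper name, and the sketch assumes external libraries are requested through a `-libs=` option, as in the target string from the first post.

```python
def split_target_libs(target):
    """Split a TVM-style target string into (target without -libs, list of libs).

    Hypothetical helper for illustration only; it is not a TVM API.
    """
    kept, libs = [], []
    for token in target.split():
        if token.startswith("-libs="):
            # -libs takes a comma-separated list of library names
            libs.extend(name for name in token[len("-libs="):].split(",") if name)
        else:
            kept.append(token)
    return " ".join(kept), libs


requested = "cuda -libs=cudnn -model=p40"
tuning_target, libs = split_target_libs(requested)
if "cudnn" in libs:
    # Tune with the plain CUDA target; keep -libs=cudnn for non-tuned runs only.
    print("cuDNN requested; tuning with %r instead" % tuning_target)
```

The stripped string is what would be passed to the tuning script, while the original string stays available for a separate cuDNN-backed run to compare against.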

Hi @masahi, running tune_relay_cuda.py produces warnings like:
WARNING:autotvm:Cannot find config for target=cuda

The mean inference time is around 30 ms.

With from_tensorflow.py, which I slightly altered to load Inception-v3 from repo_base = 'https://github.com/dmlc/web-data/raw/master/tensorflow/models/InceptionV3/', the inference time for TVM vs. TensorFlow is 19 ms vs. 12 ms:

Tensorflow protobuf imported to relay frontend.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 2048, 1, 1, 'float32'), (1001, 2048, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 32, 149, 149, 'float32'), (64, 32, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 48, 39, 39, 'float32'), (64, 48, 5, 5, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 96, 37, 37, 'float32'), (96, 96, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 64, 37, 37, 'float32'), (96, 64, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 128, 23, 17, 'float32'), (192, 128, 7, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 128, 17, 23, 'float32'), (128, 128, 1, 7, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 128, 17, 23, 'float32'), (192, 128, 1, 7, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 128, 23, 17, 'float32'), (128, 128, 7, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 160, 23, 17, 'float32'), (192, 160, 7, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 160, 17, 23, 'float32'), (160, 160, 1, 7, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 160, 17, 23, 'float32'), (192, 160, 1, 7, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 160, 23, 17, 'float32'), (160, 160, 7, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 192, 23, 17, 'float32'), (192, 192, 7, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 192, 17, 23, 'float32'), (192, 192, 1, 7, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 384, 8, 10, 'float32'), (384, 384, 1, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 384, 10, 8, 'float32'), (384, 384, 3, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (1, 448, 10, 10, 'float32'), (384, 448, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Evaluate inference time cost...
Mean inference time (std dev): 19.48 ms (0.35 ms)
Indian elephant (score = 0.57388)
black swan (score = 0.34540)
indri (score = 0.01911)
African chameleon (score = 0.00056)
muzzle (score = 0.00042)
2019-02-27 09:20:36.231730: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-02-27 09:20:36.235613: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:02:00.0
totalMemory: 22.38GiB freeMemory: 22.04GiB
2019-02-27 09:20:36.582650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties: 
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:04:00.0
totalMemory: 22.38GiB freeMemory: 818.94MiB
2019-02-27 09:20:36.961547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 2 with properties: 
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:83:00.0
totalMemory: 22.38GiB freeMemory: 22.21GiB
2019-02-27 09:20:37.329960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 3 with properties: 
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:84:00.0
totalMemory: 22.38GiB freeMemory: 22.21GiB
2019-02-27 09:20:37.339719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-02-27 09:20:37.339817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3 
2019-02-27 09:20:37.339835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y N N 
2019-02-27 09:20:37.339880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y N N 
2019-02-27 09:20:37.339893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2:   N N Y Y 
2019-02-27 09:20:37.339917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3:   N N Y Y 
2019-02-27 09:20:37.339958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-02-27 09:20:37.339989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla P40, pci bus id: 0000:04:00.0, compute capability: 6.1)
2019-02-27 09:20:37.340030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla P40, pci bus id: 0000:83:00.0, compute capability: 6.1)
2019-02-27 09:20:37.340046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla P40, pci bus id: 0000:84:00.0, compute capability: 6.1)
===== TENSORFLOW RESULTS =======
Indian elephant (score = 0.57388)
black swan (score = 0.34540)
indri (score = 0.01911)
African chameleon (score = 0.00056)
muzzle (score = 0.00042)
Evaluate tf inference time cost...
Mean tf inference time: 12.89 ms

Is this performance in line with your expectations? Should I be concerned about the warning messages, and if so, how do I fix them to improve TVM's performance?

Have a look at the page below for the expected performance on Inception v3 and how to reproduce those numbers: https://github.com/dmlc/tvm/tree/master/apps/benchmark

If tuning finishes correctly, there should be no warnings like "Cannot find config ...". Tuning usually takes many hours to get good results (around 10 hours for 1000 iterations, on my CPU).
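
One way to check this after a run is to scan the captured console output for the fallback warnings and list the workloads that still have no tuned config. The sketch below uses only the standard library and assumes the warning format shown earlier in this thread; `untuned_workloads` is a hypothetical helper, not part of TVM.

```python
import ast
import re

# Matches the fallback warning format quoted earlier in this thread.
WARNING_RE = re.compile(
    r"WARNING:autotvm:Cannot find config for target=(?P<target>.+?), "
    r"workload=(?P<workload>\(.*\))\. A fallback configuration is used"
)


def untuned_workloads(log_text):
    """Return a (target, workload tuple) pair for every fallback warning."""
    found = []
    for line in log_text.splitlines():
        match = WARNING_RE.search(line)
        if match:
            # The workload is printed as a Python tuple literal, so parse it safely.
            workload = ast.literal_eval(match.group("workload"))
            found.append((match.group("target"), workload))
    return found


sample = (
    "WARNING:autotvm:Cannot find config for target=cuda, "
    "workload=('conv2d', (1, 2048, 1, 1, 'float32'), (1001, 2048, 1, 1, 'float32'), "
    "(1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). "
    "A fallback configuration is used, which may bring great performance regression.\n"
    "Evaluate inference time cost...\n"
)
for target, workload in untuned_workloads(sample):
    print(target, workload[0], workload[1], workload[2])
```

An empty result on the output of a tuned run suggests every workload found a config; any remaining entries name the operators that still use the fallback.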

OK, thanks. I will post an update when I make progress.

The outcome after nearly 24 hours of auto-tuning Inception-v3 from the dmlc/web-data repo (tensorflow/models/InceptionV3 on master) is as follows. Though it still can't match the benchmark, TVM is now slightly faster than TF. Did I miss something for TVM acceleration?

Tensorflow protobuf imported to relay frontend.
Tuning...
[Task 1/44] Current/Best: 1021.87/1219.86 GFLOPS | Progress: (912/1000) | 2150.24 s Done.
[Task 2/44] Current/Best: 1269.74/1557.60 GFLOPS | Progress: (336/1000) | 742.92 s
[Task 2/44] Current/Best: 1373.02/1557.60 GFLOPS | Progress: (624/1000) | 1408.27 s Done.
[Task 3/44] Current/Best: 1116.06/1536.03 GFLOPS | Progress: (1000/1000) | 2229.07 s Done.
[Task 4/44] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/1000) | 0.00 s
[Task 4/44] Current/Best: 766.59/1170.59 GFLOPS | Progress: (1000/1000) | 2381.79 s Done.
[Task 5/44] Current/Best: 3.62/1976.47 GFLOPS | Progress: (288/1000) | 659.02 s
[Task 5/44] Current/Best: 1612.76/2050.53 GFLOPS | Progress: (1000/1000) | 2316.05 s Done.
[Task 6/44] Current/Best: 1662.04/1988.88 GFLOPS | Progress: (528/1000) | 1269.89 s
[Task 6/44] Current/Best: 1764.36/1988.88 GFLOPS | Progress: (1000/1000) | 2410.77 s Done.
[Task 7/44] Current/Best: 1471.26/1655.86 GFLOPS | Progress: (720/1000) | 1754.97 s Done.
[Task 8/44] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/1000) | 0.00 s
[Task 8/44] Current/Best: 1319.26/1759.06 GFLOPS | Progress: (1000/1000) | 2383.12 s Done.
[Task 9/44] Current/Best: 1219.34/1652.44 GFLOPS | Progress: (240/1000) | 585.62 s
[Task 9/44] Current/Best: 1529.01/1652.44 GFLOPS | Progress: (624/1000) | 1526.62 s Done.
[Task 10/44] Current/Best: 682.70/1015.74 GFLOPS | Progress: (720/1000) | 1882.66 s
[Task 10/44] Current/Best: 808.71/1015.74 GFLOPS | Progress: (912/1000) | 2364.10 s Done.
[Task 11/44] Current/Best: 1307.89/1448.62 GFLOPS | Progress: (1000/1000) | 2310.18 s Done.
[Task 12/44] Current/Best: 1349.99/1380.38 GFLOPS | Progress: (48/1000) | 96.47 s
[Task 12/44] Current/Best: 1244.60/1380.38 GFLOPS | Progress: (624/1000) | 1462.93 s Done.
[Task 13/44] Current/Best: 1052.81/1368.10 GFLOPS | Progress: (624/1000) | 1541.47 s
[Task 13/44] Current/Best: 1107.23/1368.10 GFLOPS | Progress: (960/1000) | 2411.49 s Done.
[Task 14/44] Current/Best: 151.41/1378.73 GFLOPS | Progress: (672/1000) | 1639.45 s Done.
[Task 15/44] Current/Best: 12.06/1418.46 GFLOPS | Progress: (96/1000) | 186.28 s
[Task 15/44] Current/Best: 376.89/1418.46 GFLOPS | Progress: (672/1000) | 1473.37 s Done.
[Task 16/44] Current/Best: 1232.89/1379.08 GFLOPS | Progress: (720/1000) | 1675.84 s
[Task 16/44] Current/Best: 536.33/1379.08 GFLOPS | Progress: (864/1000) | 2009.82 s Done.
[Task 17/44] Current/Best: 195.39/1374.31 GFLOPS | Progress: (672/1000) | 1541.32 s Done.
[Task 18/44] Current/Best: 885.00/1217.58 GFLOPS | Progress: (432/1000) | 938.00 s
[Task 18/44] Current/Best: 923.95/1217.58 GFLOPS | Progress: (624/1000) | 1362.16 s Done.
[Task 19/44] Current/Best: 221.10/1371.77 GFLOPS | Progress: (672/1000) | 1534.27 s Done.
[Task 20/44] Current/Best: 152.65/1224.62 GFLOPS | Progress: (336/1000) | 749.70 s
[Task 20/44] Current/Best: 1076.02/1224.62 GFLOPS | Progress: (624/1000) | 1381.18 s Done.
[Task 21/44] Current/Best: 89.39/1229.27 GFLOPS | Progress: (864/1000) | 1918.05 s Done.
[Task 22/44] Current/Best: 9.98/1357.43 GFLOPS | Progress: (96/1000) | 210.07 s
[Task 22/44] Current/Best: 48.29/1357.43 GFLOPS | Progress: (672/1000) | 1566.64 s Done.
[Task 23/44] Current/Best: 3.54/ 974.60 GFLOPS | Progress: (672/1000) | 1575.18 s
[Task 23/44] Current/Best: 146.05/ 974.60 GFLOPS | Progress: (864/1000) | 2066.42 s Done.
[Task 24/44] Current/Best: 1316.98/1518.00 GFLOPS | Progress: (912/1000) | 2270.70 s Done.
[Task 25/44] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/1000) | 0.00 s
[Task 25/44] Current/Best: 790.18/1486.26 GFLOPS | Progress: (720/1000) | 1701.95 s Done.
[Task 26/44] Current/Best: 865.40/1133.76 GFLOPS | Progress: (528/1000) | 1199.72 s
[Task 26/44] Current/Best: 210.40/1133.76 GFLOPS | Progress: (672/1000) | 1536.50 s Done.
[Task 27/44] Current/Best: 97.37/2563.44 GFLOPS | Progress: (864/1000) | 2425.61 s
[Task 27/44] Current/Best: 1877.61/2642.34 GFLOPS | Progress: (1000/1000) | 2781.50 s Done.
[Task 28/44] Current/Best: 2841.73/3217.89 GFLOPS | Progress: (912/1000) | 2438.70 s
[Task 28/44] Current/Best: 2267.63/3217.89 GFLOPS | Progress: (1000/1000) | 2647.75 s Done.
[Task 29/44] Current/Best: 372.36/1403.15 GFLOPS | Progress: (672/1000) | 1673.16 s Done.
[Task 30/44] Current/Best: 5.88/2904.38 GFLOPS | Progress: (288/1000) | 724.57 s
[Task 30/44] Current/Best: 2231.02/3428.36 GFLOPS | Progress: (1000/1000) | 2415.71 s Done.
[Task 31/44] Current/Best: 2.24/1295.15 GFLOPS | Progress: (480/1000) | 1013.80 s
[Task 31/44] Current/Best: 935.58/1295.15 GFLOPS | Progress: (1000/1000) | 2051.61 s Done.
[Task 32/44] Current/Best: 3786.39/5022.14 GFLOPS | Progress: (720/1000) | 1774.86 s
[Task 32/44] Current/Best: 1782.71/5022.14 GFLOPS | Progress: (864/1000) | 2116.42 s Done.
[Task 33/44] Current/Best: 3692.12/5633.19 GFLOPS | Progress: (768/1000) | 1877.25 s Done.
[Task 34/44] Current/Best: 1583.97/1778.25 GFLOPS | Progress: (48/1000) | 97.34 s
[Task 34/44] Current/Best: 1277.49/2024.29 GFLOPS | Progress: (1000/1000) | 2147.75 s Done.
[Task 35/44] Current/Best: 36.45/3177.06 GFLOPS | Progress: (288/1000) | 612.35 s
[Task 35/44] Current/Best: 2571.03/3177.06 GFLOPS | Progress: (816/1000) | 1662.75 s Done.
[Task 36/44] Current/Best: 1140.39/1543.59 GFLOPS | Progress: (624/1000) | 1455.33 s Done.
[Task 37/44] Current/Best: 1372.41/1613.80 GFLOPS | Progress: (624/1000) | 1418.68 s Done.
[Task 38/44] Current/Best: 1231.78/1630.90 GFLOPS | Progress: (240/1000) | 545.57 s
[Task 38/44] Current/Best: 910.44/1630.90 GFLOPS | Progress: (624/1000) | 1547.24 s Done.
[Task 39/44] Current/Best: 1430.40/1659.09 GFLOPS | Progress: (816/1000) | 1787.50 s
[Task 39/44] Current/Best: 1311.10/1659.09 GFLOPS | Progress: (1000/1000) | 2183.07 s Done.
[Task 40/44] Current/Best: 1209.67/1496.45 GFLOPS | Progress: (816/1000) | 1836.55 s Done.
[Task 41/44] Current/Best: 1075.58/1308.51 GFLOPS | Progress: (912/1000) | 2334.24 s Done.
[Task 42/44] Current/Best: 1223.91/1565.64 GFLOPS | Progress: (1000/1000) | 2478.35 s Done.
[Task 43/44] Current/Best: 962.30/1356.68 GFLOPS | Progress: (1000/1000) | 2281.19 s Done.
[Task 44/44] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/1000) | 0.00 s
[Task 44/44] Current/Best: 8.44/ 43.83 GFLOPS | Progress: (816/1000) | 1448.04 s Done.
Compile...
Indian elephant (score = 0.57388)
black swan (score = 0.34540)
indri (score = 0.01911)
African chameleon (score = 0.00056)
muzzle (score = 0.00042)
Evaluate inference time cost...
Mean inference time (std dev): 9.45 ms (0.54 ms)
2019-03-01 10:54:43.792112: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-03-01 10:54:43.793763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:02:00.0
totalMemory: 22.38GiB freeMemory: 22.05GiB
2019-03-01 10:54:44.203407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:04:00.0
totalMemory: 22.38GiB freeMemory: 818.94MiB
2019-03-01 10:54:44.638720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 2 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:83:00.0
totalMemory: 22.38GiB freeMemory: 22.21GiB
2019-03-01 10:54:45.036777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 3 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:84:00.0
totalMemory: 22.38GiB freeMemory: 22.21GiB
2019-03-01 10:54:45.045374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-03-01 10:54:45.045478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3
2019-03-01 10:54:45.045498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0: Y Y N N
2019-03-01 10:54:45.045550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1: Y Y N N
2019-03-01 10:54:45.045564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2: N N Y Y
2019-03-01 10:54:45.045588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3: N N Y Y
2019-03-01 10:54:45.045615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-03-01 10:54:45.045634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla P40, pci bus id: 0000:04:00.0, compute capability: 6.1)
2019-03-01 10:54:45.045650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla P40, pci bus id: 0000:83:00.0, compute capability: 6.1)
2019-03-01 10:54:45.045666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla P40, pci bus id: 0000:84:00.0, compute capability: 6.1)
===== TENSORFLOW RESULTS =======
Indian elephant (score = 0.57388)
black swan (score = 0.34540)
indri (score = 0.01911)
African chameleon (score = 0.00056)
muzzle (score = 0.00042)
Evaluate tf inference time cost...
Mean tf inference time: 12.56 ms
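
A tuning log like the one above is easier to review once it is reduced to the final entry per task. The following is a minimal sketch that does this with the standard library only; `best_per_task` is a hypothetical helper, the regular expression assumes the progress format printed above, and the 100 GFLOPS threshold is an arbitrary illustration rather than a recommended value.

```python
import re

# Matches progress entries such as
# "[Task 44/44] Current/Best: 8.44/ 43.83 GFLOPS | Progress: (816/1000) | 1448.04 s"
PROGRESS_RE = re.compile(
    r"\[Task\s+(\d+)/\s*(\d+)\]\s+Current/Best:\s*([\d.]+)/\s*([\d.]+) GFLOPS"
    r" \| Progress: \((\d+)/(\d+)\)"
)


def best_per_task(log_text):
    """Map task index -> (best GFLOPS, trials run), keeping the last entry per task."""
    summary = {}
    for task, _total, _current, best, done, _budget in PROGRESS_RE.findall(log_text):
        # Later entries overwrite earlier ones, so the final progress line wins.
        summary[int(task)] = (float(best), int(done))
    return summary


sample = (
    "[Task 43/44] Current/Best: 962.30/1356.68 GFLOPS | Progress: (1000/1000) | 2281.19 s Done. "
    "[Task 44/44] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/1000) | 0.00 s "
    "[Task 44/44] Current/Best: 8.44/ 43.83 GFLOPS | Progress: (816/1000) | 1448.04 s Done."
)
summary = best_per_task(sample)
# Flag tasks whose best result is far below the rest (arbitrary threshold).
slow = [task for task, (best, _) in sorted(summary.items()) if best < 100.0]
print(summary, slow)
```

In this log, Task 44 stands out with a best of 43.83 GFLOPS. A low figure is not necessarily a problem, since a small workload may not be able to reach high throughput, but it shows which tasks may be worth a closer look.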

Hi Masahi, thanks a lot for your reply. As I understand it, cuDNN is an already-optimized library, so we don't need to use AutoTVM with cuDNN. But in TVM discussions I see the option -libs=cudnn being used, and I don't understand this usage. Can you help clarify? Thank you.

Hi,
In my opinion, -libs=cudnn means that during auto-tuning, AutoTVM will use cuDNN for some specific operations instead of an auto-tuned config.


No, AutoTVM has nothing to do with cuDNN. cuDNN can be thought of as an alternative convolution backend. If you find that cuDNN kernels are faster than TVM's for your model, you can simply swap the convolution backend by appending -libs=cudnn to the "cuda" target.


Hi Masahi,
Thank you. When -libs=cudnn is used, the cuDNN kernels are not auto-tuned by TVM; the implementation provided by cuDNN is used directly. I think Peterlau123 has the same understanding. Thank you too, Peter.

@masahi Why can't TVM include cuDNN in its search? For example, if the schedules found by TVM's search are no better than cuDNN, TVM could use cuDNN's implementation instead.