Inference time of YOLOv3 too long

I build and save the darknet YOLOv3 model with
target = 'cuda -libs=cudnn'
ctx = tvm.gpu()

Then I use graph_runtime.create to reload and test the model, and found the inference time is too long.

With a 608x608 input:
original darknet costs about 30 ms (GTX 980 Ti)
TVM + cuDNN costs about 300 ms

Measured as:
start = time.clock()
model_handle.run()
print("0 time: ", time.clock() - start)
Most of the time is spent in the run() call above, more than 290 ms.

Please help me check this issue. What am I doing wrong?

Can you share how you are compiling and running the model?
Note that if you add -libs=cudnn, TVM will not generate the GPU kernels itself, so there should be essentially no difference from using darknet directly.

I use the code from the tutorial "Compile YOLO-V2 and YOLO-V3 in DarkNet Models":
https://docs.tvm.ai/tutorials/nnvm/from_darknet.html?highlight=yolo


and set GPU mode:

GPU = 1
if not GPU:
    target = 'llvm'
    ctx = tvm.cpu(0)
else:
    # target = tvm.target.cuda()
    target = 'cuda -libs=cudnn, cublas'
    # target = 'cuda -libs=cudnn'
    ctx = tvm.gpu(0)
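
For reference, the compilation step before saving, roughly as the tutorial does it (a minimal sketch assuming the nnvm-era API; sym and params come from nnvm.frontend.darknet.from_darknet, and data holds the preprocessed 608x608 input):

import nnvm.compiler

# Build the darknet graph for the selected target; the input tensor
# in the tutorial is named 'data'.
shape = {'data': data.shape}
dtype = {'data': 'float32'}
with nnvm.compiler.build_config(opt_level=2):
    graph, lib, params = nnvm.compiler.build(sym, target, shape=shape,
                                             dtype=dtype, params=params)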

Then save the model:

from tvm.contrib import util
path_lib = "./data/deploy_lib.tar"
lib.export_library(path_lib)
with open("./data/deploy_graph.json", "w") as fo:
    fo.write(graph.json())
with open("./data/deploy_param.params", "wb") as fo:
    fo.write(nnvm.compiler.save_param_dict(params))


The last step is to load the saved model again and test:

ctx = tvm.gpu(0)
data = nnvm.testing.darknet.load_image(test_image, netw, neth)

loaded_json = open(graph_file).read()
loaded_lib = tvm.module.load(lib_file)
loaded_params = bytearray(open(params_file, "rb").read())
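
(The snippet stops before the runtime is recreated; for completeness, a minimal sketch of the remaining steps, assuming the graph_runtime API used in the tutorial and that model_handle is the module being timed below:)

from tvm.contrib import graph_runtime

# Recreate the runtime from the reloaded artifacts, then feed params and input.
model_handle = graph_runtime.create(loaded_json, loaded_lib, ctx)
model_handle.load_params(loaded_params)
model_handle.set_input('data', tvm.nd.array(data.astype('float32')))
model_handle.run()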


Also, compiling the model gives a warning:

WARNING:autotvm:Cannot find config for target=cuda -libs=cudnn, cublas, workload=('conv2d', (1, 256, 76, 76, 'float32'), (255, 256, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
[18:08:04] /home/tvm/src/contrib/cudnn/conv_forward.cc:243: CUDNN Found 8 fwd algorithms, choosing CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
[18:08:04] /home/tvm/src/contrib/cudnn/conv_forward.cc:246: 0) CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM - time: 0.206624 ms, Memory: 34664
[18:08:04] /home/tvm/src/contrib/cudnn/conv_forward.cc:246: 1) CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM - time: 0.284064 ms, Memory: 0
[18:08:04] /home/tvm/src/contrib/cudnn/conv_forward.cc:246: 2) CUDNN_CONVOLUTION_FWD_ALGO_GEMM - time: 0.443744 ms, Memory: 5914624
[18:08:04] /home/tvm/src/contrib/cudnn/conv_forward.cc:246: 3) CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING - time: 0.776032 ms, Memory: 19442304
[18:08:04] /home/tvm/src/contrib/cudnn/conv_forward.cc:246: 4) CUDNN_CONVOLUTION_FWD_ALGO_FFT - time: -1 ms, Memory: 8724545536
[18:08:04] /home/tvm/src/contrib/cudnn/conv_forward.cc:246: 5) CUDNN_CONVOLUTION_FWD_ALGO_DIRECT - time: -1 ms, Memory: 0
[18:08:04] /home/tvm/src/contrib/cudnn/conv_forward.cc:246: 6) CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD - time: -1 ms, Memory: 0
[18:08:04] /home/tvm/src/contrib/cudnn/conv_forward.cc:246: 7) CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED - time: -1 ms, Memory: 0

The autotvm warning should not be an issue as -libs=cudnn is being used. Can you try using a time evaluator instead to do the timing? I am not sure if there is some other overhead or if there is some dynamic compilation time being included that only occurs on the first run, and this can affect the timing results with your measurement method. You can use the time evaluator with something like:

f = m.module.time_evaluator('run', ctx)
results = f()

where results should give you the running time in seconds. I get ~17ms out-of-the-box with cuDNN on an RTX 2080 Ti.
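
If a single measurement is noisy, the evaluator can also average over several runs (a sketch; number/repeat are standard time_evaluator arguments, and m is the graph runtime module from the tutorial):

import numpy as np

# Run the graph 10 times per measurement and repeat the measurement 3 times.
ftimer = m.module.time_evaluator('run', ctx, number=10, repeat=3)
prof_res = np.array(ftimer().results) * 1000  # seconds -> milliseconds
print('Mean inference time: %.2f ms (std %.2f ms)' % (prof_res.mean(), prof_res.std()))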

I got the results below:

0 time: 0.32214300000000007
evaluator------------> cost time: ProfileResult(mean=0.0375652296, results=(0.0375652296,))

"0 time" is calculated as below:
start = time.clock()
model_handle.run()
tvm.gpu(0).sync()
print("0 time: ", time.clock() - start)

Why is that?

As I tested, like you say, the first run costs too much time.

That is fine, but what are the results you get with later runs? That should agree with the time evaluator result, which ignores the first run.

Note that the first run can be slower for many different reasons such as JIT compilation, etc. This is typical and expected behavior for many different framework/backend combinations.
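
If you want to exclude the first run with manual timing as well, a minimal sketch, reusing the model_handle and ctx from the posts above:

import time

# One warm-up run absorbs one-time costs (context setup, lazy initialization).
model_handle.run()
ctx.sync()

# Time a steady-state run; sync so the GPU work is actually finished.
start = time.time()
model_handle.run()
ctx.sync()
print("steady-state time: %.1f ms" % ((time.time() - start) * 1000))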

Yes, excluding the first run, the others are the same as the evaluator result. Thanks!

The "autotvm:Cannot find config for target=cuda" warning is due to the op name used when extracting the tasks in the tutorial script tune_relay_cuda.py. In my case I changed it to relay.nn.conv2d, and the auto-tuning works fine now.
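
For reference, a rough sketch of that change in the task extraction step (the exact extract_from_program signature varies across TVM versions; net and params here are the converted Relay function and parameters from the frontend):

from tvm import autotvm, relay

# Extract the tunable conv2d tasks; the ops tuple selects which operators to tune.
tasks = autotvm.task.extract_from_program(net, target=target,
                                          params=params,
                                          ops=(relay.nn.conv2d,))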

This is unrelated to their issue, as they are using the cuDNN operators, not AutoTVM (autotuned) ones.

I use target = 'cuda', but inference is very slow when measured with m.module.time_evaluator. When I change to target = 'cuda -libs=cudnn, cublas', it raises ValueError: Cannot find global function tvm.contrib.cudnn.conv2d.output_shape.

What is wrong? Thanks!!

You are offloading library calls to cudnn when you use -libs=cudnn, so there is no support for unhandled calls. What model are you using?

It may be caused by your TVM build not having cuDNN support.
Try switching the USE_CUDNN and USE_CUBLAS flags to ON in config.cmake, and rebuild TVM.
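
To verify that a rebuilt TVM actually has cuDNN registered, a quick check from Python, using the global function name from the error message above:

import tvm

# If this returns None, the TVM build has no cuDNN support compiled in.
f = tvm.get_global_func("tvm.contrib.cudnn.conv2d.output_shape", allow_missing=True)
print("cuDNN support:", f is not None)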