Very slow under Linux CUDA

When I try to convert a model with onnx2tvm, I get this warning: WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d_transpose_nchw', (1, 256, 64, 64, 'float32'), (256, 128, 3, 3, 'float32'), (2, 2), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Also, when I run inference with the converted .so, the run itself is very fast, but module.get_output(0).asnumpy() is very slow, so the total time cost is high.

What's wrong?

Nothing is wrong. When you run the module, you just enqueue all your CUDA kernels into the default CUDA stream. When you then call module.get_output(0).asnumpy(), it performs a CUDA memory copy, which is a synchronous operation, so you wait until all the computation in the default CUDA stream has finished.
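To illustrate why the copy appears slow, here is a plain-Python analogy (not TVM code): a worker thread plays the role of the default CUDA stream, launches return immediately, and the blocking "copy" at the end absorbs the time of all queued work.

```python
import queue
import threading
import time

# Plain-Python analogy of the default CUDA stream: a worker thread
# that executes queued "kernels" strictly in order.
stream = queue.Queue()
results = {}

def stream_worker():
    while True:
        kernel = stream.get()
        if kernel is None:
            break
        kernel()           # run the "kernel"
        stream.task_done()

worker = threading.Thread(target=stream_worker, daemon=True)
worker.start()

def launch(name, seconds):
    # Like running the module: enqueue work and return immediately.
    def kernel():
        time.sleep(seconds)        # pretend computation
        results[name] = name.upper()
    stream.put(kernel)

def get_output(name):
    # Like get_output(0).asnumpy(): a synchronizing copy that must
    # wait for everything already queued on the stream to finish.
    stream.join()
    return results[name]

start = time.time()
launch("conv", 0.1)
launch("dense", 0.1)
enqueue_time = time.time() - start   # returns almost immediately

out = get_output("dense")
total_time = time.time() - start     # pays for all the queued work
```

After this runs, enqueue_time is a few microseconds while total_time is at least 0.2 s: the "slow" synchronizing call is simply where all the compute time becomes visible.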

Then how can I speed it up? The GPU is a Tesla P4.

I mean that the time between the point when you launch the run and the point when you get the result from module.get_output(0).asnumpy() is the time your module needs to finish all the computation, so the copy itself is not something you can speed up.

I mean, why is TVM so slow (x s) compared with TensorRT, which costs only x ms?

Try using autotvm to find the CUDA configurations for your network and hardware, which will be stored in a log file. Then apply that log file when you compile the model.
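A minimal sketch of that compile step, assuming the tuning log is named tune.log and the model has already been imported into Relay as mod and params (those names and the file name are placeholders):

```python
# Sketch only: requires TVM built with CUDA support and an
# existing autotvm tuning log. "tune.log", mod, and params
# are placeholders for your own log file and imported model.
import tvm
from tvm import relay, autotvm

target = "cuda"

# Apply the best schedules found during tuning. Without this
# context, relay.build falls back to default configurations,
# which is what produces the "Cannot find config" warning
# and the slow kernels.
with autotvm.apply_history_best("tune.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```

The key point is that tuning alone is not enough: the log must be in scope (via apply_history_best) at build time, or the tuned schedules are ignored.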

Hello! I am facing a similar problem!

I used autotvm to tune a CNN model trained with TensorFlow; all the ops in the model were tuned.

After that, I loaded the log file with Relay and tested its performance. I found that the whole TVM inference time is far larger than TensorFlow's.

The "mod.get_out(0).asnumpy()" call alone takes about 240 ms!

I observed the following warning when testing the tuned model:

Extract tasks...
Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32, workload=('dense_small_batch.cuda', ('TENSOR', (2500, 512), 'float32'), ('TENSOR', (6600, 512), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.

How can I fix this missing configuration for the workload named "dense_small_batch.cuda"?

Looking forward to your reply! Thank you very much!