How to make the get_output function faster?

When importing the demo ( ) and running it locally for testing, the run step executes quickly, but fetching the result output takes time.



time1 = time.perf_counter()
m.set_input('DecodeJpeg/contents', tvm.nd.array(x.astype(dtype)))
time3 = time.perf_counter()
tvm_output = m.get_output(0, tvm.nd.empty((1, 1008), 'float32'))
time4 = time.perf_counter()
print("time4 - time3:", time4 - time3)
print("time3 - time1:", time3 - time1)



time4 - time3: 1.049209000000019

time3 - time1: 0.5478069999999775


CUDA version: CUDA Version 10.0.130




GPU: Tesla P4

CUDA is asynchronous: when you call get_output(), it blocks until the model execution has finished.

How can I speed up the get_output function? When I run the inference multiple times, run() is fast, but get_output() becomes the bottleneck.

run() doesn’t wait until model execution is finished. It simply launches the kernels on the GPU and returns immediately. That’s why it’s fast. get_output() waits until the kernels are finished to get the result.

Basically, the timer you have around get_output() is the correct way to get execution time for the model. Otherwise, you’re just measuring kernel launch time.
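The launch-time vs. execution-time distinction is easy to reproduce with a small stand-in that mimics the GPU's behaviour. The class and sleep duration below are illustrative only, not TVM APIs: run() just launches background work and returns, while get_output() blocks until the work completes.

```python
import threading
import time

class AsyncModel:
    """Toy stand-in for a GPU-backed module: run() only *launches* work."""
    def __init__(self, work_seconds=0.2):
        self._work_seconds = work_seconds
        self._thread = None

    def run(self):
        # Launch the "kernel" in the background and return immediately.
        self._thread = threading.Thread(target=time.sleep,
                                        args=(self._work_seconds,))
        self._thread.start()

    def get_output(self):
        # Block until the launched work has finished.
        self._thread.join()
        return "result"

m = AsyncModel()

t0 = time.perf_counter()
m.run()            # fast: only a launch
t1 = time.perf_counter()
m.get_output()     # slow: waits for the execution to finish
t2 = time.perf_counter()

print("run():        %.3f s" % (t1 - t0))
print("get_output(): %.3f s" % (t2 - t1))
```

Timing run() alone here would report a near-zero number, just as in the TVM snippet above: the real execution cost only shows up at the blocking call.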

How could I make run() synchronous, so that it blocks until the execution is finished?
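One way is to synchronize the device explicitly after run(). A minimal sketch, assuming `m` is the graph runtime GraphModule and that the device context exposes a `sync()` call, as TVM's TVMContext does:

```python
import time

try:
    import tvm  # the real runtime, if installed
except ImportError:
    tvm = None  # keep this sketch importable without TVM

def run_blocking(m, ctx):
    """Run the graph and block until the device has finished.

    `m` is a graph runtime GraphModule and `ctx` a device context
    (e.g. tvm.gpu(0)). ctx.sync() waits for every kernel queued on
    that device, so the elapsed time covers actual execution.
    """
    t0 = time.perf_counter()
    m.run()      # launches the kernels and returns immediately
    ctx.sync()   # block here until all launched kernels complete
    return time.perf_counter() - t0

if tvm is not None:
    ctx = tvm.gpu(0)
    # elapsed = run_blocking(m, ctx)  # m: your GraphModule
```

Note that syncing does not make the model itself faster; it only moves the wait from get_output() into run(). For benchmarking, TVM's `m.module.time_evaluator("run", ctx, number=10)` helper performs this synchronization (plus warm-up and averaging) for you, though the exact parameters may vary across TVM versions.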