How to make the get_output function faster?

When importing the demo ( ) and running it locally for testing, the run step executes quickly, but fetching the result output takes time.



time1 = time.perf_counter()
m.set_input('DecodeJpeg/contents', tvm.nd.array(x.astype(dtype)))
time3 = time.perf_counter()
tvm_output = m.get_output(0, tvm.nd.empty((1, 1008), 'float32'))
time4 = time.perf_counter()
print("time4 - time3:", time4 - time3)
print("time3 - time1:", time3 - time1)



time4 - time3: 1.049209000000019

time3 - time1: 0.5478069999999775


CUDA version: CUDA Version 10.0.130




GPU: Tesla P4

CUDA is asynchronous: when you call get_output(), it blocks until the model execution has finished.

How can I speed up the get_output function? When I run the inference multiple times, run() is fast, but get_output() becomes the bottleneck.

run() doesn’t wait until model execution is finished. It simply launches the kernels on the GPU and returns immediately. That’s why it’s fast. get_output() waits until the kernels are finished to get the result.

Basically, the timer you have around get_output() is the correct way to get execution time for the model. Otherwise, you’re just measuring kernel launch time.
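The launch-time vs. execution-time distinction is easy to reproduce with a small stand-in that mimics the GPU's behaviour. The class and sleep duration below are illustrative only, not TVM APIs: run() just launches background work and returns, while get_output() blocks until the work completes.

```python
import threading
import time

class AsyncModel:
    """Toy stand-in for a GPU-backed module: run() only *launches* work."""
    def __init__(self, work_seconds=0.2):
        self._work_seconds = work_seconds
        self._thread = None

    def run(self):
        # Launch the "kernel" in the background and return immediately.
        self._thread = threading.Thread(target=time.sleep,
                                        args=(self._work_seconds,))
        self._thread.start()

    def get_output(self):
        # Block until the launched work has finished.
        self._thread.join()
        return "result"

m = AsyncModel()

t0 = time.perf_counter()
m.run()            # fast: only a launch
t1 = time.perf_counter()
m.get_output()     # slow: waits for the execution to finish
t2 = time.perf_counter()

print("run():        %.3f s" % (t1 - t0))
print("get_output(): %.3f s" % (t2 - t1))
```

Timing run() alone here would report a near-zero number, just as in the TVM snippet above: the real execution cost only shows up at the blocking call.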

How could I make run() synchronous, so that it blocks until the execution is finished?
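One way is to synchronize the device explicitly after run(). A minimal sketch, assuming `m` is the graph runtime GraphModule and that the device context exposes a `sync()` call, as TVM's TVMContext does:

```python
import time

try:
    import tvm  # the real runtime, if installed
except ImportError:
    tvm = None  # keep this sketch importable without TVM

def run_blocking(m, ctx):
    """Run the graph and block until the device has finished.

    `m` is a graph runtime GraphModule and `ctx` a device context
    (e.g. tvm.gpu(0)). ctx.sync() waits for every kernel queued on
    that device, so the elapsed time covers actual execution.
    """
    t0 = time.perf_counter()
    m.run()      # launches the kernels and returns immediately
    ctx.sync()   # block here until all launched kernels complete
    return time.perf_counter() - t0

if tvm is not None:
    ctx = tvm.gpu(0)
    # elapsed = run_blocking(m, ctx)  # m: your GraphModule
```

Note that syncing does not make the model itself faster; it only moves the wait from get_output() into run(). For benchmarking, TVM's `m.module.time_evaluator("run", ctx, number=10)` helper performs this synchronization (plus warm-up and averaging) for you, though the exact parameters may vary across TVM versions.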