How to make get_output function faster?

When I import the demo (https://docs.tvm.ai/tutorials/frontend/from_tensorflow.html) and run it locally for testing, run() executes quickly, but fetching the result with get_output() takes a long time.

Code executed:

"

import time

# m, x, dtype and params come from the tutorial linked above
time1 = time.clock()
m.set_input('DecodeJpeg/contents', tvm.nd.array(x.astype(dtype)))
m.set_input(**params)
m.run()
time3 = time.clock()
tvm_output = m.get_output(0, tvm.nd.empty((1, 1008), 'float32'))
time4 = time.clock()
print("time4 - time3:", time4 - time3)
print("time3 - time1:", time3 - time1)

"

#result:

time4 - time3: 1.049209000000019

time3 - time1: 0.5478069999999775

environment:

CUDA version: 10.0.130
cuDNN: 7.6.5
tensorflow-gpu version: 1.14.0
Python version: 3.5.9
GPU: Tesla P4

m.run() is asynchronous. When you call get_output(), it blocks until the model execution has finished.

How can I speed up the get_output function? When I run inference multiple times, run() is fast, but get_output() becomes the bottleneck.

run() doesn’t wait until model execution is finished. It simply launches the kernels on the GPU and returns immediately. That’s why it’s fast. get_output() waits until the kernels are finished to get the result.

Basically, the timer you have around get_output() is the correct way to get execution time for the model. Otherwise, you’re just measuring kernel launch time.
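The effect described above can be reproduced without a GPU. Here is a minimal sketch in plain Python. FakeAsyncModule is a hypothetical stand-in (it is not a TVM class): its run() starts the work on a background thread and returns at once, and its get_output() waits for that work to finish. The 0.2 second sleep is an arbitrary placeholder for kernel execution time.

```python
import threading
import time


class FakeAsyncModule:
    """Hypothetical stand-in for an asynchronous runtime; not a TVM class."""

    def __init__(self, work_seconds):
        self.work_seconds = work_seconds
        self._worker = None
        self._result = None

    def _kernel(self):
        time.sleep(self.work_seconds)  # pretend the device is busy
        self._result = 42

    def run(self):
        # Launch the work and return immediately, like a kernel launch.
        self._worker = threading.Thread(target=self._kernel)
        self._worker.start()

    def get_output(self):
        # Block until the launched work has finished, then return the result.
        self._worker.join()
        return self._result


def measure(module):
    """Time run() and get_output() separately, as in the question."""
    t0 = time.perf_counter()
    module.run()
    t1 = time.perf_counter()
    out = module.get_output()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, out


run_time, output_time, out = measure(FakeAsyncModule(work_seconds=0.2))
print("run():        %.4f s" % run_time)
print("get_output(): %.4f s" % output_time)
```

In this sketch almost all of the elapsed time shows up in get_output(), even though get_output() itself does no work beyond waiting. The same reading applies to the numbers in the question.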

How can I make run() synchronous, so that it blocks until the execution is finished?