[TensorRT] Seems ctx.sync() does not work while using TensorRT on Jetson Xavier NX

I’m trying to deploy a model on a Jetson Xavier NX using TensorRT, following the tutorial here.

Building with the original TVM template works fine:

###model is mobilenet_v2
import time

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)

tgt = tvm.target.cuda()
ctx = tvm.gpu(0)
with tvm.transform.PassContext(opt_level=3):
    g, m, p = relay.build(mod, tgt, params=params)
module = graph_runtime.create(g, m, ctx)
module.set_input(**p)

for i in range(15):
    start = time.clock_gettime(time.CLOCK_REALTIME)*1000
    data = np.random.uniform(-1, 1, (1,3,224,224)).astype("float32")
    module.set_input("data", data)
    module.run()
    ctx.sync()
    end = time.clock_gettime(time.CLOCK_REALTIME)*1000
    print(end - start)

The results for each opt_level look quite reasonable, as shown below; each number is the median over 10 iterations, excluding the first 5 as warm-up:

opt_level=0: 67.904052734375ms
opt_level=1: 44.286865234375ms
opt_level=2: 42.317626953125ms
opt_level=3: 39.987060546875ms
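
(For clarity, the numbers above are computed roughly like this: 15 iterations, the first 5 discarded as warm-up, and the median taken over the remaining 10. A sketch reusing the loop from above:)

import statistics

latencies = []
for i in range(15):
    start = time.clock_gettime(time.CLOCK_REALTIME)*1000
    data = np.random.uniform(-1, 1, (1,3,224,224)).astype("float32")
    module.set_input("data", data)
    module.run()
    ctx.sync()
    latencies.append(time.clock_gettime(time.CLOCK_REALTIME)*1000 - start)

# drop the first 5 warm-up iterations, report the median of the last 10
print(statistics.median(latencies[5:]))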

But if I build with the TensorRT template as below:

###model is mobilenet_v2
mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)

from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt
mod, config = partition_for_tensorrt(mod, params)

tgt = tvm.target.cuda()
ctx = tvm.gpu(0)

with tvm.transform.PassContext(opt_level=args.opt_level, config={'relay.ext.tensorrt.options': config}):
    g, m, p = relay.build(mod, tgt, params=params)

module = graph_runtime.create(g, m, ctx)
module.set_input(**p)

for i in range(15):
    start = time.clock_gettime(time.CLOCK_REALTIME)*1000
    data = np.random.uniform(-1, 1, (1,3,224,224)).astype("float32")
    module.set_input("data", data)
    module.run()
    ctx.sync()
    end = time.clock_gettime(time.CLOCK_REALTIME)*1000
    print(end - start)

The resulting latency is suspiciously short, as if the context sync were not actually taking effect:

opt_level=0: 5.004150390625ms

I’m fairly sure of this because I get a similar value if I remove ctx.sync() from the original template.

I checked the issue here and tried the same code, but the result was the same.

Does anyone have an idea about this?

I found these benchmarks:

Considering these, is 5 ms for a single-batch mobilenet_v2 a reasonable result? I’m not really sure whether there should be this much of a performance gap between TensorRT and TVM…

The simplest way is to check whether the results from TVM and TensorRT match. On GPU, it’s entirely possible that TensorRT outperforms TVM if you didn’t tune the model.
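
You could also check how much of the graph was actually offloaded to TensorRT after partitioning. A rough sketch, assuming a TVM version where the partitioned subgraph functions carry a "Compiler" attribute:

# Inspect the partitioned module: functions whose "Compiler" attribute is
# "tensorrt" will be executed by TensorRT, the rest by TVM.
for gv in mod.get_global_vars():
    func = mod[gv]
    if func.attrs and "Compiler" in func.attrs:
        print(gv.name_hint, "->", func.attrs["Compiler"])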

Also cc @trevor-m


The TensorRT execution we use in TVM is not asynchronous, so there is no need to sync. module.run() won’t return until inference is completed. Actually I think run() is never asynchronous in TVM?

5 ms is not an unreasonable inference time for mobilenet_v2 with TensorRT on Xavier, although I am getting around 10 ms. But your model may be different.
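
If you want a number that is less sensitive to Python-side overhead, something like the graph runtime’s built-in time_evaluator should also work here (just a sketch, not tuned for your setup):

# Time the "run" function directly; results are per-repeat averages in seconds.
ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=3)
latencies_ms = np.array(ftimer().results) * 1000  # convert seconds to ms
print("median latency: %.2f ms" % np.median(latencies_ms))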


Thanks for your replies! I checked with the code below, and it seems the results are the same:

#tvm_trt_compare.py

...

mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)

tgt = tvm.target.cuda()
ctx = tvm.gpu(0)

###Same Input
data = np.random.uniform(-1, 1, size=input_shape).astype(dtype)


###GPU with TVM
with tvm.transform.PassContext(opt_level=args.opt_level):
    g1, m1, p1 = relay.build(mod, tgt, params=params)

module1 = graph_runtime.create(g1, m1, ctx)
module1.set_input(**p1)
module1.set_input("data", data)
module1.run()
out_tvm = module1.get_output(0).asnumpy()


###GPU with TensorRT
from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt
mod, config = partition_for_tensorrt(mod, params)

with tvm.transform.PassContext(opt_level=args.opt_level, config={'relay.ext.tensorrt.options': config}):
    g2, m2, p2 = relay.build(mod, tgt, params=params)

module2 = graph_runtime.create(g2, m2, ctx)
module2.set_input(**p2)
module2.set_input("data",data)
module2.run()
out_trt = module2.get_output(0).asnumpy()


if np.all(np.abs(out_tvm - out_trt) < 1e-5):
    print("CLEAR")
else:
    print("FAIL")

Also cc’ing @trevor-m.

But what I don’t understand is that applying opt_level=0~2 gives me “FAIL”, while opt_level=3 gives me “CLEAR”.

$ python3 tvm_trt_compare.py --opt_level 0
FAIL
$ python3 tvm_trt_compare.py --opt_level 3
CLEAR

Is TensorRT on TVM only supported with opt_level=3?

Besides, I measured the time taken with the same code base:

$ python3 tvm_trt_compare.py --opt_level 3
GPU_TVM: 6.222900390625
GPU_TRT: 6.257080078125
CLEAR

Maybe it is because these two different runtime modules (module1 and module2) share the same context?

Not sure why opt_level=0,1,2 results in different outputs. Maybe TensorRT assumes an opt_level 3 pass is always on, so correctness without that pass is not guaranteed, but this is just my guess.
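
One thing you could try is loosening the comparison a bit, since a hard absolute threshold of 1e-5 can be tripped by ordinary floating-point accumulation differences. A sketch reusing out_tvm and out_trt from your script (the tolerances here are arbitrary, not TVM defaults):

print("max abs diff:", np.abs(out_tvm - out_trt).max())
# relative + absolute tolerance instead of a hard 1e-5 cut-off
if np.allclose(out_tvm, out_trt, rtol=1e-3, atol=1e-3):
    print("CLEAR (within tolerance)")
else:
    print("FAIL")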

For the performance, although again I’m not sure if this is the reason, you can add this line before each build. It clears the compile engine cache to avoid incorrect measurements.

relay.backend.compile_engine.get().clear()
with tvm.transform.PassContext(...):
    g2, m2, p2 = relay.build(mod, tgt, params=params)

Thanks for the tip!

After adding the line you mentioned, I get “CLEAR” even when running with opt_level=0.

It was very helpful :)