[TensorRT] Seems ctx.sync() does not work while using TensorRT on Jetson Xavier NX

I’m trying to deploy a model on a Jetson Xavier NX using TensorRT, following the tutorial here.

Building with the original TVM template works fine:

###model is mobilenet_v2
import time

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)

tgt = tvm.target.cuda()
ctx = tvm.gpu(0)
with tvm.transform.PassContext(opt_level=3):
    g, m, p = relay.build(mod, tgt, params=params)
module = graph_runtime.create(g, m, ctx)
module.set_input(**p)

for i in range(15):
    start = time.clock_gettime(time.CLOCK_REALTIME)*1000
    data = np.random.uniform(-1, 1, (1,3,224,224)).astype("float32")
    module.set_input("data", data)
    module.run()
    ctx.sync()
    end = time.clock_gettime(time.CLOCK_REALTIME)*1000
    print(end - start)

The results for each opt_level look quite reasonable, as shown below; each number is the median over 10 iterations, excluding the first 5 as warm-up:

opt_level=0: 67.904052734375ms
opt_level=1: 44.286865234375ms
opt_level=2: 42.317626953125ms
opt_level=3: 39.987060546875ms
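
(For clarity, the numbers above are computed roughly like this: 15 iterations, the first 5 discarded as warm-up, and the median taken over the remaining 10. A sketch reusing the loop from above:)

import statistics

latencies = []
for i in range(15):
    start = time.clock_gettime(time.CLOCK_REALTIME)*1000
    data = np.random.uniform(-1, 1, (1,3,224,224)).astype("float32")
    module.set_input("data", data)
    module.run()
    ctx.sync()
    latencies.append(time.clock_gettime(time.CLOCK_REALTIME)*1000 - start)

# drop the first 5 warm-up iterations, report the median of the last 10
print(statistics.median(latencies[5:]))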

But if I build with the TensorRT template as below:

###model is mobilenet_v2
mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)

from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt
mod, config = partition_for_tensorrt(mod, params)

tgt = tvm.target.cuda()
ctx = tvm.gpu(0)

with tvm.transform.PassContext(opt_level=args.opt_level, config={'relay.ext.tensorrt.options': config}):
    g, m, p = relay.build(mod, tgt, params=params)

module = graph_runtime.create(g, m, ctx)
module.set_input(**p)

for i in range(15):
    start = time.clock_gettime(time.CLOCK_REALTIME)*1000
    data = np.random.uniform(-1, 1, (1,3,224,224)).astype("float32")
    module.set_input("data", data)
    module.run()
    ctx.sync()
    end = time.clock_gettime(time.CLOCK_REALTIME)*1000
    print(end - start)

The resulting latency is suspiciously short, as if the context sync were not actually taking effect:

opt_level=0: 5.004150390625ms

I’m fairly sure of this because I get a similar value if I remove ctx.sync() from the original template.

I checked the issue here and tried the same code, but the result was the same.

Does anyone have an idea about this?

I found these benchmarks:

Considering these, is 5 ms for a single-batch mobilenet_v2 a reasonable result? I’m not really sure whether there should be this much of a performance gap between TensorRT and TVM…

The simplest way is to check whether the results from TVM and TensorRT match. On GPU, it’s entirely possible that TensorRT outperforms TVM if you didn’t tune the model.
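
You could also check how much of the graph was actually offloaded to TensorRT after partitioning. A rough sketch, assuming a TVM version where the partitioned subgraph functions carry a "Compiler" attribute:

# Inspect the partitioned module: functions whose "Compiler" attribute is
# "tensorrt" will be executed by TensorRT, the rest by TVM.
for gv in mod.get_global_vars():
    func = mod[gv]
    if func.attrs and "Compiler" in func.attrs:
        print(gv.name_hint, "->", func.attrs["Compiler"])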

Also cc @trevor-m


The TensorRT execution we use in TVM is not asynchronous, so there is no need to sync. module.run() won’t return until inference is completed. Actually I think run() is never asynchronous in TVM?

5 ms is not an unreasonable inference time for mobilenet_v2 with TensorRT on Xavier, although I am getting around 10 ms. But your model may be different.
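
If you want a number that is less sensitive to Python-side overhead, something like the graph runtime’s built-in time_evaluator should also work here (just a sketch, not tuned for your setup):

# Time the "run" function directly; results are per-repeat averages in seconds.
ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=3)
latencies_ms = np.array(ftimer().results) * 1000  # convert seconds to ms
print("median latency: %.2f ms" % np.median(latencies_ms))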


Thanks for your replies! I checked with the code below, and it seems the results are the same:

#tvm_trt_compare.py

...

mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)

tgt = tvm.target.cuda()
ctx = tvm.gpu(0)

###Same Input
data = np.random.uniform(-1, 1, size=input_shape).astype(dtype)


###GPU with TVM
with tvm.transform.PassContext(opt_level=args.opt_level):
    g1, m1, p1 = relay.build(mod, tgt, params=params)

module1 = graph_runtime.create(g1, m1, ctx)
module1.set_input(**p1)
module1.set_input("data", data)
module1.run()
out_tvm = module1.get_output(0).asnumpy()


###GPU with TensorRT
from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt
mod, config = partition_for_tensorrt(mod, params)

with tvm.transform.PassContext(opt_level=args.opt_level, config={'relay.ext.tensorrt.options': config}):
    g2, m2, p2 = relay.build(mod, tgt, params=params)

module2 = graph_runtime.create(g2, m2, ctx)
module2.set_input(**p2)
module2.set_input("data",data)
module2.run()
out_trt = module2.get_output(0).asnumpy()


if np.all(np.abs(out_tvm - out_trt) < 1e-5):
    print("CLEAR")
else:
    print("FAIL")

Also cc’ing @trevor-m.

But what I don’t understand is that applying opt_level=0~2 gives me “FAIL”, while opt_level=3 gives me “CLEAR”.

$ python3 tvm_trt_compare.py --opt_level 0
FAIL
$ python3 tvm_trt_compare.py --opt_level 3
CLEAR

Is TensorRT on TVM only supported with opt_level=3?

Besides, I measured the time taken with the same code base:

$ python3 tvm_trt_compare.py --opt_level 3
GPU_TVM: 6.222900390625
GPU_TRT: 6.257080078125
CLEAR

Maybe it is because these two different runtime modules (module1 and module2) share the same context?

Not sure why opt_level=0,1,2 results in different outputs. Maybe TensorRT assumes an opt_level 3 pass is always on, so correctness without that pass is not guaranteed, but this is just my guess.
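
One thing you could try is loosening the comparison a bit, since a hard absolute threshold of 1e-5 can be tripped by ordinary floating-point accumulation differences. A sketch reusing out_tvm and out_trt from your script (the tolerances here are arbitrary, not TVM defaults):

print("max abs diff:", np.abs(out_tvm - out_trt).max())
# relative + absolute tolerance instead of a hard 1e-5 cut-off
if np.allclose(out_tvm, out_trt, rtol=1e-3, atol=1e-3):
    print("CLEAR (within tolerance)")
else:
    print("FAIL")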

For the performance, although again I’m not sure if this is the reason, you can add this line before each build. It clears the compile engine cache to avoid incorrect measurements.

relay.backend.compile_engine.get().clear()
with tvm.transform.PassContext(...):
    g2, m2, p2 = relay.build(mod, tgt, params=params)

Thanks for the tip!

After adding the line you mentioned, I get “CLEAR” even when running with opt_level=0.

It was very helpful :)