I recently autotuned a model and was surprised to find that the resulting model runs 2x slower than the unoptimized version. I’d like to dig into this and understand what’s causing the slowdown, but I can’t find a good way to measure per-layer runtime. This seems like an essential feature that must exist. Can anyone point me to some code that enables this?
Adding the option set(USE_GRAPH_RUNTIME_DEBUG ON) when you build TVM enables per-layer information like this:
Node Name               Ops                                                                  Time(us)   Time(%)  Start Time       End Time         Shape                Inputs  Outputs
----------------------  -------------------------------------------------------------------  ---------  -------  ---------------  ---------------  -------------------  ------  -------
1_NCHW1c                fuse___layout_transform___4                                          56.52      0.02     15:24:44.177475  15:24:44.177534  (1, 1, 224, 224)     1       1
_contrib_conv2d_nchwc0  fuse__contrib_conv2d_NCHWc                                           12436.11   3.4      15:24:44.177549  15:24:44.189993  (1, 1, 224, 224, 1)  2       1
relu0_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    4375.43    1.2      15:24:44.190027  15:24:44.194410  (8, 1, 5, 5, 1, 8)   2       1
_contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1                                         213108.6   58.28    15:24:44.194440  15:24:44.407558  (1, 8, 224, 224, 8)  2       1
relu1_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    2265.57    0.62     15:24:44.407600  15:24:44.409874  (64, 1, 1)           2       1
_contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2                                         104623.15  28.61    15:24:44.409905  15:24:44.514535  (1, 8, 224, 224, 8)  2       1
relu2_NCHW2c            fuse___layout_transform___broadcast_add_relu___layout_transform___1  2004.77    0.55     15:24:44.514567  15:24:44.516582  (8, 8, 3, 3, 8, 8)   2       1
_contrib_conv2d_nchwc3  fuse__contrib_conv2d_NCHWc_3                                         25218.4    6.9      15:24:44.516628  15:24:44.541856  (1, 8, 224, 224, 8)  2       1
reshape1                fuse___layout_transform___broadcast_add_reshape_transpose_reshape    1554.25    0.43     15:24:44.541893  15:24:44.543452  (64, 1, 1)           2       1
Reference: https://docs.tvm.ai/dev/debugger.html#debug-exchange-format
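Since the goal is to find which layers account for the slowdown, it can help to rank the rows of this dump by time. Below is a minimal sketch in plain Python (no TVM needed) that parses text in the format shown above. The names ROW, parse_profile, slowest_layers and SAMPLE are made up for illustration and are not part of TVM, and the regex assumes the column layout stays exactly as in the dump above.

```python
import re

# One profile row: node name, op name, time (us), time (%), start time,
# end time, shape, input count, output count. The shape contains spaces,
# so it is matched as a parenthesised group instead of splitting on spaces.
ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<op>\S+)\s+(?P<time_us>[\d.]+)\s+(?P<pct>[\d.]+)\s+"
    r"(?P<start>\S+)\s+(?P<end>\S+)\s+(?P<shape>\(.*\))\s+"
    r"(?P<inputs>\d+)\s+(?P<outputs>\d+)\s*$"
)


def parse_profile(text):
    """Return one dict per data row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, dashed separator, or blank line
        rows.append({
            "node": match.group("node"),
            "op": match.group("op"),
            "time_us": float(match.group("time_us")),
        })
    return rows


def slowest_layers(text, top=3):
    """Return (node, time_us, share_percent) for the `top` slowest rows.

    share_percent is recomputed from the parsed rows only, so it can differ
    from the Time(%) column if the text is a partial dump.
    """
    rows = parse_profile(text)
    total = sum(row["time_us"] for row in rows)
    rows.sort(key=lambda row: row["time_us"], reverse=True)
    return [(row["node"], row["time_us"], 100.0 * row["time_us"] / total)
            for row in rows[:top]]


# A few rows copied from the dump above (hypothetical variable name).
SAMPLE = """\
Node Name Ops Time(us) Time(%) Start Time End Time Shape Inputs Outputs
--------- --- -------- ------- ---------- -------- ----- ------ -------
_contrib_conv2d_nchwc0 fuse__contrib_conv2d_NCHWc 12436.11 3.4 15:24:44.177549 15:24:44.189993 (1, 1, 224, 224, 1) 2 1
_contrib_conv2d_nchwc1 fuse__contrib_conv2d_NCHWc_1 213108.6 58.28 15:24:44.194440 15:24:44.407558 (1, 8, 224, 224, 8) 2 1
_contrib_conv2d_nchwc2 fuse__contrib_conv2d_NCHWc_2 104623.15 28.61 15:24:44.409905 15:24:44.514535 (1, 8, 224, 224, 8) 2 1
"""

if __name__ == "__main__":
    for node, time_us, share in slowest_layers(SAMPLE):
        print(f"{node}: {time_us:.2f} us ({share:.1f}% of parsed rows)")
```

Running the same script on the dumps from the tuned and untuned builds and comparing the top entries should show which fused ops got slower.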
@tqchen Do you think it would be better to make verbose printing a parameter of profile instead of a makefile flag?
This works incredibly well on desktop, but it seems I can’t use it over RPC because the debug_runtime doesn’t have any remote contexts. Is there a trick to getting this to work with RPC?
You’re right, it actually does work fine. I just had to include graph_runtime_debug.cc in the compilation of the RPC client.