[AutoTVM] How to measure post-tuning layer runtimes

jwfromm · July 22, 2019, 5:59pm

I recently autotuned a model and was surprised to find that the tuned model runs 2x slower than the unoptimized version. I’d like to dig into this and understand what’s causing the slowdown, but I can’t find a good way to measure per-layer runtime. This seems like an essential feature that must exist. Can anyone point me to some code that enables this?

Lyken17 · July 22, 2019, 9:08pm

Adding the option set(USE_GRAPH_RUNTIME_DEBUG ON) to config.cmake when you build TVM enables per-layer information like this:

Node Name               Ops                                                                  Time(us)   Time(%)  Start Time       End Time         Shape                Inputs  Outputs
---------               ---                                                                  --------   -------  ----------       --------         -----                ------  -------
1_NCHW1c                fuse___layout_transform___4                                          56.52      0.02     15:24:44.177475  15:24:44.177534  (1, 1, 224, 224)     1       1
_contrib_conv2d_nchwc0  fuse__contrib_conv2d_NCHWc                                           12436.11   3.4      15:24:44.177549  15:24:44.189993  (1, 1, 224, 224, 1)  2       1
relu0_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    4375.43    1.2      15:24:44.190027  15:24:44.194410  (8, 1, 5, 5, 1, 8)   2       1
_contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1                                         213108.6   58.28    15:24:44.194440  15:24:44.407558  (1, 8, 224, 224, 8)  2       1
relu1_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    2265.57    0.62     15:24:44.407600  15:24:44.409874  (64, 1, 1)           2       1
_contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2                                         104623.15  28.61    15:24:44.409905  15:24:44.514535  (1, 8, 224, 224, 8)  2       1
relu2_NCHW2c            fuse___layout_transform___broadcast_add_relu___layout_transform___1  2004.77    0.55     15:24:44.514567  15:24:44.516582  (8, 8, 3, 3, 8, 8)   2       1
_contrib_conv2d_nchwc3  fuse__contrib_conv2d_NCHWc_3                                         25218.4    6.9      15:24:44.516628  15:24:44.541856  (1, 8, 224, 224, 8)  2       1
reshape1                fuse___layout_transform___broadcast_add_reshape_transpose_reshape    1554.25    0.43     15:24:44.541893  15:24:44.543452  (64, 1, 1)           2       1

Reference: https://docs.tvm.ai/dev/debugger.html#debug-exchange-format
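
If you capture that table as text, a small script can rank the nodes by time. This is only a rough sketch, not part of TVM: parse_profile and hottest are made-up helper names, and it assumes columns are separated by two or more spaces, as in the paste above (three rows are copied here with the whitespace shortened).

```python
import re

# Three rows copied from the profile above (column padding shortened).
PROFILE = """\
_contrib_conv2d_nchwc0  fuse__contrib_conv2d_NCHWc    12436.11   3.4    15:24:44.177549  15:24:44.189993  (1, 1, 224, 224, 1)  2  1
_contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1  213108.6   58.28  15:24:44.194440  15:24:44.407558  (1, 8, 224, 224, 8)  2  1
_contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2  104623.15  28.61  15:24:44.409905  15:24:44.514535  (1, 8, 224, 224, 8)  2  1
"""


def parse_profile(text):
    """Parse debug-runtime table rows into dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        # Columns are separated by runs of 2+ spaces; the shape tuple only
        # contains single spaces, so it stays in one field.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 9:
            continue
        try:
            time_us = float(fields[2])
        except ValueError:
            continue  # header or separator row
        rows.append({
            "node": fields[0],
            "op": fields[1],
            "time_us": time_us,
            "percent": float(fields[3]),
        })
    return rows


def hottest(rows, n=3):
    """Return the n slowest nodes, slowest first."""
    return sorted(rows, key=lambda r: r["time_us"], reverse=True)[:n]


if __name__ == "__main__":
    for row in hottest(parse_profile(PROFILE)):
        print(row["node"], row["op"], row["time_us"], row["percent"])
```

On the rows above this puts _contrib_conv2d_nchwc1 first, which matches the 58.28% entry in the table.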

Lyken17 · July 23, 2019, 1:16am

@tqchen Do you think it would be better to make verbose printing a parameter of profile instead of a makefile flag?

jwfromm · July 23, 2019, 1:25am

This works incredibly well on desktop, but it seems like I can’t use it over RPC, as the debug_runtime doesn’t have any remote contexts. Is there a trick to getting this to work with RPC?

tqchen · July 23, 2019, 1:36am

I thought the remote runtime should work across RPC. cc @srkreddy1238

jwfromm · July 23, 2019, 6:49pm

You’re right, it actually does work fine. I just had to include graph_runtime_debug.cc in the compilation of the RPC client.
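
For anyone landing here later, the remote flow looks roughly like the sketch below. It assumes the 2019-era Python API (tvm.rpc and tvm.contrib.debugger.debug_runtime), a TVM build with USE_GRAPH_RUNTIME_DEBUG enabled on both ends, and a CPU target. profile_over_rpc and its arguments (graph_json, lib_path, params, host, port) are placeholder names for illustration, not something from this thread.

```python
import os


def profile_over_rpc(graph_json, lib_path, params, host, port):
    """Run one profiled inference on a remote device (illustrative helper).

    graph_json: graph JSON string produced when the model was built
    lib_path:   local path to the exported library, e.g. a .tar or .so
    params:     dict of input and parameter arrays
    host, port: address of the running RPC server (placeholders)
    """
    # Imported lazily so that defining this helper does not require TVM.
    from tvm import rpc
    from tvm.contrib.debugger import debug_runtime

    # Connect, ship the compiled library over, and load it remotely.
    remote = rpc.connect(host, port)
    remote.upload(lib_path)
    rlib = remote.load_module(os.path.basename(lib_path))
    ctx = remote.cpu(0)  # assumption: CPU target; adjust for other devices

    # The debug runtime takes the same arguments as the regular graph runtime.
    module = debug_runtime.create(graph_json, rlib, ctx)
    module.set_input(**params)
    module.run()  # expected to print the per-node timing table
    return module
```

If the create call fails on the remote side, the first thing to check is whether graph_runtime_debug.cc was compiled into the runtime on the device, as described above.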