Debug graph runtime support for latest TVM

I want to find out the latency of each layer on my target hardware, and I am trying to use the graph debugger from tvm.contrib.debugger for this. The documentation shows that debug_runtime.create expects three parameters: graph, lib, and ctx. In the latest TVM version, relay.build returns only lib = relay.build(…); the earlier relay.build API that also returned the graph seems to be deprecated. How do I get the graph info to pass into debug_runtime now? Or, more generally, how do I use the graph debugger with the latest TVM build?

@comaniac

You could refer to def test_debug_graph_runtime() in tests/python/unittest/test_runtime_module_based_interface.py; I showed the usage there.
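Roughly, the usage from that test looks like this (a minimal sketch, assuming you already have a Relay mod/params pair from a frontend; the input name "data" and its shape are placeholders):

import numpy as np
import tvm
from tvm import relay
from tvm.contrib.debugger import debug_runtime

# Build once; the returned factory module bundles the graph JSON, the
# compiled operators, and the parameters, so no separate `graph`
# argument is needed anymore.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, "llvm", params=params)

ctx = tvm.cpu(0)
debug_mod = debug_runtime.GraphModuleDebug(
    lib["debug_create"]("default", ctx),  # debug factory baked into the module
    [ctx],
    lib.get_json(),  # graph JSON recovered from the factory module
    None,            # dump root; None picks a default directory
)
debug_mod.set_input("data", np.zeros((1, 3, 224, 224), dtype="float32"))
debug_mod.run()  # runs the model and prints the per-op timing table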

@FrozenGene Thanks for the response. I followed the usage and it worked fine on x86 with the LLVM target. However, when I try to profile layers on another target over RPC, for example a Raspberry Pi, I get an error.

File "/home/tvm/src/runtime/library_module.cc", line 78
TVMError:
An internal invariant was violated during the execution of TVM.
Please read TVM's error reporting guidelines.
More details can be found here: https://discuss.tvm.ai/t/error-reporting/7793.

Check failed: ret == 0 (-1 vs. 0) : Assert fail: (1 == tir.tvm_struct_get(arg0, 0, 10)), Argument arg0.device_type has an unsatisfied constraint: (1 == tir.tvm_struct_get(arg0, 0, 10))

I've built the runtime with the graph debugger enabled on both my target hardware and the host.
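That is, with the graph-debug flag switched on in config.cmake on both machines before building (assuming the usual flag name for TVM of this vintage):

set(USE_GRAPH_RUNTIME_DEBUG ON)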

Any help would be appreciated.

Thanks

Does it work without the debug runtime?

If it works but fails with the debug graph runtime, could you post the complete script?

@FrozenGene I've modified the script from https://tvm.apache.org/docs/tutorials/frontend/deploy_model_on_rasp.html to use the graph debugger. It fails for both local_demo=True and local_demo=False; without the graph debugger, the script runs fine in both cases.

Here is the script:

import tvm
from tvm import te
import tvm.relay as relay
from tvm import rpc
from tvm.contrib import utils
from tvm.contrib.download import download_testdata
from mxnet.gluon.model_zoo.vision import get_model
from PIL import Image
import numpy as np
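# Load a pretrained ResNet-18 from the MXNet Gluon model zoo and fetch a test image.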
block = get_model("resnet18_v1", pretrained=True)
img_url = "https://github.com/dmlc/mxnet.js/blob/main/data/cat.png?raw=true"
img_name = "cat.png"
img_path = download_testdata(img_url, img_name, module="data")
image = Image.open(img_path).resize((224, 224))
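# Subtract the ImageNet mean, divide by the std, and convert HWC -> NCHW with a batch axis.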
def transform_image(image):
    image = np.array(image) - np.array([123.0, 117.0, 104.0])
    image /= np.array([58.395, 57.12, 57.375])
    image = image.transpose((2, 0, 1))
    image = image[np.newaxis, :]
    return image

x = transform_image(image)
synset_url = "".join(
    [
        "https://gist.githubusercontent.com/zhreshold/",
        "4d0b62f3d01426887599d4f7ede23ee5/raw/",
        "596b27d23537e5a1b5751d2b0481ef172f58b539/",
        "imagenet1000_clsid_to_human.txt",
    ]
)
synset_name = "imagenet1000_clsid_to_human.txt"
synset_path = download_testdata(synset_url, synset_name, module="data")
with open(synset_path) as f:
    synset = eval(f.read())
shape_dict = {"data": x.shape}
mod, params = relay.frontend.from_mxnet(block, shape_dict)
func = mod["main"]
func = relay.Function(func.params, relay.nn.softmax(func.body), None, func.type_params, func.attrs)

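# Set local_demo = False to cross-compile for the Raspberry Pi and run over RPC.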
local_demo = True

if local_demo:
    target = tvm.target.Target("llvm")
else:
    target = tvm.target.arm_cpu("rasp3b")

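# Compile the Relay module; the resulting factory bundles graph JSON, lib, and params.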
with tvm.transform.PassContext(opt_level=3):
    compiled_graph_lib = relay.build_module.build(mod, target, params=params)
tmp = utils.tempdir()
lib_fname = tmp.relpath("net.tar")
compiled_graph_lib.export_library(lib_fname)
from tvm.contrib.debugger import debug_runtime as graph_runtime
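# Open the session: local for the demo, RPC to the board otherwise.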
if local_demo:
    remote = rpc.LocalSession()
else:
    host = "192.168.1.12"
    port = 9090
    remote = rpc.connect(host, port)
remote.upload(lib_fname)
rlib = remote.load_module("net.tar")
ctx = remote.cpu(0)
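# Module-based interface: create the debug runtime through the factory baked into the built module.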
gmod = compiled_graph_lib["debug_create"]("default", ctx)

set_input = gmod["set_input"]
run = gmod["run"]
get_output = gmod["get_output"]
set_input("data", tvm.nd.array(x.astype("float32")))
run()
out = get_output(0)
print(out)
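# The same through the GraphModuleDebug wrapper, which also dumps per-op timings.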
debug_g_mod = graph_runtime.GraphModuleDebug(
    compiled_graph_lib["debug_create"]("default", ctx),
    [ctx],
    compiled_graph_lib.get_json(),
    None,
)
debug_g_mod.set_input("data", tvm.nd.array(x.astype("float32")))
debug_g_mod.run()
out_db = debug_g_mod.get_output(0).asnumpy()
print(out_db)


And here is the traceback:

Traceback (most recent call last):
  File "Test.py", line 70, in <module>
    run()
  File "/home/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (4) /home/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7f522d54e4e5]
  [bt] (3) /home/tvm/build/libtvm.so(tvm::runtime::GraphRuntime::Run()+0x37) [0x7f522d5d6ec7]
  [bt] (2) /home/tvm/build/libtvm.so(+0x12a6e47) [0x7f522d5d6e47]
  [bt] (1) /home/tvm/build/libtvm.so(+0x1230f55) [0x7f522d560f55]
  [bt] (0) /home/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x82) [0x7f522c923e12]
  File "/home/tvm/src/runtime/library_module.cc", line 78
TVMError: 
---------------------------------------------------------------
An internal invariant was violated during the execution of TVM.
Please read TVM's error reporting guidelines.
More details can be found here: https://discuss.tvm.ai/t/error-reporting/7793.
---------------------------------------------------------------

  Check failed: ret == 0 (-1 vs. 0) : Assert fail: (1 == tir.tvm_struct_get(arg0, 0, 10)), Argument arg0.device_type has an unsatisfied constraint: (1 == tir.tvm_struct_get(arg0, 0, 10))

@aakah18151

I think the problem is that you should use rlib instead of compiled_graph_lib, which is likely what the device_type assert is complaining about. See def verify_rpc_cpu_export for reference.
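Something like this, adapted from the script above (an untested sketch; it assumes the graph JSON can still be taken from the locally built factory):

# Create the debug runtime from the module loaded on the remote,
# not from the locally built one.
rlib = remote.load_module("net.tar")
ctx = remote.cpu(0)
debug_g_mod = graph_runtime.GraphModuleDebug(
    rlib["debug_create"]("default", ctx),
    [ctx],
    compiled_graph_lib.get_json(),  # assumption: JSON still from the local build
    None,
)
debug_g_mod.set_input("data", tvm.nd.array(x.astype("float32")))
debug_g_mod.run()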

Hi @FrozenGene, thanks for the help; using rlib solved the RPC problem. However, the profiling data seems to be a bit off.

Node Name                                         Ops                                              Time(us)  Time(%)  Shape                 Inputs  Outputs  
---------                                         ---                                              --------  -------  -----                 ------  -------  
fused_nn_contrib_conv2d_NCHWc_add_nn_relu         fused_nn_contrib_conv2d_NCHWc_add_nn_relu        0.007     6.711    (1, 2, 112, 112, 32)  3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1       fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1      0.007     6.252    (1, 2, 56, 56, 32)    3       1        
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_61  fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_6  0.007     6.319    (1, 2, 56, 56, 32)    4       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_11      fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1      0.007     6.248    (1, 2, 56, 56, 32)    3       1        
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_6   fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_6  0.007     6.287    (1, 2, 56, 56, 32)    4       1        
fused_nn_contrib_conv2d_NCHWc_add                 fused_nn_contrib_conv2d_NCHWc_add                0.007     6.432    (1, 4, 28, 28, 32)    3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3       fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3      0.007     6.203    (1, 4, 28, 28, 32)    3       1        
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_4   fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_4  0.007     6.159    (1, 4, 28, 28, 32)    4       1        
fused_nn_contrib_conv2d_NCHWc_add_1               fused_nn_contrib_conv2d_NCHWc_add_1              0.007     6.086    (1, 16, 14, 14, 16)   3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5       fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5      0.007     6.052    (1, 16, 14, 14, 16)   3       1        
fused_nn_contrib_conv2d_NCHWc_add_2               fused_nn_contrib_conv2d_NCHWc_add_2              0.007     6.217    (1, 16, 7, 7, 32)     3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7       fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7      0.007     6.253    (1, 16, 7, 7, 32)     3       1        
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu     fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu    0.007     6.264    (1, 16, 7, 7, 32)     4       1        
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_2   fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_2  0.006     6.014    (1, 16, 14, 14, 16)   4       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2       fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2      0.004     3.878    (1, 2, 28, 28, 64)    3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6       fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6      0.004     3.299    (1, 16, 7, 7, 32)     3       1        
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4       fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4      0.003     3.245    (1, 8, 14, 14, 32)    3       1        
fused_layout_transform_20                         fused_layout_transform_20                        0.0       0.071    (1, 1, 224, 224, 3)   1       1        
fused_nn_max_pool2d                               fused_nn_max_pool2d                              0.0       0.338    (1, 2, 56, 56, 32)    1       1        
fused_layout_transform_211                        fused_layout_transform_21                        0.0       0.027    (1, 1, 56, 56, 64)    1       1        
fused_layout_transform_21                         fused_layout_transform_21                        0.0       0.057    (1, 1, 56, 56, 64)    1       1        
fused_layout_transform_213                        fused_layout_transform_21                        0.0       0.047    (1, 1, 56, 56, 64)    1       1        
fused_layout_transform_212                        fused_layout_transform_21                        0.0       0.056    (1, 1, 56, 56, 64)    1       1        
fused_layout_transform_19                         fused_layout_transform_19                        0.0       0.048    (1, 8, 56, 56, 8)     1       1        
fused_layout_transform_22                         fused_layout_transform_22                        0.0       0.019    (1, 1, 28, 28, 128)   1       1        
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_5   fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_5  0.0       0.401    (1, 4, 28, 28, 32)    4       1        
fused_layout_transform_231                        fused_layout_transform_23                        0.0       0.017    (1, 1, 28, 28, 128)   1       1        
fused_layout_transform_23                         fused_layout_transform_23                        0.0       0.016    (1, 1, 28, 28, 128)   1       1        
fused_layout_transform_18                         fused_layout_transform_18                        0.0       0.014    (1, 2, 28, 28, 64)    1       1        
fused_layout_transform_25                         fused_layout_transform_25                        0.0       0.012    (1, 16, 28, 28, 8)    1       1        
fused_layout_transform_24                         fused_layout_transform_24                        0.0       0.01     (1, 2, 14, 14, 128)   1       1        
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_3   fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_3  0.0       0.368    (1, 16, 14, 14, 16)   4       1        
fused_layout_transform_172                        fused_layout_transform_17                        0.0       0.01     (1, 2, 14, 14, 128)   1       1        
fused_layout_transform_171                        fused_layout_transform_17                        0.0       0.012    (1, 2, 14, 14, 128)   1       1        
fused_layout_transform_17                         fused_layout_transform_17                        0.0       0.014    (1, 2, 14, 14, 128)   1       1        
fused_layout_transform_27                         fused_layout_transform_27                        0.0       0.008    (1, 32, 14, 14, 8)    1       1        
fused_layout_transform_26                         fused_layout_transform_26                        0.0       0.012    (1, 1, 7, 7, 512)     1       1        
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_1   fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_1  0.0       0.362    (1, 16, 7, 7, 32)     4       1        
fused_layout_transform_262                        fused_layout_transform_26                        0.0       0.008    (1, 1, 7, 7, 512)     1       1        
fused_layout_transform_261                        fused_layout_transform_26                        0.0       0.014    (1, 1, 7, 7, 512)     1       1        
fused_nn_global_avg_pool2d                        fused_nn_global_avg_pool2d                       0.0       0.015    (1, 16, 1, 1, 32)     1       1        
fused_layout_transform_nn_batch_flatten           fused_layout_transform_nn_batch_flatten          0.0       0.004    (1, 512)              1       1        
fused_nn_dense_add                                fused_nn_dense_add                               0.0       0.122    (1, 1000)             3       1        
Total_time                                        -                                                0.108     -        -                     -       -        
END RUN
Runtime:             106.40 ms           (0.00 ms)

The Runtime: line is the inference time measured with the time evaluator. Even though the header of the profiling table says Time(us), i.e. microseconds, the values actually appear to be printed in seconds: the table's Total_time of 0.108 only matches the evaluator's 106.40 ms if it is in seconds. Is this a bug or am I missing something here?

It is a bug; you are welcome to open a pull request. It is related to this issue: https://github.com/apache/tvm/issues/7207

Uploaded https://github.com/apache/tvm/pull/7227.

Apologies for the breakage.

Thanks @areusch @FrozenGene