Hello,
I had a similar problem, although not on Android but on an embedded ARM/Linux board. Here is how it worked for me; maybe a similar solution works for you too:
- If you want to be 100% sure to avoid inconsistencies, build the SAME TVM for the host and the embedded target.
- Building for x86_64/Linux: follow this tutorial
- Building for ARM/Linux: I used a cross-compilation approach:
- Copy your main tvm folder into something like tvm-arm
- Clear your build directory, but perhaps leave your config.cmake
- It is important to enable the following options in config.cmake for the embedded target:
set(USE_RPC ON)
set(USE_GRAPH_RUNTIME ON)
set(USE_GRAPH_RUNTIME_DEBUG ON)
- Set the path to your cross-compiler in the shell you are going to invoke cmake and make from:
(Please adapt the paths to your setup)
export CC=/my/path/to/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc
export CXX=/my/path/to/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu/bin/aarch64-linux-gnu-g++
cd build
cmake ..
make -j 8
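A quick sanity check that can save some head-scratching: if cmake did not pick up CC/CXX, you end up with x86_64 libraries again. Below is a minimal sketch (my own helper, not part of TVM) that reads the architecture from the ELF header of a built library. The function name and the file name in the usage note are just examples.

```python
import struct

# ELF e_machine codes, taken from the ELF specification
ELF_MACHINES = {40: "arm", 62: "x86_64", 183: "aarch64"}


def elf_machine(path):
    """Return the target architecture recorded in the ELF header of a file."""
    with open(path, "rb") as f:
        header = f.read(20)
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError(f"{path} is not an ELF file")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown ({machine})")
```

For example, calling elf_machine("build/libtvm_runtime.so") on the host should report aarch64 if the cross-compiler above was really used.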
- Copy the whole tvm-arm directory (or at least the build directory with the shared objects in it) to your embedded system
- Set the following environment variables on the embedded device, in the shell you are going to start the TVM RPC server from:
(Please adapt the paths to your setup)
export TVM_HOME=/home/root/tvm-arm
export PYTHONPATH=$TVM_HOME/python:$TVM_HOME/topi/python:${PYTHONPATH}
export LD_LIBRARY_PATH=/usr/lib:$LD_LIBRARY_PATH
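If the RPC server later fails to import tvm, it is usually one of these variables. Here is a small sketch (again my own helper, with a made-up name) that checks the variables against the layout described above. It assumes the directory layout of the TVM version I used, with python and topi/python under TVM_HOME.

```python
import os


def check_tvm_env(env):
    """Return a list of problems found in the TVM-related environment variables."""
    problems = []
    tvm_home = env.get("TVM_HOME", "")
    if not tvm_home:
        problems.append("TVM_HOME is not set")
        return problems
    # The shared objects are expected in $TVM_HOME/build
    if not os.path.isdir(os.path.join(tvm_home, "build")):
        problems.append("no build directory under " + tvm_home)
    # Both Python package directories must be on PYTHONPATH
    python_path = env.get("PYTHONPATH", "").split(os.pathsep)
    for sub in ("python", os.path.join("topi", "python")):
        wanted = os.path.join(tvm_home, sub)
        if wanted not in python_path:
            problems.append(wanted + " is missing from PYTHONPATH")
    return problems
```

On the device you could run print(check_tvm_env(os.environ)) in the same shell; an empty list means the variables match the layout above.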
- Now you can start an RPC on your embedded device as follows:
python -m tvm.exec.rpc_server --host 0.0.0.0 --port=9090 --no-fork # <- use the same port in the target application later
- From your TVM host application, you need to import and use the debug runtime:
from tvm.contrib.debugger import debug_runtime as graph_runtime
- Whenever you create the graph runtime, do it as follows:
...
rtmodule = graph_runtime.create(graph, rlib, ctx, dump_root='/tmp/tvmdbg/')
...
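The snippet above leaves out where rlib and ctx come from. A rough sketch of that part, assuming the RPC API of the TVM version I used (rpc.connect, upload, load_module, cpu) and a model library that has already been cross-compiled for the board. The function name, the address, and the file name are placeholders of mine, not something from TVM.

```python
import os


def create_remote_debug_runtime(graph, lib_path, host, port, dump_root="/tmp/tvmdbg/"):
    """Connect to the RPC server on the board and create a debug runtime there."""
    # Imported here so that the file can be loaded on machines without TVM
    from tvm import rpc
    from tvm.contrib.debugger import debug_runtime as graph_runtime

    # Same port as the one passed to tvm.exec.rpc_server on the device
    remote = rpc.connect(host, port)
    # Send the cross-compiled model library to the device and load it there
    remote.upload(lib_path)
    rlib = remote.load_module(os.path.basename(lib_path))
    # Run on the CPU of the embedded device
    ctx = remote.cpu(0)
    return graph_runtime.create(graph, rlib, ctx, dump_root=dump_root)


# Hypothetical usage; address, port, and file name are placeholders:
# rtmodule = create_remote_debug_runtime(graph, "net.so", "192.168.1.50", 9090)
# rtmodule.set_input(**params)
# rtmodule.run()
```

Newer TVM releases have renamed some of these modules, so treat the import paths as specific to the version this post is about.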
- Now whenever you invoke run(), the debug runtime will give you some profiling information from the embedded device, e.g.:
Node Name               Ops                                                                  Time(us)   Time(%)  Start Time       End Time         Shape                Inputs  Outputs
---------               ---                                                                  --------   -------  ----------       --------         -----                ------  -------
1_NCHW1c                fuse___layout_transform___4                                          56.52      0.02     15:24:44.177475  15:24:44.177534  (1, 1, 224, 224)     1       1
_contrib_conv2d_nchwc0  fuse__contrib_conv2d_NCHWc                                           12436.11   3.4      15:24:44.177549  15:24:44.189993  (1, 1, 224, 224, 1)  2       1
relu0_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    4375.43    1.2      15:24:44.190027  15:24:44.194410  (8, 1, 5, 5, 1, 8)   2       1
_contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1                                         213108.6   58.28    15:24:44.194440  15:24:44.407558  (1, 8, 224, 224, 8)  2       1
relu1_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    2265.57    0.62     15:24:44.407600  15:24:44.409874  (64, 1, 1)           2       1
_contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2                                         104623.15  28.61    15:24:44.409905  15:24:44.514535  (1, 8, 224, 224, 8)  2       1
relu2_NCHW2c            fuse___layout_transform___broadcast_add_relu___layout_transform___1  2004.77    0.55     15:24:44.514567  15:24:44.516582  (8, 8, 3, 3, 8, 8)   2       1
_contrib_conv2d_nchwc3  fuse__contrib_conv2d_NCHWc_3                                         25218.4    6.9      15:24:44.516628  15:24:44.541856  (1, 8, 224, 224, 8)  2       1
reshape1                fuse___layout_transform___broadcast_add_reshape_transpose_reshape    1554.25    0.43     15:24:44.541893  15:24:44.543452  (64, 1, 1)           2       1
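If you want to find the expensive layers without reading the whole table, you can parse the printed text. This is only a sketch that assumes the column layout shown above (the debug runtime of other TVM versions may print different columns); the function names are mine. The sample rows are copied from the output above.

```python
import re

# One table row: node, op, time in us, time in %, start, end, shape, inputs, outputs.
# The shape contains spaces, so it is matched by its parentheses.
ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<op>\S+)\s+(?P<time_us>[\d.]+)\s+(?P<percent>[\d.]+)\s+"
    r"(?P<start>\S+)\s+(?P<end>\S+)\s+(?P<shape>\(.*?\))\s+"
    r"(?P<inputs>\d+)\s+(?P<outputs>\d+)\s*$"
)


def parse_profile(text):
    """Parse the profiling table into a list of dicts; header lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "node": m.group("node"),
                "op": m.group("op"),
                "time_us": float(m.group("time_us")),
                "percent": float(m.group("percent")),
            })
    return rows


def hottest(rows, n=3):
    """Return the n rows with the largest execution time."""
    return sorted(rows, key=lambda r: r["time_us"], reverse=True)[:n]


SAMPLE = """\
Node Name Ops Time(us) Time(%) Start Time End Time Shape Inputs Outputs
--------- --- -------- ------- ---------- -------- ----- ------ -------
1_NCHW1c fuse___layout_transform___4 56.52 0.02 15:24:44.177475 15:24:44.177534 (1, 1, 224, 224) 1 1
_contrib_conv2d_nchwc1 fuse__contrib_conv2d_NCHWc_1 213108.6 58.28 15:24:44.194440 15:24:44.407558 (1, 8, 224, 224, 8) 2 1
_contrib_conv2d_nchwc2 fuse__contrib_conv2d_NCHWc_2 104623.15 28.61 15:24:44.409905 15:24:44.514535 (1, 8, 224, 224, 8) 2 1
"""

for row in hottest(parse_profile(SAMPLE), n=2):
    print(row["node"], row["time_us"], row["percent"])
```

In my run the second and third convolutions together account for about 87% of the time, so those are the layers worth tuning first.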
Maybe it won’t work for you exactly like this, but the steps should be similar. Good luck!
Cheers,
Robert