You can move the ‘remote’ block to after the tuning process to get the inference time cost. Your result shows ‘Upload’, so I guess you are using the inference-time-cost block. In my code, I just uncommented these blocks.
@hht
Thank you for your answer.
I used your code, except that I commented out the return so that tuning runs.
It failed after tuning completed, at lib.save(temp.relpath("graphlib.o")):
...
# We do not run the tuning in our webpage server since it takes too long.
# Comment the following line to run it by yourself.
return

# run tuning tasks
print("Tuning...")
tune_tasks(tasks, **tuning_opt)

# evaluate with tuning history
if env.TARGET != "sim":
    # Get remote from fleet node
    remote = autotvm.measure.request_remote(
        env.TARGET, tracker_host, tracker_port, timeout=10000
    )
    # Reconfigure the JIT runtime and FPGA.
    vta.reconfig_runtime(remote)
    vta.program_fpga(remote, bitstream=None)
else:
    # In simulation mode, host the RPC server locally.
    remote = rpc.LocalSession()

# compile kernels with history best records
with autotvm.tophub.context(target, extra_files=[log_file]):
    # Compile network
    print("Compile...")
    if target.device_name != "vta":
        with tvm.transform.PassContext(opt_level=3, disabled_pass={"AlterOpLayout"}):
            lib = relay.build(
                relay_prog, target=target, params=params, target_host=env.target_host
            )
    else:
        with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
            lib = relay.build(
                relay_prog, target=target, params=params, target_host=env.target_host
            )

    # Export library
    print("Upload...")
    temp = utils.tempdir()
    lib.save(temp.relpath("graphlib.o"))  # <<<<< failed here
    remote.upload(temp.relpath("graphlib.o"))
    lib = remote.load_module("graphlib.o")

    # Generate the graph runtime
    ctx = remote.ext_dev(0) if device == "vta" else remote.cpu(0)
    m = graph_runtime.GraphModule(lib["default"](ctx))

    # upload parameters to device
    image = tvm.nd.array((np.random.uniform(size=(1, 3, 224, 224))).astype("float32"))
    m.set_input("data", image)

    # evaluate
    print("Evaluate inference time cost...")
    timer = m.module.time_evaluator("run", ctx, number=1, repeat=10)
    tcost = timer()
    prof_res = np.array(tcost.results) * 1000  # convert to millisecond
    print(
        "Mean inference time (std dev): %.2f ms (%.2f ms)"
        % (np.mean(prof_res), np.std(prof_res))
    )
Thoughts?
Thank you very much,
Figured it out.
Instead of save, which likely fails due to API changes over time, export_library works; it packages the compiled module into a tar archive that the remote's load_module can import.
So here is the change, and it works:
# Export library
print("Upload...")
# temp = utils.tempdir()
# lib.save(temp.relpath("graphlib.o"))
# remote.upload(temp.relpath("graphlib.o"))
# lib = remote.load_module("graphlib.o")

# Send the inference library over to the remote RPC server
temp = utils.tempdir()
lib.export_library(temp.relpath("graphlib.tar"))
remote.upload(temp.relpath("graphlib.tar"))
lib = remote.load_module("graphlib.tar")
Extract tasks...
Extracted 10 conv2d tasks:
(1, 14, 14, 256, 512, 1, 1, 0, 0, 2, 2)
(1, 28, 28, 128, 256, 1, 1, 0, 0, 2, 2)
(1, 56, 56, 64, 128, 1, 1, 0, 0, 2, 2)
(1, 56, 56, 64, 64, 3, 3, 1, 1, 1, 1)
(1, 28, 28, 128, 128, 3, 3, 1, 1, 1, 1)
(1, 56, 56, 64, 128, 3, 3, 1, 1, 2, 2)
(1, 14, 14, 256, 256, 3, 3, 1, 1, 1, 1)
(1, 28, 28, 128, 256, 3, 3, 1, 1, 2, 2)
(1, 7, 7, 512, 512, 3, 3, 1, 1, 1, 1)
(1, 14, 14, 256, 512, 3, 3, 1, 1, 2, 2)
Tuning...
[Task 1/10] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (10/10) | 1.87 s
WARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_packed.vta, args=(('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (32, 16, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d_packed.vta', ('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (32, 16, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_ys3zu_2v.log.
INFO:autotvm:Get devices for measurement successfully!
[Task 2/10] Current/Best: 0.00/ 70.42 GFLOPS | Progress: (10/10) | 6.26 s
INFO:autotvm:Get devices for measurement successfully!
[Task 3/10] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (10/10) | 4.79 s
WARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_packed.vta, args=(('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (16, 8, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d_packed.vta', ('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (16, 8, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_fqpg5ysx.log.
INFO:autotvm:Get devices for measurement successfully!
[Task 4/10] Current/Best: 31.80/ 31.80 GFLOPS | Progress: (10/10) | 8.19 s
INFO:autotvm:Get devices for measurement successfully!
[Task 5/10] Current/Best: 0.00/ 25.89 GFLOPS | Progress: (10/10) | 5.99 s
INFO:autotvm:Get devices for measurement successfully!
[Task 6/10] Current/Best: 0.00/ 72.11 GFLOPS | Progress: (10/10) | 6.80 s
INFO:autotvm:Get devices for measurement successfully!
[Task 7/10] Current/Best: 0.00/ 19.19 GFLOPS | Progress: (10/10) | 5.65 s
INFO:autotvm:Get devices for measurement successfully!
[Task 8/10] Current/Best: 0.00/ 5.28 GFLOPS | Progress: (10/10) | 7.45 s
INFO:autotvm:Get devices for measurement successfully!
[Task 9/10] Current/Best: 1.21/ 5.83 GFLOPS | Progress: (10/10) | 14.37 s
INFO:autotvm:Get devices for measurement successfully!
[Task 10/10] Current/Best: 0.00/ 6.53 GFLOPS | Progress: (10/10) | 4.38 s
INFO:autotvm:Extract 10 best records from the vta.resnet18_v1.log.tmp
Compile...
Upload...
Evaluate inference time cost...
Mean inference time (std dev): 69.65 ms (2.26 ms)
@hht @isong Thank you for sharing your solutions and issues. I’m also trying to run this “tune_relay_vta.py”. After downloading tvm v0.7 (released 2020-10-02), I changed “vta_config.json” (TARGET: “sim” -> “pynq” in /3rdparty/vta-hw/config/vta_config.json) and saved a bitstream file “1x16_i8w8a32_15_15_18_17.bit” in the local directory to avoid the broken-link issue. Then, in the build directory, I set the llvm path in “config.cmake” for the host PC. Likewise, I changed USE_VTA_FPGA from OFF to ON in “config.cmake” for the pynq board. There seemed to be no problems with the device connection through the tracker or with the bitstream file.
However, when I run “tune_relay_vta.py” (with the code @hht explained commented out), the reported performance never rises above 0.00/ 0.00 GFLOPS. Could you give me some advice? Do I need to change any other code?
Hi @thkim
If you have updated your code from the above github link and applied my export_library change, then let me check a few more things.
It is probably not a connection issue, since you do see the GFLOPS (always 0.00/ 0.00 GFLOPS) message.
Does it ever finish?
How many iterations have you set?
Do other tests that do not use autotvm work?
Cheers, ISS
Check dmesg on your remote device. Sometimes 0.00 GFLOPS means no bitstream was written to the FPGA.
Make sure you start the remote device server as root, because writing the bitstream needs root privileges.
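A quick way to rule out the missing-bitstream case is to program the FPGA explicitly with a local bit file and then watch dmesg on the board. A minimal sketch, assuming a direct RPC session to the board; the address, port, and bit-file path are placeholders:

# Program the FPGA with an explicit local bitstream, bypassing the
# cache/download path, then check `dmesg` on the board for the write.
from tvm import rpc
import vta

remote = rpc.connect("192.168.2.99", 9091)  # placeholder board address/port
vta.reconfig_runtime(remote)                # reset the VTA JIT runtime
vta.program_fpga(remote, bitstream="1x16_i8w8a32_15_15_18_17.bit")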
@isong @hht Thank you so much for your reply. I think I found the problem. When I ran the tutorial “Deploy Pretrained Vision Model from MxNet on VTA”, I could write a bitstream file to the FPGA, and it worked well. I had written vta.program_fpga(remote, bitstream="bitfiles/v4/1x16_i8w8a32_15_15_18_17.bit") in the sample code, and I verified the write through dmesg.
However, when I run the tutorial “Auto-tuning a convolutional network on VTA” (tune_relay_vta_with_one_board.py), I couldn’t write a bitstream file to the FPGA. During tuning, it seems to call the program_fpga function of rpc_client.py every 8 iterations, with bitstream="/home/taeho/.vta_cache/pynq/0_0_1/1x16_i8w8a32_15_15_18_17.bit". When I checked dmesg on my pynq board, the bitstream write to the FPGA had failed. I couldn’t find the exact reason.
For the “Deploy Pretrained Vision Model from MxNet on VTA” tutorial, I ran “sudo apps/vta_rpc/start_rpc_server.sh” on the pynq. For tune_relay_vta_with_one_board.py, I ran “sudo apps/vta_rpc/start_rpc_server_to_tracker.sh” on the pynq. @hht I’m not sure whether it is correct to start the remote device server as root.
Again, thank you so much for your help!
Hi @thkim
Did you check whether your bit file is at /home/taeho/.vta_cache/pynq/0_0_1/1x16_i8w8a32_15_15_18_17.bit?
In my case, I copied the bit file to that directory.
I also use sudo to run the tracker:
pushd apps/vta_rpc
sudo -E ./start_rpc_server_to_tracker.sh
popd
Hi @thkim
I use su to start the tracker and set the corresponding environment variables:
TVM_HOME="/home/xilinx/tvm"
VTA_HW_PATH="/home/xilinx/tvm/3rdparty/vta-hw"
PYTHONPATH="/home/xilinx/tvm/python:/home/xilinx/tvm/topi/python:/home/xilinx/tvm/vta/python"
root@pynq:/home/xilinx# cat tvm/apps/vta_rpc/start_rpc_server_to_tracker.sh
#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
PROJROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )/../../" && pwd )"
# Derive target specified by vta_config.json
VTA_CONFIG=/home/xilinx/tvm/3rdparty/vta-hw/config/vta_config.py
TARGET=$(python ${VTA_CONFIG} --target)
export PYTHONPATH=${PYTHONPATH}:${PROJROOT}/python:${PROJROOT}/vta/python
export PYTHONPATH=${PYTHONPATH}:/home/xilinx/pynq
python3 -m vta.exec.rpc_server --tracker 192.168.0.114:9190 --key $TARGET
@isong @hht Thank you so much for your responses. Although I checked that the bit file is at /home/taeho/.vta_cache/pynq/0_0_1/1x16_i8w8a32_15_15_18_17.bit, there was still a problem. Instead, in rpc_client.py, I set bitstream = "dir", where "dir" is the path of the bit file I had saved in the local directory. It works now. Thanks! Yes, I also set the environment variables like you did!
I’ll keep working on this, so I hope we can share more in the future. Thank you again!
@thkim Glad to hear your good news. I have been working to fix the memory leak caused by RPC exceptions. VTA uses CMA memory, and CMA is managed by /dev/xlnk. Directly terminating the process releases only heap and stack memory, so the CMA allocations still leak. I have used two workarounds but can’t find an elegant solution. There are a lot of things to do and to discuss.
Hi @hht
Interesting; what are your two workarounds? Would you mind sharing them?
Thank you,
@thkim, glad to hear that you sorted things out.
Cheers,
One way is to create a global class that records every CMA alloc and free. When an exception occurs, the Python RPC server first uses a C API to invoke the Xilinx cma_free() and release the CMA memory, then terminates the process.
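A minimal sketch of that bookkeeping, assuming a hypothetical cma_free binding for the Xilinx driver call (names are illustrative, not the real VTA runtime API):

# Global registry of live CMA buffers so the RPC server can free them on
# its exception path before terminating the process.
class CMATracker:
    def __init__(self):
        self._live = set()  # CMA handles still owned by this process

    def on_alloc(self, handle):
        self._live.add(handle)

    def on_free(self, handle):
        self._live.discard(handle)

    def release_all(self, cma_free):
        # cma_free is the (hypothetical) binding to the Xilinx driver call;
        # freeing here avoids leaking the CMA region behind /dev/xlnk.
        for handle in list(self._live):
            cma_free(handle)
            self._live.discard(handle)

CMA_TRACKER = CMATracker()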
Another way is to avoid the RPC exception altogether and improve tuning efficiency. I replace the batch-300 input tensor with a batch-1 input tensor and use a simple Python script to generate a simpler net. My assumption is that, with GEMM, a batch-300 conv2d shares nearly the same optimal schedule as a batch-1 conv2d.
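For example, generating the simpler batch-1 net could look like this sketch (shapes are illustrative, matching one of the ResNet-18 conv2d workloads above):

# Build a batch-1 conv2d net to tune in place of the large-batch original.
from tvm import relay

def make_conv2d_net(batch, in_ch, out_ch, size, kernel, strides, padding):
    data = relay.var("data", shape=(batch, in_ch, size, size), dtype="float32")
    weight = relay.var("weight", shape=(out_ch, in_ch, kernel, kernel), dtype="float32")
    out = relay.nn.conv2d(data, weight, strides=strides, padding=padding)
    return relay.Function(relay.analysis.free_vars(out), out)

# The schedule tuned for batch 1 is expected to transfer to the batch-300
# workload, since the GEMM tiling is largely batch-independent.
func = make_conv2d_net(1, 64, 64, 56, 3, (1, 1), (1, 1))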
Hi @hht
Thank you for sharing your work. The second option seems to make more sense, though improving the RPC exception handling would also be worthwhile.
Cheers, ISS
Hi,
I want to check several things.
- Does your tutorial work with the ‘xgb’ tuner?
- I tried to compare no tuning against tuning and failed to see any improvement in execution time after tuning. Did you see an improvement?
- @isong I’m using one PYNQ-Z1 board. When I ran the tutorial, the mean inference time was 365 ~ 372 ms. Your example shows 69.65 ms. Could you give me some advice to decrease the mean inference time?
Thanks!
- I am not sure, but it works with the random tuner.
- Because you are comparing against the well-tuned TopHub parameters. If you start from the fallback config instead, you will see a great improvement; see the sketch below.
- I use a PYNQ-Z1 and a ZCU104. I think 365 ms is reasonable.
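A sketch of that comparison, assuming autotvm.FallbackContext is exposed at this path in your TVM version (relay_prog, params, target, env, and log_file come from the tutorial above):

# Compile once with the TopHub/tuning-log schedules and once with fallback
# configs, then time both builds the same way the tutorial does.
from tvm import autotvm, relay
import vta

# Tuned: TopHub entries plus your own tuning log.
with autotvm.tophub.context(target, extra_files=[log_file]):
    with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
        lib_tuned = relay.build(
            relay_prog, target=target, params=params, target_host=env.target_host
        )

# Untuned reference: force fallback configs so no tuned schedules apply.
with autotvm.FallbackContext():
    with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
        lib_fallback = relay.build(
            relay_prog, target=target, params=params, target_host=env.target_host
        )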