You can move the ‘remote’ block to after the tuning process to get the inference time cost. Your result shows ‘Upload’, so I guess you are using the inference-time-cost block. In my code, I just uncommented these blocks.
@hht
Thank you for your answer.
I used your code, except that I commented out the return so that tuning runs.
It failed after tuning completed, at lib.save(temp.relpath("graphlib.o")):
...
# We do not run the tuning in our webpage server since it takes too long.
# Comment the following line to run it by yourself.
return

# run tuning tasks
print("Tuning...")
tune_tasks(tasks, **tuning_opt)

# evaluate with tuning history
if env.TARGET != "sim":
    # Get remote from fleet node
    remote = autotvm.measure.request_remote(
        env.TARGET, tracker_host, tracker_port, timeout=10000
    )
    # Reconfigure the JIT runtime and FPGA.
    vta.reconfig_runtime(remote)
    vta.program_fpga(remote, bitstream=None)
else:
    # In simulation mode, host the RPC server locally.
    remote = rpc.LocalSession()

# compile kernels with history best records
with autotvm.tophub.context(target, extra_files=[log_file]):
    # Compile network
    print("Compile...")
    if target.device_name != "vta":
        with tvm.transform.PassContext(opt_level=3, disabled_pass={"AlterOpLayout"}):
            lib = relay.build(
                relay_prog, target=target, params=params, target_host=env.target_host
            )
    else:
        with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
            lib = relay.build(
                relay_prog, target=target, params=params, target_host=env.target_host
            )

    # Export library
    print("Upload...")
    temp = utils.tempdir()
    lib.save(temp.relpath("graphlib.o"))  # <<<<< failed here
    remote.upload(temp.relpath("graphlib.o"))
    lib = remote.load_module("graphlib.o")

    # Generate the graph runtime
    ctx = remote.ext_dev(0) if device == "vta" else remote.cpu(0)
    m = graph_runtime.GraphModule(lib["default"](ctx))

    # upload parameters to device
    image = tvm.nd.array((np.random.uniform(size=(1, 3, 224, 224))).astype("float32"))
    m.set_input("data", image)

    # evaluate
    print("Evaluate inference time cost...")
    timer = m.module.time_evaluator("run", ctx, number=1, repeat=10)
    tcost = timer()
    prof_res = np.array(tcost.results) * 1000  # convert to millisecond
    print(
        "Mean inference time (std dev): %.2f ms (%.2f ms)"
        % (np.mean(prof_res), np.std(prof_res))
    )
Thoughts?
Thank you very much,
Figured it out.
Instead of save, which likely fails due to API changes over time, export_library works; it packages the compiled module into a tar archive that the remote's load_module can import.
So here is the change, and it works:
# Export library
print("Upload...")
# temp = utils.tempdir()
# lib.save(temp.relpath("graphlib.o"))
# remote.upload(temp.relpath("graphlib.o"))
# lib = remote.load_module("graphlib.o")

# Send the inference library over to the remote RPC server
temp = utils.tempdir()
lib.export_library(temp.relpath("graphlib.tar"))
remote.upload(temp.relpath("graphlib.tar"))
lib = remote.load_module("graphlib.tar")
Extract tasks...
Extracted 10 conv2d tasks:
(1, 14, 14, 256, 512, 1, 1, 0, 0, 2, 2)
(1, 28, 28, 128, 256, 1, 1, 0, 0, 2, 2)
(1, 56, 56, 64, 128, 1, 1, 0, 0, 2, 2)
(1, 56, 56, 64, 64, 3, 3, 1, 1, 1, 1)
(1, 28, 28, 128, 128, 3, 3, 1, 1, 1, 1)
(1, 56, 56, 64, 128, 3, 3, 1, 1, 2, 2)
(1, 14, 14, 256, 256, 3, 3, 1, 1, 1, 1)
(1, 28, 28, 128, 256, 3, 3, 1, 1, 2, 2)
(1, 7, 7, 512, 512, 3, 3, 1, 1, 1, 1)
(1, 14, 14, 256, 512, 3, 3, 1, 1, 2, 2)
Tuning...
[Task 1/10] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (10/10) | 1.87 s
WARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_packed.vta, args=(('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (32, 16, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d_packed.vta', ('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (32, 16, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_ys3zu_2v.log.
INFO:autotvm:Get devices for measurement successfully!
[Task 2/10] Current/Best: 0.00/ 70.42 GFLOPS | Progress: (10/10) | 6.26 s
INFO:autotvm:Get devices for measurement successfully!
[Task 3/10] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (10/10) | 4.79 s
WARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_packed.vta, args=(('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (16, 8, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d_packed.vta', ('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (16, 8, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_fqpg5ysx.log.
INFO:autotvm:Get devices for measurement successfully!
[Task 4/10] Current/Best: 31.80/ 31.80 GFLOPS | Progress: (10/10) | 8.19 s
INFO:autotvm:Get devices for measurement successfully!
[Task 5/10] Current/Best: 0.00/ 25.89 GFLOPS | Progress: (10/10) | 5.99 s
INFO:autotvm:Get devices for measurement successfully!
[Task 6/10] Current/Best: 0.00/ 72.11 GFLOPS | Progress: (10/10) | 6.80 s
INFO:autotvm:Get devices for measurement successfully!
[Task 7/10] Current/Best: 0.00/ 19.19 GFLOPS | Progress: (10/10) | 5.65 s
INFO:autotvm:Get devices for measurement successfully!
[Task 8/10] Current/Best: 0.00/ 5.28 GFLOPS | Progress: (10/10) | 7.45 s
INFO:autotvm:Get devices for measurement successfully!
[Task 9/10] Current/Best: 1.21/ 5.83 GFLOPS | Progress: (10/10) | 14.37 s
INFO:autotvm:Get devices for measurement successfully!
[Task 10/10] Current/Best: 0.00/ 6.53 GFLOPS | Progress: (10/10) | 4.38 s
INFO:autotvm:Extract 10 best records from the vta.resnet18_v1.log.tmp
Compile...
Upload...
Evaluate inference time cost...
Mean inference time (std dev): 69.65 ms (2.26 ms)
@hht @isong Thank you for sharing your solutions and issues. I’m also trying to run this “tune_relay_vta.py”. After downloading tvm v0.7 (released 2020-10-02), I changed “vta_config.json” (TARGET: “sim” -> “pynq” in /3rdparty/vta-hw/config/vta_config.json) and saved a bitstream file “1x16_i8w8a32_15_15_18_17.bit” in the local directory to avoid the broken-link issue. Then, in the build directory, I set the llvm path in “config.cmake” for the host PC. Likewise, I changed USE_VTA_FPGA from OFF to ON in “config.cmake” for the pynq board. There seemed to be no problems with the device connection through the tracker or with the bitstream file.
However, when I run “tune_relay_vta.py” (with the code @hht explained commented out), the reported performance never rises above 0.00/ 0.00 GFLOPS. Could you give me some advice? Do I need to change any other code?
Hi @thkim
If you have updated your code from the above github link and applied my export_library change, then let me check a few more things.
It is probably not a connection issue, since you do see the GFLOPS (always 0.00/ 0.00 GFLOPS) message.
Does it ever finish?
How many iterations have you set?
Do other tests that do not use autotvm work?
Cheers, ISS
Check dmesg on your remote device. Sometimes 0.00 GFLOPS means no bitstream was written to the FPGA.
Make sure you start the remote device server as root, because writing the bitstream needs root privileges.
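A quick way to rule out the missing-bitstream case is to program the FPGA explicitly with a local bit file and then watch dmesg on the board. A minimal sketch, assuming a direct RPC session to the board; the address, port, and bit-file path are placeholders:

# Program the FPGA with an explicit local bitstream, bypassing the
# cache/download path, then check `dmesg` on the board for the write.
from tvm import rpc
import vta

remote = rpc.connect("192.168.2.99", 9091)  # placeholder board address/port
vta.reconfig_runtime(remote)                # reset the VTA JIT runtime
vta.program_fpga(remote, bitstream="1x16_i8w8a32_15_15_18_17.bit")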
@isong @hht Thank you so much for your reply. I think I found the problem. When I ran the tutorial “Deploy Pretrained Vision Model from MxNet on VTA”, I could write a bitstream file to the FPGA, and it worked well. I had written vta.program_fpga(remote, bitstream="bitfiles/v4/1x16_i8w8a32_15_15_18_17.bit") in the sample code, and I verified the write through dmesg.
However, when I run the tutorial “Auto-tuning a convolutional network on VTA” (tune_relay_vta_with_one_board.py), I couldn’t write a bitstream file to the FPGA. During tuning, it seems to call the program_fpga function of rpc_client.py every 8 iterations, with bitstream="/home/taeho/.vta_cache/pynq/0_0_1/1x16_i8w8a32_15_15_18_17.bit". When I checked dmesg on my pynq board, the bitstream write to the FPGA had failed. I couldn’t find the exact reason.
For the “Deploy Pretrained Vision Model from MxNet on VTA” tutorial, I ran “sudo apps/vta_rpc/start_rpc_server.sh” on the pynq. For tune_relay_vta_with_one_board.py, I ran “sudo apps/vta_rpc/start_rpc_server_to_tracker.sh” on the pynq. @hht I’m not sure whether it is correct to start the remote device server as root.
Again, thank you so much for your help!
Hi @thkim
Did you check whether your bit file is at /home/taeho/.vta_cache/pynq/0_0_1/1x16_i8w8a32_15_15_18_17.bit?
In my case, I copied the bit file to that directory.
I also use sudo to run the tracker:
pushd apps/vta_rpc
sudo -E ./start_rpc_server_to_tracker.sh
popd
Hi @thkim
I use su to start the tracker and set the corresponding environment variables:
TVM_HOME="/home/xilinx/tvm"
VTA_HW_PATH="/home/xilinx/tvm/3rdparty/vta-hw"
PYTHONPATH="/home/xilinx/tvm/python:/home/xilinx/tvm/topi/python:/home/xilinx/tvm/vta/python"
root@pynq:/home/xilinx# cat tvm/apps/vta_rpc/start_rpc_server_to_tracker.sh
#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
PROJROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )/../../" && pwd )"
# Derive target specified by vta_config.json
VTA_CONFIG=/home/xilinx/tvm/3rdparty/vta-hw/config/vta_config.py
TARGET=$(python ${VTA_CONFIG} --target)
export PYTHONPATH=${PYTHONPATH}:${PROJROOT}/python:${PROJROOT}/vta/python
export PYTHONPATH=${PYTHONPATH}:/home/xilinx/pynq
python3 -m vta.exec.rpc_server --tracker 192.168.0.114:9190 --key $TARGET
@isong @hht Thank you so much for your responses. Although I checked that the bit file is at /home/taeho/.vta_cache/pynq/0_0_1/1x16_i8w8a32_15_15_18_17.bit, there was still a problem. Instead, in rpc_client.py, I set bitstream = "dir", where "dir" is the path of the bit file I had saved in the local directory. It works now. Thanks! Yes, I also set the environment variables like you did!
I’ll keep working on this, so I hope we can share more in the future. Thank you again!
@thkim Glad to hear your good news. I have been working to fix the memory leak caused by RPC exceptions. VTA uses CMA memory, and CMA is managed by /dev/xlnk. Directly terminating the process releases only heap and stack memory, so the CMA allocations still leak. I have used two workarounds but can’t find an elegant solution. There are a lot of things to do and to discuss.
Hi @hht
Interesting; what are your two workarounds? Would you mind sharing them?
Thank you,
@thkim, glad to hear that you sorted things out.
Cheers,
One way is to create a global class that records every CMA alloc and free. When an exception occurs, the Python RPC server first uses a C API to invoke the Xilinx cma_free() and release the CMA memory, then terminates the process.
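A minimal sketch of that bookkeeping, assuming a hypothetical cma_free binding for the Xilinx driver call (names are illustrative, not the real VTA runtime API):

# Global registry of live CMA buffers so the RPC server can free them on
# its exception path before terminating the process.
class CMATracker:
    def __init__(self):
        self._live = set()  # CMA handles still owned by this process

    def on_alloc(self, handle):
        self._live.add(handle)

    def on_free(self, handle):
        self._live.discard(handle)

    def release_all(self, cma_free):
        # cma_free is the (hypothetical) binding to the Xilinx driver call;
        # freeing here avoids leaking the CMA region behind /dev/xlnk.
        for handle in list(self._live):
            cma_free(handle)
            self._live.discard(handle)

CMA_TRACKER = CMATracker()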
Another way is to avoid the RPC exception altogether and improve tuning efficiency. I replace the batch-300 input tensor with a batch-1 input tensor and use a simple Python script to generate a simpler net. My assumption is that, with GEMM, a batch-300 conv2d shares nearly the same optimal schedule as a batch-1 conv2d.
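For example, generating the simpler batch-1 net could look like this sketch (shapes are illustrative, matching one of the ResNet-18 conv2d workloads above):

# Build a batch-1 conv2d net to tune in place of the large-batch original.
from tvm import relay

def make_conv2d_net(batch, in_ch, out_ch, size, kernel, strides, padding):
    data = relay.var("data", shape=(batch, in_ch, size, size), dtype="float32")
    weight = relay.var("weight", shape=(out_ch, in_ch, kernel, kernel), dtype="float32")
    out = relay.nn.conv2d(data, weight, strides=strides, padding=padding)
    return relay.Function(relay.analysis.free_vars(out), out)

# The schedule tuned for batch 1 is expected to transfer to the batch-300
# workload, since the GEMM tiling is largely batch-independent.
func = make_conv2d_net(1, 64, 64, 56, 3, (1, 1), (1, 1))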
Hi @hht
Thank you for sharing your work. The second option seems to make more sense, though improving the RPC exception handling would also be worthwhile.
Cheers, ISS
Hi,
I want to check several things.
- Does your tutorial work with the ‘xgb’ tuner?
- I tried to compare no tuning against tuning and failed to see any improvement in execution time after tuning. Did you see an improvement?
- @isong I’m using one PYNQ-Z1 board. When I ran the tutorial, the mean inference time was 365 ~ 372 ms. Your example shows 69.65 ms. Could you give me some advice to decrease the mean inference time?
Thanks!
- I am not sure, but it works with the random tuner.
- Because you are comparing against the well-tuned TopHub parameters. If you start from the fallback config instead, you will see a great improvement; see the sketch below.
- I use a PYNQ-Z1 and a ZCU104. I think 365 ms is reasonable.
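A sketch of that comparison, assuming autotvm.FallbackContext is exposed at this path in your TVM version (relay_prog, params, target, env, and log_file come from the tutorial above):

# Compile once with the TopHub/tuning-log schedules and once with fallback
# configs, then time both builds the same way the tutorial does.
from tvm import autotvm, relay
import vta

# Tuned: TopHub entries plus your own tuning log.
with autotvm.tophub.context(target, extra_files=[log_file]):
    with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
        lib_tuned = relay.build(
            relay_prog, target=target, params=params, target_host=env.target_host
        )

# Untuned reference: force fallback configs so no tuned schedules apply.
with autotvm.FallbackContext():
    with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
        lib_fallback = relay.build(
            relay_prog, target=target, params=params, target_host=env.target_host
        )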