[VTA] Workaround for Autotuning with One PYNQ Z1 Board

I found a workaround for autotuning with a single PYNQ board and located the cause of the problem. In the VTA autotuning tutorial, there is a handle named remote.

The remote handle does two things. The first is to program the FPGA:

    if env.TARGET != "sim":
        # Get remote from fleet node
        remote = autotvm.measure.request_remote(
            env.TARGET, tracker_host, tracker_port, timeout=10000
        )
        # Reconfigure the JIT runtime and FPGA.
        vta.reconfig_runtime(remote)
        vta.program_fpga(remote, bitstream=None)
    else:
        # In simulation mode, host the RPC server locally.
        remote = rpc.LocalSession()

The second is to run the whole network and report the result after autotuning:

    # compile kernels with history best records
    with autotvm.tophub.context(target, extra_files=[log_file]):
        # Compile network
        print("Compile...")
        if target.device_name != "vta":
            with tvm.transform.PassContext(opt_level=3, disabled_pass={"AlterOpLayout"}):
                lib = relay.build(
                    relay_prog, target=target, params=params, target_host=env.target_host
                )
        else:
            with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
                lib = relay.build(
                    relay_prog, target=target, params=params, target_host=env.target_host
                )

        # Export library
        print("Upload...")
        temp = util.tempdir()
        lib.save(temp.relpath("graphlib.o"))
        remote.upload(temp.relpath("graphlib.o"))
        lib = remote.load_module("graphlib.o")

        # Generate the graph runtime
        ctx = remote.ext_dev(0) if device == "vta" else remote.cpu(0)
        m = graph_runtime.GraphModule(lib["default"](ctx))

        # upload parameters to device
        image = tvm.nd.array((np.random.uniform(size=(1, 3, 224, 224))).astype("float32"))
        m.set_input("data", image)

        # evaluate
        print("Evaluate inference time cost...")
        timer = m.module.time_evaluator("run", ctx, number=1, repeat=10)
        tcost = timer()
        prof_res = np.array(tcost.results) * 1000  # convert to millisecond
        print(
            "Mean inference time (std dev): %.2f ms (%.2f ms)"
            % (np.mean(prof_res), np.std(prof_res))
        )

The remote session occupies the device the whole time, but it plays no role in autotuning itself; with only one board, the tuner's measurement requests can never obtain the device. So my workaround is to comment out the code above to remove the remote, and it works.
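
Concretely, tune_and_evaluate then reduces to something like the following (a minimal sketch against the v0.7 tutorial; names follow the tutorial, and the evaluation block at the end is commented out the same way):

    # Workaround sketch: hold no RPC session while tuning, so the single
    # PYNQ board stays free for the AutoTVM measurement workers.
    #
    # if env.TARGET != "sim":
    #     remote = autotvm.measure.request_remote(
    #         env.TARGET, tracker_host, tracker_port, timeout=10000
    #     )
    #     vta.reconfig_runtime(remote)
    #     vta.program_fpga(remote, bitstream=None)
    # else:
    #     remote = rpc.LocalSession()

    # run tuning tasks
    print("Tuning...")
    tune_tasks(tasks, **tuning_opt)

With the remote removed, tuning on the single board proceeds normally: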

Extract tasks...
Extracted 10 conv2d tasks:
(1, 14, 14, 256, 512, 1, 1, 0, 0, 2, 2)
(1, 28, 28, 128, 256, 1, 1, 0, 0, 2, 2)
(1, 56, 56, 64, 128, 1, 1, 0, 0, 2, 2)
(1, 56, 56, 64, 64, 3, 3, 1, 1, 1, 1)
(1, 28, 28, 128, 128, 3, 3, 1, 1, 1, 1)
(1, 56, 56, 64, 128, 3, 3, 1, 1, 2, 2)
(1, 14, 14, 256, 256, 3, 3, 1, 1, 1, 1)
(1, 28, 28, 128, 256, 3, 3, 1, 1, 2, 2)
(1, 7, 7, 512, 512, 3, 3, 1, 1, 1, 1)
(1, 14, 14, 256, 512, 3, 3, 1, 1, 2, 2)
Tuning...
[Task  1/10]  Current/Best:    0.00/  28.79 GFLOPS | Progress: (480/480) | 306.61 s Done.
[Task  2/10]  Current/Best:    0.00/  31.41 GFLOPS | Progress: (576/576) | 389.47 s Done.
[Task  3/10]  Current/Best:    0.00/  43.20 GFLOPS | Progress: (1000/1000) | 667.90 s Done.
[Task  4/10]  Current/Best:    0.00/  46.37 GFLOPS | Progress: (1000/1000) | 564.08 s Done.
[Task  5/10]  Current/Best:    0.00/  38.90 GFLOPS | Progress: (1000/1000) | 641.09 s Done.
[Task  6/10]  Current/Best:    0.00/  44.39 GFLOPS | Progress: (1000/1000) | 560.03 s Done.
[Task  7/10]  Current/Best:    0.00/  40.67 GFLOPS | Progress: (1000/1000) | 731.33 s Done.
[Task  8/10]  Current/Best:    0.00/   9.58 GFLOPS | Progress: (1000/1000) | 1046.03 s Done.
[Task  9/10]  Current/Best:    0.00/  12.51 GFLOPS | Progress: (1000/1000) | 1276.48 s Done.
[Task 10/10]  Current/Best:    0.31/  11.95 GFLOPS | Progress: (480/480) | 619.91 s Done.

I found that the problem had been reported a year earlier.

cc @youn123 @diamantopoulos @zhanghaohit @ffffc

May I ask what version of the code (incubator-tvm) you used?

0.7 release 2020-10-02


@hht

Thank you so much, this fixed my issue.

@hht

I got the tutorial working; however, it failed after the tune. Have you seen this error?

Thank you,

$ python3 ./tune_relay_vta_one_board.py 
Extract tasks...
Extracted 10 conv2d tasks:
(1, 14, 14, 256, 512, 1, 1, 0, 0, 2, 2)
(1, 28, 28, 128, 256, 1, 1, 0, 0, 2, 2)
(1, 56, 56, 64, 128, 1, 1, 0, 0, 2, 2)
(1, 56, 56, 64, 64, 3, 3, 1, 1, 1, 1)
(1, 28, 28, 128, 128, 3, 3, 1, 1, 1, 1)
(1, 56, 56, 64, 128, 3, 3, 1, 1, 2, 2)
(1, 14, 14, 256, 256, 3, 3, 1, 1, 1, 1)
(1, 28, 28, 128, 256, 3, 3, 1, 1, 2, 2)
(1, 7, 7, 512, 512, 3, 3, 1, 1, 1, 1)
(1, 14, 14, 256, 512, 3, 3, 1, 1, 2, 2)
Tuning...
[Task 10/10]  Current/Best:    4.91/  22.07 GFLOPS | Progress: (100/100) | 83.85 sCompile...

Upload...
Traceback (most recent call last):
  File "./tune_relay_vta_one_board.py", line 457, in <module>
    tune_and_evaluate(tuning_option)
  File "./tune_relay_vta_one_board.py", line 432, in tune_and_evaluate
    temp = utils.tempdir()
AttributeError: 'GraphRuntimeFactoryModule' object has no attribute 'save'

So it failed at lib.save(temp.relpath("graphlib.o")).

You can move the 'remote' block to after the tuning process to get the inference time cost. Your output shows 'Upload', so I guess you are using the inference-time-cost block. In my code, I just uncommented these blocks.

@hht Thank you for your answer. I used your code, except that I commented out the return so that I could tune. It failed after the tune was done, at lib.save(temp.relpath("graphlib.o")).

    ...
     # We do not run the tuning in our webpage server since it takes too long.
     # Comment the following line to run it by yourself.
     return
 
     # run tuning tasks
     print("Tuning...")
     tune_tasks(tasks, **tuning_opt)
 
     # evaluate with tuning history
     if env.TARGET != "sim":
         # Get remote from fleet node
         remote = autotvm.measure.request_remote(
             env.TARGET, tracker_host, tracker_port, timeout=10000
         )
         # Reconfigure the JIT runtime and FPGA.
         vta.reconfig_runtime(remote)
         vta.program_fpga(remote, bitstream=None)
     else:
         # In simulation mode, host the RPC server locally.
         remote = rpc.LocalSession()
 
     # compile kernels with history best records
     with autotvm.tophub.context(target, extra_files=[log_file]):
         # Compile network
         print("Compile...")
         if target.device_name != "vta":
             with tvm.transform.PassContext(opt_level=3, disabled_pass={"AlterOpLayout"}):
                 lib = relay.build(
                     relay_prog, target=target, params=params, target_host=env.target_host
                 )
         else:
             with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
                 lib = relay.build(
                     relay_prog, target=target, params=params, target_host=env.target_host
                 )
 
         # Export library
         print("Upload...")
         temp = utils.tempdir()
     lib.save(temp.relpath("graphlib.o"))  # <<<<< failed here
         remote.upload(temp.relpath("graphlib.o"))
         lib = remote.load_module("graphlib.o")
 
         # Generate the graph runtime
         ctx = remote.ext_dev(0) if device == "vta" else remote.cpu(0)
         m = graph_runtime.GraphModule(lib["default"](ctx))
 
         # upload parameters to device
         image = tvm.nd.array((np.random.uniform(size=(1, 3, 224, 224))).astype("float32"))
         m.set_input("data", image)
 
         # evaluate
         print("Evaluate inference time cost...")
         timer = m.module.time_evaluator("run", ctx, number=1, repeat=10)
         tcost = timer()
         prof_res = np.array(tcost.results) * 1000  # convert to millisecond
         print(
             "Mean inference time (std dev): %.2f ms (%.2f ms)"
             % (np.mean(prof_res), np.std(prof_res))
         )

Thoughts?

Thank you very much,

Figured it out. Instead of save, export_library works; the failure is likely due to API changes over time (relay.build now returns a graph runtime factory module, which is exported with export_library rather than saved directly).

So here is the change, and it works:

        # Export library
        print("Upload...")
        #temp = utils.tempdir()
        #lib.save(temp.relpath("graphlib.o"))
        #remote.upload(temp.relpath("graphlib.o"))
        #lib = remote.load_module("graphlib.o")

        # Send the inference library over to the remote RPC server
        temp = utils.tempdir()
        lib.export_library(temp.relpath("graphlib.tar"))
        remote.upload(temp.relpath("graphlib.tar"))
        lib = remote.load_module("graphlib.tar")
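
If you are unsure which kind of module relay.build returned, a quick check (a hypothetical snippet, not part of the tutorial) is:

    # Inspect the compiled artifact; on TVM 0.7 this prints the
    # GraphRuntimeFactoryModule class, which has export_library() but no save().
    print(type(lib))

With export_library in place, the run completes end to end: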
Extract tasks...
Extracted 10 conv2d tasks:
(1, 14, 14, 256, 512, 1, 1, 0, 0, 2, 2)
(1, 28, 28, 128, 256, 1, 1, 0, 0, 2, 2)
(1, 56, 56, 64, 128, 1, 1, 0, 0, 2, 2)
(1, 56, 56, 64, 64, 3, 3, 1, 1, 1, 1)
(1, 28, 28, 128, 128, 3, 3, 1, 1, 1, 1)
(1, 56, 56, 64, 128, 3, 3, 1, 1, 2, 2)
(1, 14, 14, 256, 256, 3, 3, 1, 1, 1, 1)
(1, 28, 28, 128, 256, 3, 3, 1, 1, 2, 2)
(1, 7, 7, 512, 512, 3, 3, 1, 1, 1, 1)
(1, 14, 14, 256, 512, 3, 3, 1, 1, 2, 2)
Tuning...
[Task  1/10]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 1.87 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_packed.vta, args=(('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (32, 16, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d_packed.vta', ('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (32, 16, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_ys3zu_2v.log.
INFO:autotvm:Get devices for measurement successfully!
[Task  2/10]  Current/Best:    0.00/  70.42 GFLOPS | Progress: (10/10) | 6.26 sINFO:autotvm:Get devices for measurement successfully!
[Task  3/10]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 4.79 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_packed.vta, args=(('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (16, 8, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d_packed.vta', ('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (16, 8, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_fqpg5ysx.log.
INFO:autotvm:Get devices for measurement successfully!
[Task  4/10]  Current/Best:   31.80/  31.80 GFLOPS | Progress: (10/10) | 8.19 sINFO:autotvm:Get devices for measurement successfully!
[Task  5/10]  Current/Best:    0.00/  25.89 GFLOPS | Progress: (10/10) | 5.99 sINFO:autotvm:Get devices for measurement successfully!
[Task  6/10]  Current/Best:    0.00/  72.11 GFLOPS | Progress: (10/10) | 6.80 sINFO:autotvm:Get devices for measurement successfully!
[Task  7/10]  Current/Best:    0.00/  19.19 GFLOPS | Progress: (10/10) | 5.65 sINFO:autotvm:Get devices for measurement successfully!
[Task  8/10]  Current/Best:    0.00/   5.28 GFLOPS | Progress: (10/10) | 7.45 sINFO:autotvm:Get devices for measurement successfully!
[Task  9/10]  Current/Best:    1.21/   5.83 GFLOPS | Progress: (10/10) | 14.37 sINFO:autotvm:Get devices for measurement successfully!
[Task 10/10]  Current/Best:    0.00/   6.53 GFLOPS | Progress: (10/10) | 4.38 sINFO:autotvm:Extract 10 best records from the vta.resnet18_v1.log.tmp
Compile...
Upload...
Evaluate inference time cost...
Mean inference time (std dev): 69.65 ms (2.26 ms)

@hht @isong Thank you for sharing your solutions and issues. I'm also trying to run "tune_relay_vta.py". After downloading TVM v0.7 (released 2020-10-02), I changed TARGET from "sim" to "pynq" in 3rdparty/vta-hw/config/vta_config.json and saved the bitstream file "1x16_i8w8a32_15_15_18_17.bit" in a local directory to work around the broken download link. Then, in the build directory on the host PC, I set the LLVM path in config.cmake. Likewise, on the pynq board, I changed USE_VTA_FPGA from OFF to ON in config.cmake. There seemed to be no problems with the device connection through the tracker or with loading the bitstream file.

However, when I run "tune_relay_vta.py" (with the code @hht described commented out), the GFLOPS never increases (always 0.00/0.00 GFLOPS). Could you give me some advice? Do I need to change any other code?

Hi @thkim

If you have updated your code from the GitHub link above and applied my export_library change, then let me check a few more things.

It is not a connection issue, since you are seeing the GFLOPS messages (always 0.00/0.00 GFLOPS). Does it ever finish? How many iterations have you set? Do other tests, such as this one, that do not use AutoTVM work?

Cheers, ISS

Check dmesg on your remote device. Sometimes 0.00 GFLOPS means no bitstream was written to the FPGA.

Make sure you start the remote device server as root, because writing the bitstream requires root privileges.

@isong @hht Thank you so much for your reply. I think I found the problem. When I ran the tutorial "Deploy Pretrained Vision Model from MxNet on VTA", I could write a bitstream file to the FPGA, and it worked well. I had written vta.program_fpga(remote, bitstream="bitfiles/v4/1x16_i8w8a32_15_15_18_17.bit") in the sample code, and I verified it through dmesg.

However, when I run the tutorial "Auto-tuning a convolutional network on VTA" (tune_relay_vta_with_one_board.py), I cannot write a bitstream file to the FPGA. During tuning, it seems to call the program_fpga function of rpc_client.py every 8 iterations with bitstream="/home/taeho/.vta_cache/pynq/0_0_1/1x16_i8w8a32_15_15_18_17.bit". When I checked dmesg on my pynq board, writing the bitstream to the FPGA had failed. I couldn't find the exact reason.

For the tutorial "Deploy Pretrained Vision Model from MxNet on VTA", I ran "sudo apps/vta_rpc/start_rpc_server.sh" on the pynq. For tune_relay_vta_with_one_board.py, I ran "sudo apps/vta_rpc/start_rpc_server_to_tracker.sh" on the pynq. @hht I'm not sure whether this is the correct way to start the remote device server as root.

Again, thank you so much for your help!

Hi @thkim

Did you check whether your bit file is at /home/taeho/.vta_cache/pynq/0_0_1/1x16_i8w8a32_15_15_18_17.bit?

In my case, I copied the bit file to that directory.

I also use sudo to run the RPC server that connects to the tracker:

pushd apps/vta_rpc
sudo -E ./start_rpc_server_to_tracker.sh
popd

Hi @thkim

I use su to start the RPC server and set the corresponding environment variables:

TVM_HOME="/home/xilinx/tvm"
VTA_HW_PATH="/home/xilinx/tvm/3rdparty/vta-hw"
PYTHONPATH="/home/xilinx/tvm/python:/home/xilinx/tvm/topi/python:/home/xilinx/tvm/vta/python"
root@pynq:/home/xilinx# cat tvm/apps/vta_rpc/start_rpc_server_to_tracker.sh 
#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
PROJROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )/../../" && pwd )"

# Derive target specified by vta_config.json
VTA_CONFIG=/home/xilinx/tvm/3rdparty/vta-hw/config/vta_config.py
TARGET=$(python ${VTA_CONFIG} --target)

export PYTHONPATH=${PYTHONPATH}:${PROJROOT}/python:${PROJROOT}/vta/python
export PYTHONPATH=${PYTHONPATH}:/home/xilinx/pynq
python3 -m vta.exec.rpc_server --tracker 192.168.0.114:9190 --key $TARGET

@isong @hht Thank you so much for your response. Although I've confirmed the bit file is at /home/taeho/.vta_cache/pynq/0_0_1/1x16_i8w8a32_15_15_18_17.bit, there was still a problem. Instead, in rpc_client.py, I set bitstream to the path of the bit file I had saved in a local directory. It works now. Thanks! Yes, I also set the environment variables like you did!

I'll keep working on this, so I hope we can share more in the future. Thank you again 🙂
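
For anyone who hits the same issue: in code form, the fix amounts to giving program_fpga an explicit local path instead of letting it resolve bitstream=None from the VTA cache (a hypothetical sketch of the call; I made the equivalent change inside rpc_client.py):

    # Hypothetical sketch: pass an explicit local bitstream path so the
    # cache/download resolution used for bitstream=None is never invoked.
    vta.program_fpga(remote, bitstream="bitfiles/v4/1x16_i8w8a32_15_15_18_17.bit")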

@thkim Glad to hear your good news. I have been working on fixing the memory leak caused by RPC exceptions. VTA uses CMA memory, and CMA is managed by /dev/xlnk. Directly terminating the process releases only heap and stack memory, so it still leaks CMA memory. I have tried two workarounds but can't find an elegant solution. There are a lot of things to do and to discuss. 😃


Hi @hht

Interesting. What are your two workarounds? Would you mind sharing them?

Thank you,

@thkim, glad to hear that you sorted things out.

Cheers,
