[VTA][PYNQ] Help debugging VTA on the ZCU111

Hi all!

I am trying to run the VTA on the ZCU111 Xilinx board. So far, I was able to generate the bitstream and compile the VTA runtime on the board, but I need some ideas on how to debug a problem I ran into. I will detail my steps here:

  • PYNQ image: 2.6
  • I added a zcu111 configuration in /3rdparty/vta-hw/config/pkg_config.py:
...
elif self.TARGET == "zcu111":
    self.fpga_device = "xczu28dr-ffvg1517-2-e"
    self.fpga_family = "zynq-ultrascale+"
    self.fpga_board = "xilinx.com:zcu111:part0"
    self.fpga_board_rev = "1.4"
    self.fpga_freq = 300
    self.fpga_per = 2
    self.fpga_log_axi_bus_width = 7
    self.axi_prot_bits = '010'
    # IP register address map
    self.ip_reg_map_range = "0x1000"
    self.fetch_base_addr = "0xA0000000"
    self.load_base_addr = "0xA0001000"
    self.compute_base_addr = "0xA0002000"
    self.store_base_addr = "0xA0003000"
...
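
As a quick sanity check that the new entry is actually picked up, the configuration can be queried from Python (a minimal sketch; it assumes vta_config.json has been switched to the matching target):

import vta

# vta.get_env() reads the active vta_config.json and exposes the hardware parameters.
env = vta.get_env()
print(env.TARGET)                              # target string, e.g. "zcu111" (or "ultra96" on the board)
print(env.BATCH, env.BLOCK_IN, env.BLOCK_OUT)  # GEMM core dimensions; must match the bitstream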

I have inspected the generated Vivado project, and verified the following:

  • Timing closure was achieved for a frequency of 300 MHz.
  • In the generated block design's address map, I can see that the addresses configured in the Python script pkg_config.py are correctly applied to the AXI mapping between the Zynq and the VTA modules. Note that these addresses are exactly the same as for the ultra96 target.

Then, I built the VTA runtime on the ZCU111 board, making sure that the vta_config.json file on the board is the same as the one used on the host to generate the bitstream, except that I changed the target to “ultra96”. I also made sure that the USE_VTA_FPGA option in config.cmake is enabled. I followed this section. The build was successful, and I was able to start the RPC server on the board.
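
For reference, the config match is checked again on every RPC session: the tutorials call vta.reconfig_runtime(remote), which is what later prints "Skip reconfig_runtime due to same config." in the server log. A minimal host-side sketch (the IP address is just a placeholder for the board's address):

import vta
from tvm import rpc

# Connect to the RPC server running on the board (placeholder address/port).
remote = rpc.connect("192.168.0.2", 9091)

# Push the host-side VTA config to the board; the server skips the rebuild
# if it already matches the config the runtime was built with.
vta.reconfig_runtime(remote)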

From the host computer, I tried to run the matrix_multiply.py tutorial.

  • I changed the VTA_RPC_HOST line to the IP address of my board.
  • I added a branch to the if statement that programs the FPGA, for the case env.TARGET == "ultra96".
  • In vta.program_fpga, I passed the path to my generated bitstream file (roughly as sketched below).
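
The modified section of the tutorial looks roughly like the following (a sketch rather than the exact diff; the host IP and bitstream path are placeholders for my setup):

from tvm import rpc
import vta

env = vta.get_env()
host = "192.168.0.2"   # placeholder: IP of the ZCU111 board
port = 9091

if env.TARGET in ["pynq", "de10nano", "ultra96"]:
    # Connect to the RPC server, sync the runtime config, then flash the FPGA
    # with the bitstream generated for the ZCU111.
    remote = rpc.connect(host, port)
    vta.reconfig_runtime(remote)
    vta.program_fpga(remote, bitstream="/path/to/zcu111_vta.bit")
elif env.TARGET in ["sim", "tsim"]:
    remote = rpc.LocalSession()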

When running the script on the host, the schedule compiles correctly, but the script freezes at the line:

# Invoke the module to perform the computation
f(A_nd, B_nd, C_nd)
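
For context, the surrounding tutorial code (paraphrased from memory, so treat the variable names as approximate) uploads the cross-compiled module to the board, loads it, allocates the packed arrays on the VTA device, and then invokes it, which is where it hangs:

import numpy as np
import tvm
from tvm.contrib import utils

# Save the cross-compiled module, ship it to the board, and load it remotely.
temp = utils.tempdir()
my_gemm.save(temp.relpath("gemm.o"))
remote.upload(temp.relpath("gemm.o"))
f = remote.load_module("gemm.o")   # matches "load_module /tmp/.../gemm.o" in the server log

# Allocate the packed input/output buffers on the remote VTA device.
ctx = remote.ext_dev(0)
A_nd = tvm.nd.array(A_packed, ctx)
B_nd = tvm.nd.array(B_packed, ctx)
C_nd = tvm.nd.array(np.zeros(C_shape).astype(C.dtype), ctx)

# Invoke the module to perform the computation -- this is the call that never returns.
f(A_nd, B_nd, C_nd)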

I can see the following output in the terminal where I started the RPC server on the FPGA board:

2020-10-19 19:48:33.142 INFO bind to 0.0.0.0:9091
2020-10-19 19:48:35.836 INFO connection from ('192.168.0.1', 45652)
INFO:root:Skip reconfig_runtime due to same config.
INFO:root:Generating grammar tables from /usr/lib/python3.6/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.6/lib2to3/PatternGrammar.txt
INFO:root:Program FPGA with vta.bit 
INFO:root:Loading VTA library: /home/xilinx/tvm/vta/python/vta/../../../build/libvta.so
2020-10-19 19:48:42.242 INFO load_module /tmp/tmpyy0qx27i/gemm.o

I also tried to run vta/tests/python/integration/test_benchmark_topi_conv2d.py and hit a similar problem: all convolution measurements on the CPU worked fine, but as soon as the VTA measurements started, the script froze on the first one.

My interpretation is that the schedule is correctly cross-compiled and that the generated module is correctly loaded on the FPGA board. Are there more steps/flags available to debug this issue?

Small extra test: I found this post and this post stating that this could be a coherence problem. So I tried to generate the bitstream with coherence disabled (there’s a coherence flag in pkg_config.py). After generating the new bitstream, I ran matrix_multiply.py and test_benchmark_topi_conv2d.py again, but I hit the same problem.


So, I was able to get it running on the ZCU111 using the HLS backend, but there seems to be a bug: the matrix multiply test always reported errors in the results computed by VTA.

I then tried to generate the bitstream using the Chisel backend: I took the generated SystemVerilog file, packaged it as an IP block, and imported it into a block design in a Vivado project to generate the bitstream. I also had to modify the pynq_driver.cc file to resemble the de10nano driver, in order to correctly write the registers exposed on the AXI interface generated by the Chisel backend (which are NOT at the same register offsets as the ones generated by the HLS backend!).

I also had to set the kBufferCoherent flag in vta/runtime/runtime.cc to false (this was already mentioned in another post, which I cannot find now, regarding VTA on an UltraScale+ architecture).

Here is a report of the tests I executed:

  • matrix_multiply.py: PASS
  • test_benchmark_gemm.py: PASS
  • test_benchmark_topi_conv2d_transpose.py: PASS with target arm_cpu, FAILED with vta
  • test_benchmark_topi_conv2d.py: PASS
  • test_benchmark_topi_dense.py: PASS
  • test_benchmark_topi_group_conv2d.py: PASS
  • deploy_classification.py: I had to replace the line "m.module.time_evaluator("run",…" with just m.run(), because there was a segmentation fault on the RPC server (see the sketch after this list). PASS with device=arm_cpu (cat detected), FAILED with device=vta (cat NOT detected).
  • deploy_detection.py: compilation failed for device=arm_cpu (appears to be an LLVM issue); device=vta compiled but FAILED (hundreds of bounding boxes were found, none related to the objects in the image).
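
For the deploy_classification.py change mentioned above, the replacement is essentially the following (a sketch; variable names follow the tutorial as I remember it, and numpy is assumed imported as np):

# Original: times the run via the RPC time evaluator (this is what segfaulted for me)
# timer = m.module.time_evaluator("run", ctx, number=4, repeat=3)
# tcost = timer()

# Workaround: run once without the time evaluator, then read the prediction back.
m.run()
top1 = np.argmax(m.get_output(0).asnumpy())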

EDIT: changing from graph_executor to debug_executor in deploy_classification.py classifies the cat correctly for both device=arm_cpu and device=vta. I will be looking into why I get different results with the two executors.
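
For reference, the executor swap I mean is roughly this (a sketch; it assumes the graph/lib/ctx variables from the tutorial and a TVM version that already uses the *_executor names):

# Graph executor (what the tutorial uses):
from tvm.contrib import graph_executor
m = graph_executor.create(graph, lib, ctx)

# Debug executor (runs the graph op by op; this is the one that classifies correctly for me):
from tvm.contrib.debugger import debug_executor
m = debug_executor.create(graph, lib, ctx)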


It’s alive!

I will document here what I did, in case someone encounters the same issue:

I managed to make it work by connecting the AXI master port of the VTA IP block to the ACP port of the Zynq: the problem indeed seems to have been a coherence problem, which can be solved by using the ACP port. However, these two ports cannot be connected directly, because the ACP has a lot of limitations, so I used this repository, which provides a module that partitions AXI requests and generates ACP-compliant requests.

IMPORTANT: the ACP transactions need to have:

  • AxCACHE = “1111”
  • AxPROT = “110”
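
I believe these correspond to the axi_*_bits fields in pkg_config.py (see the zcu111 entry above), although I am not certain the Chisel flow actually picks them up from there, so double-check against your block design. A sketch, assuming an axi_cache_bits field exists alongside axi_prot_bits:

# In the zcu111 elif branch of pkg_config.py (assumption: axi_cache_bits exists
# next to axi_prot_bits and is what ends up driving AxCACHE/AxPROT):
self.axi_cache_bits = '1111'   # bufferable, cacheable, read/write-allocate
self.axi_prot_bits = '110'     # was '010' above; the ACP requires '110'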

Hi, and thank you for coming back to share your workaround. I have been going through the same issues for a long time now.

Are you using the Chisel or the HLS backend?

Could you please help by providing the configuration you used for the ACP adapter, or possibly the whole block design?

I’m also having a hard time testing the VTA with the ZCU board. May I know which TVM version you’re using?