[VTA][PYNQ] Help debugging VTA on the ZCU111

Hi all!

I am trying to run VTA on the Xilinx ZCU111 board. So far I have been able to generate the bitstream and compile the VTA runtime on the board, but I need some ideas on how to debug a problem I ran into. My steps are detailed below:

  • PYNQ image: 2.6
  • I added a zcu111 configuration in /3rdparty/vta-hw/config/pkg_config.py:
elif self.TARGET == "zcu111":
    self.fpga_device = "xczu28dr-ffvg1517-2-e"
    self.fpga_family = "zynq-ultrascale+"
    self.fpga_board = "xilinx.com:zcu111:part0"
    self.fpga_board_rev = "1.4"
    self.fpga_freq = 300
    self.fpga_per = 2
    self.fpga_log_axi_bus_width = 7
    self.axi_prot_bits = '010'
    # IP register address map
    self.ip_reg_map_range = "0x1000"
    self.fetch_base_addr = "0xA0000000"
    self.load_base_addr = "0xA0001000"
    self.compute_base_addr = "0xA0002000"
    self.store_base_addr = "0xA0003000"
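
As a quick sanity check on the register map above, here is a minimal sketch that verifies the four windows are aligned to ip_reg_map_range and do not overlap. The helper name check_reg_map and the dictionary layout are my own assumptions for illustration; they are not part of pkg_config.py.

```python
# Hypothetical helper (not part of pkg_config.py): checks that each register
# window starts on a multiple of the window size and that no two windows overlap.
def check_reg_map(base_addrs, reg_range):
    size = int(reg_range, 16)
    windows = sorted((int(addr, 16), name) for name, addr in base_addrs.items())
    for start, name in windows:
        if start % size != 0:
            raise ValueError(f"{name} base {start:#x} is not aligned to {size:#x}")
    for (start_a, name_a), (start_b, name_b) in zip(windows, windows[1:]):
        if start_a + size > start_b:
            raise ValueError(f"{name_a} overlaps {name_b}")
    # Return (name, first address, last address) for each window, lowest first.
    return [(name, start, start + size - 1) for start, name in windows]


# Values copied from the zcu111 configuration above.
zcu111_map = {
    "fetch": "0xA0000000",
    "load": "0xA0001000",
    "compute": "0xA0002000",
    "store": "0xA0003000",
}

for name, lo, hi in check_reg_map(zcu111_map, "0x1000"):
    print(f"{name:8s} {lo:#010x} - {hi:#010x}")
```

The same ranges can then be compared by eye against the address editor of the generated block design.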

I have inspected the generated Vivado project, and verified the following:

  • Timing was achieved correctly for a frequency of 300 MHz.
  • In the address ranges of the generated block design, the AXI mapping between the Zynq and the VTA modules uses the addresses configured in the Python script pkg_config.py. Note that these addresses are exactly the same as those of the ultra96 target.

Then I built the VTA runtime on the ZCU111 board, making sure that the vta_config.json file on the board is the same one that was used on the host to generate the bitstream, except that I changed the target to “ultra96”. I also made sure that the USE_VTA_FPGA option in the config.cmake file is enabled. I followed this section. The build was successful, and I was able to start the RPC server on the board.

From the host computer, I tried to run the matrix_multiply.py tutorial.

  • I changed the VTA_RPC_HOST line to add the specific IP of my board.
  • I extended the if condition that guards FPGA programming so that it also covers the case env.TARGET == “ultra96”.
  • In vta.program_fpga, I added the path to my generated bitstream file.

When I run the script on the host, the schedule compiles correctly, but the script freezes at this line:

# Invoke the module to perform the computation
f(A_nd, B_nd, C_nd)

I can see the following output in the terminal where I started the RPC server on the FPGA board:

2020-10-19 19:48:33.142 INFO bind to
2020-10-19 19:48:35.836 INFO connection from ('', 45652)
INFO:root:Skip reconfig_runtime due to same config.
INFO:root:Generating grammar tables from /usr/lib/python3.6/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.6/lib2to3/PatternGrammar.txt
INFO:root:Program FPGA with vta.bit 
INFO:root:Loading VTA library: /home/xilinx/tvm/vta/python/vta/../../../build/libvta.so
2020-10-19 19:48:42.242 INFO load_module /tmp/tmpyy0qx27i/gemm.o

I also tried to run the script vta/tests/python/integration/test_benchmark_topi_conv2d.py and hit a similar problem: all convolution measurements on the CPU worked fine, but the script froze on the first VTA measurement.

My interpretation is that the schedule is cross-compiled correctly and that the generated module is loaded correctly on the FPGA board. Are there more steps or flags I could use to debug this issue?
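
One generic way to at least turn the freeze into a reportable timeout is to run the call in a daemon thread, as in the minimal sketch below. The helper call_with_timeout is hypothetical and not a TVM or VTA API, and it assumes the blocking call releases the Python GIL, which I have not verified for the TVM RPC path.

```python
import threading


def call_with_timeout(fn, args=(), timeout_s=10.0):
    """Run fn(*args) in a daemon thread and report how it ended.

    Returns ("ok", value), ("error", exception) or ("timeout", None).
    A timed-out call keeps running in the background; being a daemon
    thread, it does not prevent the interpreter from exiting.
    """
    result = {}

    def worker():
        try:
            result["value"] = fn(*args)
        except Exception as exc:  # report the failure instead of losing it
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        return ("timeout", None)
    if "error" in result:
        return ("error", result["error"])
    return ("ok", result["value"])


# Intended use in the tutorial (needs the board, so left as a comment):
# status, _ = call_with_timeout(f, (A_nd, B_nd, C_nd), timeout_s=30.0)
# print("VTA call status:", status)
```

This does not explain why the accelerator never finishes, but it lets a test script log the hang and move on instead of blocking forever.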

Small extra test: I found this post and this post stating that this could be a coherence problem. So I tried to generate the bitstream without coherence activated (there’s a coherence flag in pkg_config.py). After generating the new bitstream, I tried to run matrix_multiply.py and test_benchmark_topi_conv2d.py again, but I found the same problem.

I was able to get it running on the ZCU111 using the HLS backend, but there seems to be a bug, because the matrix multiply test always reported errors in the matrix multiplication on VTA.

I then tried generating the bitstream with the Chisel backend: I packaged the generated SystemVerilog file as an IP block and imported it into a block design in a Vivado project in order to generate the bitstream. I also had to modify the pynq_driver.cc file to resemble the de10nano driver, so that it correctly writes the registers exposed on the master AXI interface generated by the Chisel backend (which do NOT use the same register offsets as the ones generated by the HLS backend!).

I also had to set the kBufferCoherent flag in vta/runtime/runtime.cc to false (this was already mentioned in another post, which I cannot find now, regarding VTA on an UltraScale+ architecture).

Here is a report of the tests I ran:

  • matrix_multiply.py: PASS
  • test_benchmark_gemm.py: PASS
  • test_benchmark_topi_conv2d_transpose.py: PASS with target arm_cpu, FAILED with vta
  • test_benchmark_topi_conv2d.py: PASS
  • test_benchmark_topi_dense.py: PASS
  • test_benchmark_topi_group_conv2d.py: PASS
  • deploy_classification.py: had to replace line “m.module.time_evaluator(“run”,…” with just m.run(), because there was a segmentation fault on the RPC server. PASS with device=arm_cpu (cat detected), FAILED with device=vta (cat NOT detected).
  • deploy_detection.py: compilation failed for device=arm_cpu (appears to be an LLVM issue); device=vta compiled but FAILED (hundreds of bounding boxes were found, none related to the objects in the image).

EDIT: changing from graph_executor to debug_executor in deploy_classification.py classifies the cat correctly for both device arm_cpu and vta. I will look into why I am getting different results with the two executors.
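
To compare the two executors, one option is to dump the final score vector from each run and diff them, as in the minimal sketch below. The function compare_outputs is a hypothetical helper operating on plain Python lists (for example, the output tensor converted with numpy's tolist()); it does not call any TVM API.

```python
def compare_outputs(out_a, out_b, top_k=5):
    """Compare two flat score lists from two runs of the same model.

    Returns the largest absolute element-wise difference and the
    top-k class indices of each run, highest score first.
    """
    if len(out_a) != len(out_b):
        raise ValueError("outputs have different lengths")

    max_abs_diff = max(abs(a - b) for a, b in zip(out_a, out_b))

    def top(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return order[:top_k]

    top_a, top_b = top(out_a), top(out_b)
    return {
        "max_abs_diff": max_abs_diff,
        "top_a": top_a,
        "top_b": top_b,
        "same_top1": top_a[0] == top_b[0],
    }


# Toy scores standing in for the graph_executor and debug_executor outputs.
report = compare_outputs([0.1, 0.7, 0.2], [0.1, 0.6, 0.3], top_k=2)
print(report)
```

A large max_abs_diff with matching top-1 classes would point to numerical drift, while a different top-1 class points to a functional difference between the two runs.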
