@liangfu and @thierry, thanks for your input. I have been able to build the Chisel code for the Pynq-Z1, and I have tried to test it with the existing matrix_multiply.py
example code after reducing the size of the input matrices. It runs, but there are a couple of issues.
What I have done so far:
- I had to reduce instQueueEntries in core/Configs.scala, since the design did not fit on the Z1. I believe the DE10-Nano, being a bigger board, would not have the same issue.
- I had to reduce the matrix size from 16x16 to 8x8 (BLOCKIN and BLOCKOUT from 16 to 8). This was mostly due to timing constraints; I don't know how to customize fpga_clk (that is, increase the clock period from the existing 10 ns).
- Changed the VTADevice class in pynq_driver.cc to be similar to the one in de10nano_driver.cc, as the Chisel code is being used instead of the existing HLS-generated hardware.
The issue we are facing is that the hardware only occasionally returns a result. When it does return, the test passes. At this point I am not sure exactly what is happening. It seems that the hardware gets stuck and never sets the finish flag, but I am not sure where it is getting stuck. I would like to hear your opinion on this.
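
To tell a genuine hang apart from a run that is just slow, a bounded poll that records how many reads it took and what the last status value was might help. Below is a minimal Python sketch of that idea, not the actual driver code: `read_reg`, `FINISH_BIT`, `wait_for_finish` and `fake_device` are all hypothetical names, and the finish bit position (0x2) is an assumption that should be checked against the real register map.

```python
import time

# Assumed bit position of the finish flag; verify against the actual register map.
FINISH_BIT = 0x2


def wait_for_finish(read_reg, max_polls=1000, sleep_s=0.0):
    """Poll a status register until the finish bit is set or max_polls is reached.

    read_reg is a hypothetical zero-argument callable that returns the current
    status register value as an int.
    Returns a tuple (finished, polls_used, last_status).
    """
    status = 0
    for polls in range(1, max_polls + 1):
        status = read_reg()
        if status & FINISH_BIT:
            return True, polls, status
        if sleep_s:
            time.sleep(sleep_s)
    return False, max_polls, status


def fake_device(finish_after):
    """Stand-in for the hardware: sets the finish bit on the finish_after-th
    read, or never if finish_after is None (simulating a stuck device)."""
    reads = {"n": 0}

    def read_reg():
        reads["n"] += 1
        if finish_after is not None and reads["n"] >= finish_after:
            return FINISH_BIT
        return 0

    return read_reg


if __name__ == "__main__":
    print(wait_for_finish(fake_device(3), max_polls=10))     # device that finishes
    print(wait_for_finish(fake_device(None), max_polls=10))  # device that hangs
```

On a timeout, logging the last status value (and, if the design exposes them, any other readable status registers) could narrow down which stage stopped making progress.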