Thanks for the reply. I have not had a chance to run on the FPGA board yet; I only ran the Chisel-based cycle-accurate simulator. From your description, I think the reason is that the Chisel-based design is not 100% the same as the Vivado HLS one. I was running the conv2D layer from the tutorial: https://docs.tvm.ai/vta/tutorials/optimize/convolution_opt.html#sphx-glr-vta-tutorials-optimize-convolution-opt-py. I would expect a 10-20% throughput difference from the optimal case, but a 4x slowdown suggests that something went wrong.