- You can utilise multiple CUs by creating multiple TVM modules (`graph_runtime.GraphModule`) and running them in separate threads. I thought this was what you were doing based on the discussion here: https://discuss.tvm.apache.org/t/re-re-vitis-ai-integration-multi-thread-c-application-hang/9415/6. Alternatively, you can increase the batch size to the number of CUs so that a single module makes use of all of them.
- Same answer as 1.
- You can create a separate TVM GraphModule for each model.
- In the TVM - Vitis AI flow, the CPU is waiting for the DPU to return results. I guess the CPU is utilised at 100% because TVM will use as many resources as it can get. However, you could limit the number of threads used by TVM (Limit CPU cores for Auto tuned Model - #3 by sol401430). Btw, note that the heterogeneous TVM CPU - Vitis AI flow is different from the pure GPU flow I expect you are using.
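To illustrate the one-module-per-thread pattern from the first bullet, here is a minimal Python sketch. `build_module` is a hypothetical stand-in for constructing a `graph_runtime.GraphModule` (the real call depends on your compiled library and device), so the sketch runs without TVM installed; the threading structure is the part that carries over:

```python
import threading

def build_module():
    # Hypothetical stand-in: in the real flow you would create one
    # graph_runtime.GraphModule per thread here, so each CU gets its
    # own runtime instance instead of sharing one across threads.
    return lambda data: data  # placeholder "module" that echoes input

results = {}

def worker(cu_id, data):
    module = build_module()        # one module per thread/CU
    results[cu_id] = module(data)  # set_input / run / get_output in real code

threads = [threading.Thread(target=worker, args=(i, i * 10)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The important point is that the module is created inside each thread rather than shared, which mirrors the multi-CU setup discussed in the linked thread.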
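On limiting TVM's CPU usage (third bullet): one common way to cap the threads TVM's CPU thread pool spawns is the `TVM_NUM_THREADS` environment variable. A minimal sketch, assuming you set it before `tvm` is first imported in the process (the value `2` is just an example):

```python
import os

# TVM's runtime thread pool reads TVM_NUM_THREADS when it starts, so
# set it before importing tvm anywhere in this process.
os.environ["TVM_NUM_THREADS"] = "2"

# import tvm  # must come after the environment variable is set
```

Setting the variable in the launching shell (`TVM_NUM_THREADS=2 python app.py`) achieves the same thing and avoids ordering concerns.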
Overall, I think the documentation is lacking in this respect and we will try to improve that. Additionally, there are some issues on certain platforms (like the multithreaded DPU hanging issue on Pynq) that we will try to get resolved. As mentioned earlier, we will be moving to the Vitis AI VART flow shortly and will add more documentation and/or examples on this at the same time.