Zero-copy memory transfer between FPGA and CPU?

Hi @jtuyls / @mak,

In the TVM-Vitis workflow on the ZCU104 platform, what is the memory read/write latency between the FPGA and the CPU? Is it zero-copy, or do you copy and rearrange memory on the FPGA side?

Could you please point us to any reference documentation on this?

Thanks and Regards, Raju

@jtuyls / @mak,

Also, why are the CPU cores running at 100% while the model is executing on the FPGA? CPU usage should be close to zero, right? Since we are offloading the workload to the FPGA, the CPU cores should be idle. Please correct me if I am missing something here.

In the case of the Jetson Nano, we noticed zero CPU usage while the model was running on the NVIDIA GPU.

Thanks and Regards, Raju

Hi @jtuyls / @mak,

Any updates on this? It looks like the current TVM + Vitis workflow is not production-ready. Am I missing something here?

I ask because, latency-wise, I see the following bottlenecks in the current workflow:

  1. Only one CU is being utilised, and we don't know how to use the second one.
  2. We are not sure how to use both CUs in parallel.
  3. We are not sure how to load different models onto different CUs.
  4. When we offload the accelerated part to the FPGA, all the ARM CPU cores are still 100% utilised; the expectation is near-zero CPU usage once the processing is offloaded to the FPGA.

Thanks and Regards, Raju

@kvaju.454

  1. You can utilise multiple CUs by creating multiple TVM modules (graph_runtime.GraphModule) and running them in separate threads; see the first sketch after this list. I thought this was what you were doing, based on the discussion here: https://discuss.tvm.apache.org/t/re-re-vitis-ai-integration-multi-thread-c-application-hang/9415/6. Alternatively, you can increase the batch size to the number of CUs to make use of multiple CUs.
  2. Same answer as 1.
  3. You can create a separate TVM GraphModule for each model.
  4. In the TVM - Vitis AI flow, the CPU is busy waiting for the DPU to return results. I suspect the CPU shows 100% utilisation because TVM will use as many resources as it can. However, you can limit the number of threads used by TVM (Limit CPU cores for Auto tuned Model - #3 by sol401430); see the second sketch after this list. By the way, note that the heterogeneous TVM CPU - Vitis AI flow is different from the pure GPU flow I expect you are using.
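
To make item 1 concrete, here is a minimal sketch of driving two CUs by running one GraphModule per CU from separate threads. It is a sketch under assumptions, not a verified recipe: the library name `model.so`, the input name `data`, and the input shape are placeholders, and it assumes the Vitis AI runtime dispatches each module's DPU workload to a free CU.

```python
# Hypothetical sketch: one GraphModule per DPU compute unit (CU), each
# driven from its own thread. File name, input name, and input shape are
# placeholders for your own compiled model.
import threading

import numpy as np
import tvm
from tvm.contrib import graph_runtime


def run_on_cu(lib_path, input_name, input_data, results, idx):
    # Each thread loads its own module instance; the Vitis AI runtime is
    # assumed to route each instance's DPU work to a free CU.
    lib = tvm.runtime.load_module(lib_path)
    mod = graph_runtime.GraphModule(lib["default"](tvm.cpu()))
    mod.set_input(input_name, input_data)
    mod.run()
    results[idx] = mod.get_output(0).asnumpy()


data = np.random.rand(1, 3, 224, 224).astype("float32")  # placeholder input
results = [None, None]
threads = [
    threading.Thread(target=run_on_cu, args=("model.so", "data", data, results, i))
    for i in range(2)  # one thread per CU
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

For item 3, the same pattern applies with a different library path per thread, so each GraphModule wraps a different compiled model.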
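
For item 4, one standard knob is the `TVM_NUM_THREADS` environment variable, which caps the size of TVM's runtime thread pool. A minimal sketch, assuming the variable is set before the runtime starts:

```python
import os

# TVM reads TVM_NUM_THREADS when its runtime thread pool is created, so set
# it before importing tvm (or export it in the shell that launches the app).
os.environ["TVM_NUM_THREADS"] = "1"  # leave the remaining ARM cores free

import tvm  # imported after setting the variable on purpose
```

Note this only caps TVM's CPU worker threads; the thread that blocks waiting on the DPU may still show as busy, depending on how the wait is implemented.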

Overall, I think the documentation is lacking in this respect, and we will try to improve that. Additionally, there are some issues on certain platforms (like the multithreaded DPU hanging issue on Pynq) that we will try to get resolved. As mentioned earlier, we will be moving to the Vitis AI VART flow shortly, and we will add more documentation and/or examples on this at the same time.

Thanks for taking the time to reply on this.

I am looking forward to seeing the Vitis AI VART workflow released as soon as possible.

Thanks and Regards, Raju