New backend for microTVM?

Hello,

we have a large system composed of a mesh of processing elements, each with an ARM processor and supporting ML accelerators. We already have a software stack that handles graph partitioning and mapping for other, unrelated tasks, but we want it to support TVM as well. However, although the system as a whole is powerful, its processing elements have the same local limitations as the bare-metal devices covered by microTVM.

I would like to ask for your feedback on where you think this system would fit, and I would greatly appreciate any recommendations regarding the inclusion of such a backend.

Thanks for your time.

Hector G

hi @hagonzalezdvb, thanks for posting your question! It sounds like a fairly interesting system. It’s hard to say exactly without knowing more details, but it seems like:

  • If each processing element didn’t need to worry about any accelerators (e.g. was just a single-core ARM CPU), you could simply model each element as a single Relay model and wrap our GraphExecutor or (soon) AotExecutor in such a way as to be compatible with your runtime. Since you already have a partitioner and mapper, I presume you also have some runtime component which can coordinate the cores (e.g. load code and tensors and dispatch tasks).
  • Should each processing element have additional CPUs or accelerators, you can use the same approach, but the microTVM side gets a bit more complex. This side isn’t fully implemented yet. See the [pre-RFC] C Device API thread for more on supporting generically heterogeneous systems from the C runtime. However, if you just have a single CPU with an accelerator and you want to synchronously offload compute to the accelerator, you could probably take the same approach being used for the Ethos-U accelerator (e.g. use tir.call_extern to invoke the driver directly from TVM).
  • Apart from the TVM RPC system, if memory serves we don’t have a runtime component right now which could coordinate all of the various cores in your system.
  • You might also look at the pipelined GraphExecutor work done by @hjiang.
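To make the tir.call_extern idea concrete, here is a minimal C sketch of what the accelerator side could look like. All names and signatures here are assumptions for illustration (they are not part of TVM or any vendor driver API); the only fixed point is that tir.call_extern emits a call to a C symbol you provide, which conventionally returns 0 on success:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical accelerator driver entry point. The name and signature
 * are placeholders invented for this sketch. To keep the example
 * self-contained, it just computes on the CPU. */
static void my_accel_mul(const float *in, const float *weights,
                         float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * weights[i];
    }
}

/* A tir.call_extern target: TVM-generated code would call a C symbol
 * like this one. Returning 0 signals success, nonzero signals failure. */
int32_t pe_accel_offload(const float *in, const float *weights,
                         float *out, size_t n) {
    if (in == NULL || weights == NULL || out == NULL) {
        return -1;
    }
    /* Synchronous offload: block until the "accelerator" finishes. */
    my_accel_mul(in, weights, out, n);
    return 0;
}
```

Since your accelerators are already wrapped in C, the wrapper body would simply forward to your existing driver calls instead of the CPU loop above.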

Let me know if this helps.

-Andrew

Hello @hagonzalezdvb

You can check my post: Can TVM split work into different layers, and assign layers into different cores? - Apache TVM Discuss. For instance, I am now able to run a 4-layer neural network by assigning the first 2 layers to 3 big cores, the third layer to another big core, and the last layer to 4 small cores.
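The layer-to-core-group assignment described above can be sketched as a static mapping table. This is purely illustrative (the struct and names are invented for this sketch, not taken from the pipelined GraphExecutor implementation), but it shows the shape of the mapping:

```c
/* Illustrative mapping of a 4-layer network onto core groups,
 * mirroring the assignment described in the post: layers 0-1 on
 * 3 big cores, layer 2 on 1 big core, layer 3 on 4 small cores. */
typedef struct {
    int first_layer;       /* first layer index in this pipeline stage */
    int last_layer;        /* last layer index (inclusive) */
    int num_cores;         /* cores assigned to the stage */
    const char *core_type; /* "big" or "small" */
} PipelineStage;

static const PipelineStage stages[] = {
    { 0, 1, 3, "big"   },
    { 2, 2, 1, "big"   },
    { 3, 3, 4, "small" },
};

/* Returns the stage index that owns a given layer, or -1 if unmapped. */
int stage_for_layer(int layer) {
    int n = (int)(sizeof(stages) / sizeof(stages[0]));
    for (int i = 0; i < n; ++i) {
        if (layer >= stages[i].first_layer && layer <= stages[i].last_layer) {
            return i;
        }
    }
    return -1;
}
```

Each stage then runs as its own subgraph, with tensors handed off between stages at the subgraph boundaries.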

Hung-Yang

Hello @areusch

thanks for your prompt response and your helpful suggestions. Regarding your comments:

  • Our processing element has one accelerator, and it is treated as an extension of the ARM functionality via C wrappers, which means your idea of a single Relay model per PE could be a great initial approach for us to try. Yes, we have a runtime component that takes care of core coordination.
  • This second option is also worth keeping in mind, since our system can face that situation too: the accelerators can operate independently of the cores.
  • Regarding the TVM RPC system, I was thinking that we would need to allocate certain PEs across the machine to provide RPC management for a subset of PEs. I am not sure if that makes sense to you. To give an idea of the scale, a single deployment board in our system would have 7296 PEs.
  • I will look into this pipelined GraphExecutor.
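The idea of dedicating some PEs as RPC managers for subsets of PEs can be sketched with simple index arithmetic. The 7296 PEs per board comes from the thread; the group size of 64 and the convention that the first PE of each group acts as manager are arbitrary assumptions for this sketch:

```c
#include <stdint.h>

#define PES_PER_BOARD     7296  /* from the thread */
#define PES_PER_RPC_GROUP 64    /* arbitrary choice for illustration */

/* Manager PE index for a given worker PE: by convention here,
 * the first PE of its group serves as the RPC manager. */
uint32_t rpc_manager_for(uint32_t pe) {
    return (pe / PES_PER_RPC_GROUP) * PES_PER_RPC_GROUP;
}

/* Number of manager PEs needed to cover one board. */
uint32_t rpc_manager_count(void) {
    return (PES_PER_BOARD + PES_PER_RPC_GROUP - 1) / PES_PER_RPC_GROUP;
}
```

At this scale the group size becomes a tuning knob: larger groups mean fewer PEs sacrificed to management, smaller groups mean less fan-out per manager.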

Thanks for your kind help.

Hector G

Hello @popojames

Thank you very much for sharing your post with me. It looks very similar to what we want to achieve with TVM. If I understand correctly, the benefit of using subgraphs over a single Relay model per PE would be that with subgraphs you can partition one network to be processed by multiple PEs, whereas with a single Relay model per PE you would generate individual graphs from individual models. Is that right @areusch @popojames?

Thanks to both of you for your kind help. Hector G

great. @popojames’ suggestion looks pretty good while we work out the issues with the C Device API and heterogeneous execution in the C runtime. We do not yet have an RFC which directly addresses heterogeneous execution, but we would very much appreciate both of your comments on it when it is raised!