MicroTVM for custom DSP

andrew_sto · July 15, 2021, 7:34pm

I’m trying to use TVM to produce inference code for a custom DSP chip. The chip has a multi-core architecture with a memory hierarchy, comes with a C++ compiler, and supports OpenOCD. For now I’m writing new computes/schedules for conv2d that take into account DMA transfers to temporary SRAM buffers as well as assembly intrinsics for SIMD. I’ll write the TVM Compiler and Flasher at some point later; I’m focusing on generating compatible C code for now. I overrode the Relay conv2d NCHW compute and I’m currently writing the schedule.

For DMA I’m planning on using the approach detailed in the VTA tutorial: adding “copy” compute ops, computed in one of the inner loops, and then having them pattern-matched to DMA call intrinsics.

For multi-core computation I do not currently have an idea how to proceed.

Is the above approach the correct way to do DMA? How would one overlap DMA transfers with compute code?
Any ideas on how to do multi-core with IPC? Currently, calling .parallel on an axis in the schedule only moves the allocation of a temporary buffer around; it does not produce any OpenMP or pthread calls. I was thinking I could just produce C code for one core and, once I’m happy with it, parallelize by hand.
How would you go through the produced schedule to check the total memory size per layer? I especially need the memory footprint of the SRAM buffers, since that memory is quite limited.

I am aware of the MicroTVM M2 RFC, but I can’t tell whether any progress has been made on it.

Thanks, Andrei

areusch · November 8, 2021, 1:43am

@andrew_sto Apologies for missing this post.

First, you might take a look at the Project API, which has now landed and makes Compiler/Flasher obsolete.

Your DMA approach should work, but I don’t think we ultimately want to take the dma_copy pragma approach in main for the C backend. We currently don’t model DMA in the TVM C runtime. In the C++ runtime, we can leverage it in a limited capacity via the Device API. We do have a concept, storage_scope, which can be used in a limited fashion to model different memory scopes. In [RFC] Unified Static Memory Planning, @manupa-arm is implementing modeling of memories, and this modeling will use the same identification method as storage_scope. Following this change, it should then be possible to describe memory-to-memory transfers and DMA properly with the C Device API.
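On the question of overlapping DMA with compute: one common technique, not specific to TVM, is double buffering (ping-pong), where the load of tile i+1 is issued before the compute on tile i. The sketch below is plain Python that only models the issue order; run_double_buffered, dma_load and compute are hypothetical names, and on real hardware dma_load would be a non-blocking start followed by a wait before the buffer is used.

```python
def run_double_buffered(tiles, dma_load, compute):
    """Model the issue order of a two-buffer (ping-pong) pipeline.

    dma_load and compute are hypothetical callbacks. Python runs them
    sequentially; this only shows the order in which work is issued.
    """
    buffers = [None, None]   # stands in for two SRAM buffers
    results = []
    log = []                 # records the order of issued operations
    if not tiles:
        return results, log

    # Prologue: fetch the first tile before entering the loop.
    buffers[0] = dma_load(tiles[0])
    log.append(("load", 0))

    for i in range(len(tiles)):
        cur = i % 2
        nxt = (i + 1) % 2
        if i + 1 < len(tiles):
            # On hardware: start a non-blocking DMA into the other buffer.
            buffers[nxt] = dma_load(tiles[i + 1])
            log.append(("load", i + 1))
        # On hardware: wait for the DMA of tile i to finish, then compute.
        results.append(compute(buffers[cur]))
        log.append(("compute", i))
    return results, log
```

Called with three tiles, the log shows each load being issued ahead of the compute that precedes it in the pipeline. The cost is that every DMA-fed buffer is allocated twice in SRAM.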

For multi-core, the [pre-RFC] C Device API is the best way to do this once it lands. We don’t currently have good support for it, but you might be able to do something by leveraging .parallel() and implementing your own launch hook using tir.call_extern. It would be a bit of a hack.
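Whichever launch mechanism ends up being used, each core needs to know which slice of the parallel axis it owns. Here is a minimal sketch of a static partition in plain Python; core_range is a hypothetical helper, not a TVM function, and it assumes every core runs the same code with its own core_id.

```python
def core_range(extent, num_cores, core_id):
    """Return the half-open range [begin, end) of the axis owned by core_id.

    Hypothetical helper: splits `extent` iterations as evenly as possible,
    giving one extra iteration to each of the first `extent % num_cores` cores.
    """
    base, rem = divmod(extent, num_cores)
    begin = core_id * base + min(core_id, rem)
    end = begin + base + (1 if core_id < rem else 0)
    return begin, end


# Example: 10 output channels split over 4 cores.
ranges = [core_range(10, 4, c) for c in range(4)]
```

The same arithmetic is easy to write by hand in the generated C if the single-core code is parallelized manually.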

A per-layer memory footprint is something we want to provide, but currently we can only provide scratchpad information. Once USMP lands, we will provide this as a compiler output.
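Until that is available, the SRAM footprint of a tiled conv2d can be estimated by hand from the tile shapes. The sketch below is plain Python arithmetic, not a TVM API; conv2d_tile_sram_bytes and all of its parameters are assumptions for illustration. It assumes an NCHW layout, no dilation, input and weight tiles that are double-buffered for DMA, and a single output tile.

```python
def conv2d_tile_sram_bytes(tile_oc, tile_oh, tile_ow, ic, kh, kw,
                           stride=1, in_bytes=1, w_bytes=1, out_bytes=4,
                           double_buffer=True):
    """Estimate SRAM bytes needed for one conv2d tile (hypothetical helper)."""
    # Input patch needed to produce a tile_oh x tile_ow output tile.
    in_h = (tile_oh - 1) * stride + kh
    in_w = (tile_ow - 1) * stride + kw
    input_tile = ic * in_h * in_w * in_bytes
    weight_tile = tile_oc * ic * kh * kw * w_bytes
    output_tile = tile_oc * tile_oh * tile_ow * out_bytes
    # Double buffering doubles the DMA-fed buffers only.
    factor = 2 if double_buffer else 1
    total = factor * (input_tile + weight_tile) + output_tile
    return {"input": input_tile, "weight": weight_tile,
            "output": output_tile, "total": total}


# Example: int8 inputs and weights, int32 accumulators, 3x3 kernel.
sizes = conv2d_tile_sram_bytes(tile_oc=16, tile_oh=8, tile_ow=8,
                               ic=32, kh=3, kw=3)
```

For this example the estimate is 3200 bytes of input, 4608 of weights and 4096 of output, or 19712 bytes in total with double buffering.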

Hope this helps. Please feel free to follow up with any new concerns you might have!

Thanks,

Andrew