I’m trying to use TVM to generate inference code for a custom DSP chip. The chip has a multi-core architecture with a memory hierarchy, ships with a C++ compiler, and supports OpenOCD. For now I’m writing new computes/schedules for conv2d that take into account DMA transfers to temporary SRAM buffers as well as assembly intrinsics for SIMD. I’ll write the TVM Compiler and Flasher later; for now I’m focusing on generating compatible C code. I have overridden the Relay conv2d NCHW compute and am currently writing the schedule.
For DMA I’m planning to use the approach detailed in the VTA tutorial: adding “copy” compute ops, computed in one of the inner loops, and then having them pattern-matched to DMA call intrinsics.
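To make the DMA part more concrete, here is a rough sketch of the loop structure I would eventually like the generated C code to follow: a ping-pong (double-buffered) scheme where the copy of the next tile is launched before the current tile is processed. It is plain Python, not TVM code, and `dma_start`, `dma_wait` and `compute` are hypothetical stand-ins for the chip’s intrinsics and for the real kernel (here the "DMA" is just a synchronous list copy and the "kernel" is a sum).

```python
# Ping-pong (double-buffered) tiling, simulated in plain Python.
# dma_start / dma_wait are hypothetical placeholders for real DMA intrinsics.

def dma_start(src, offset, size, dst, log):
    """Stand-in for launching an asynchronous copy into an SRAM buffer.
    In this simulation the copy simply happens immediately."""
    dst[:size] = src[offset:offset + size]
    log.append(f"dma_start {offset}")

def dma_wait(offset, log):
    """Stand-in for blocking until the transfer of the tile at `offset` is done."""
    log.append(f"dma_wait {offset}")

def compute(tile, size):
    """Stand-in for the real kernel: just sum the valid part of the tile."""
    return sum(tile[:size])

def run_pingpong(data, tile_size):
    """Process `data` tile by tile using two alternating SRAM buffers."""
    log, results = [], []
    offsets = list(range(0, len(data), tile_size))
    if not offsets:
        return results, log
    buffers = [[0] * tile_size, [0] * tile_size]

    def size_at(off):
        # The last tile may be shorter than tile_size.
        return min(tile_size, len(data) - off)

    # Prologue: fetch the first tile into buffer 0.
    dma_start(data, offsets[0], size_at(offsets[0]), buffers[0], log)
    for i, off in enumerate(offsets):
        # The current tile must have landed before it is used.
        dma_wait(off, log)
        # Launch the transfer of the next tile into the other buffer ...
        if i + 1 < len(offsets):
            nxt = offsets[i + 1]
            dma_start(data, nxt, size_at(nxt), buffers[(i + 1) % 2], log)
        # ... so that on real hardware it would overlap with this compute.
        results.append(compute(buffers[i % 2], size_at(off)))
        log.append(f"compute {off}")
    return results, log

results, log = run_pingpong(list(range(10)), tile_size=4)
print(results)
print(log)
```

In the log, `dma_start 4` appears before `compute 0`, which is the ordering I would like the schedule to produce; whether the “copy” op approach can express this is part of what I’m asking.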
For multi-core computation I don’t yet know how to proceed.
- Is the above approach the correct way to do DMA? How would one overlap DMA transfers with compute code?
- Any ideas on how to do multi-core with IPC? Currently, calling .parallel on an axis in the schedule only moves the allocation of a temporary buffer around; it does not produce any OpenMP or pthread calls. I was thinking I could just generate C code for one core and, once I’m happy with it, parallelize by hand.
- How would you walk through the produced schedule to check the total memory size per layer? I especially need the memory footprint of the SRAM buffers, since that memory is quite limited.
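For reference, this is the kind of per-tile SRAM arithmetic I currently do by hand and would like to extract from the schedule instead. It is only a sketch under my own assumptions: the function name, the tile parameters, and the choice to multiply every buffer by the number of ping-pong copies are mine, not anything TVM provides.

```python
from math import prod

# Bytes per element for the data types I care about (assumed set).
DTYPE_BYTES = {"int8": 1, "int16": 2, "int32": 4, "float32": 4}

def conv2d_tile_footprint(tile_oc, tile_oh, tile_ow, in_channels,
                          kh, kw, stride=1, dtype="int8", n_buffers=1):
    """Estimate SRAM bytes needed for one NCHW conv2d output tile.

    Assumes no dilation and that the input, weight and output tiles all
    live in SRAM at once. n_buffers=2 models ping-pong buffering by
    doubling every buffer.
    """
    elem = DTYPE_BYTES[dtype]
    # Input rows/cols needed to produce the output tile, including the halo.
    in_h = (tile_oh - 1) * stride + kh
    in_w = (tile_ow - 1) * stride + kw
    shapes = {
        "input": (in_channels, in_h, in_w),
        "weights": (tile_oc, in_channels, kh, kw),
        "output": (tile_oc, tile_oh, tile_ow),
    }
    sizes = {name: prod(shape) * elem * n_buffers
             for name, shape in shapes.items()}
    sizes["total"] = sum(sizes.values())
    return sizes

single = conv2d_tile_footprint(16, 8, 8, in_channels=32, kh=3, kw=3)
double = conv2d_tile_footprint(16, 8, 8, in_channels=32, kh=3, kw=3,
                               n_buffers=2)
print(single)
print(double["total"])
```

For a 16x8x8 int8 output tile with 32 input channels and a 3x3 kernel this gives 8832 bytes single-buffered and 17664 bytes double-buffered, so tile sizes quickly run into my SRAM limit.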
I am aware of the MicroTVM M2 RFC, but I can’t tell whether any progress has been made on it.
Thanks, Andrei