hi @stoa,
Thanks for the summary; I think that’s roughly correct. You’re right that things are changing fairly rapidly right now. I think even the Project API PR I sent you had become out of date by the time I sent it, so apologies for that.
## Moving forward
I think your proposal makes sense–let me suggest a slight tweak to confirm my understanding:
> The idea is to have the TVM compiler target the STM32 boards (more boards are coming) and launch the STM32 developers on the TVM based tools. Putting the code emitter and the firmware-side API into a separate (from TVM) “template project” is somehow different from this original intention.
So given that the main thing you’re trying to achieve right now is a code generator that produces an STM32-specific API, I can see how the Project API is a bit of a mismatch here. Specifically, you’re not attempting to generate a template project within TVM–it’s more accurate to characterize this as transforming the TVM compiler output to produce an STM32-compatible API.
I think there are two fairly separable pieces to your proposal here:
- Adding a code generator that produces models implementing the STM32 X-Cube AI API (e.g. `ai_create`, etc).
- Reworking the TVM C Runtime APIs to more closely match the STM32 X-Cube API (which itself matches more closely the APIs of other embedded deployment tools, and is therefore a direction microTVM should consider moving in).
I think piece #1 is fairly uncontroversial, and we’ve resolved the main challenges there (e.g. testing). Piece #2 will take longer and has more impact on the scope of the initial effort. Given the amount of development in progress right now, it’ll be hard to settle piece #2 until some of the core improvements (e.g. AOT, memory planning) land. So initially, let’s focus this RFC on merging piece #1.
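For concreteness, here is a rough sketch of the kind of application-facing API piece #1 targets. The `ai_network_*` names and signatures below are simplified assumptions for illustration only, not the actual X-CUBE-AI (or generated) headers:

```c
/* Illustration only: an X-Cube-AI-style model API as an application might
 * call it. The types and signatures are simplified assumptions, not the
 * real X-CUBE-AI or generated headers. */
#include <stddef.h>
#include <stdint.h>

typedef void *ai_handle;

/* Hypothetical generated entry points for a model named "network". */
ai_handle ai_network_create(const void *config);
int ai_network_run(ai_handle network, const int8_t *input, int8_t *output);
void ai_network_destroy(ai_handle network);

/* Typical application flow: create, run, destroy. */
int run_inference(const int8_t *input, int8_t *output) {
  ai_handle net = ai_network_create(NULL);
  if (net == NULL) {
    return -1;
  }
  int status = ai_network_run(net, input, output);
  ai_network_destroy(net);
  return status;
}
```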
Along those lines, I wonder if we could take a middle-ground approach here: the Model Library Format piece is merged to `main`. Is it possible to modify your code generator to consume Model Library Format rather than using internal TVM APIs directly? If needed, we could make changes to Model Library Format to accommodate this (yours would be the first non-TVM use of it, so it wouldn’t surprise me if some parts need tweaking).
I think this would have some advantages:
- It substantially reduces the footprint of your initial commit
- It reduces exposure to the internal APIs, which may continue to change as TVM moves towards v1.0
- It places platform-specific code behind the Model Library Format data structure, which helps to make sure that Model Library Format provides everything needed for a microTVM platform.
- It makes future changes that may impact the STM32 code generator (e.g. AOT, memory pinning) easier to implement.
One question I have is around project generation, though. I do see that STM32 X-Cube AI supports project generation. From UM2536 section 1.2:
> The X-CUBE-AI tool can generate three kinds of projects:
>
> - System performance project running on the STM32 MCU allowing the accurate measurement of the NN inference CPU load and memory usage
> - Validation project that validates incrementally the results returned by the NN, stimulated by either random or user test data, on both desktop PC and STM32 Arm® Cortex®-M-based MCU embedded environment
> - Application template project allowing the building of AI-based application
So just checking here: it seems like you do have some project-generation facility, and I can see how you might prefer to keep project generation centralized within the larger STM32 X-Cube tool rather than invoking TVM via the Project API. The one question that comes to mind is: do you intend to support autotuning on-device? If so, at some point it’d be good to discuss a way forward to integrate the AutoTVM search tool with STM32 X-Cube project generation.
## Other follow-ups
Some additional follow-ups on comments from @manupa-arm and @stoa:
> @areusch @stoa based on the discussion that happened here, what is the current thinking as to who would be producing the address offset table in the case where user prefers TVM to figure out offsets from a single memory pool for all intermediary activations?
This is a great thing to discuss, because this same issue is also present in the AOT PR 7785. I’ll also raise this on the AOT RFC.
To provide some context:
- Currently in microTVM, all memory allocation is handled dynamically. We don’t think this approach makes sense for a bare-metal environment; it’s just there for historical reasons and due to limitations of the TVM Graph Memory Planner.
- In microTVM M2 Roadmap projects 5 and 7, we plan to overhaul the Graph Memory Planner to support (likely) memory pools.
- This would allow the user to provide a map of the available on-device memory to the TVM memory planner at the time of `tvm.relay.build`, and the output of `tvm.relay.build` would change such that each DLTensor referenced in the graph can be associated with a `(memory_pool, offset)` pair. Effectively, this “pins” each Tensor to a mostly-predefined location in memory.
- This will remove the need for any dynamic memory allocation during inference. It also aligns well with what you have implemented. The advantage of doing this in TVM’s Graph Memory Planner is support for heterogeneous memory configurations, e.g. those found with accelerators or multi-core SoCs (a rough sketch of what pinned tensors could look like follows below).
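To make the pinning idea concrete, here is a rough C sketch of what planner-assigned `(memory_pool, offset)` pairs could translate to in firmware. The pool name, size, offsets, and alignment are purely illustrative, not TVM’s actual codegen output:

```c
/* Illustrative sketch only (not actual TVM codegen output): all
 * intermediate activations live at fixed offsets inside one statically
 * allocated pool, so no dynamic allocation happens at inference time. */
#include <stdint.h>

#define ACTIVATIONS_POOL_SIZE 16384 /* sized by the memory planner */

/* Single memory pool for intermediate activations, placed in .bss (or a
 * user-chosen linker section) rather than on the stack. */
static uint8_t g_activations_pool[ACTIVATIONS_POOL_SIZE]
    __attribute__((aligned(16)));

/* (memory_pool, offset) pairs emitted by the planner would then resolve to
 * addresses inside the pool, e.g.: */
#define CONV1_OUTPUT_OFFSET 0
#define POOL1_OUTPUT_OFFSET 4096

static inline uint8_t *pinned_tensor_ptr(uint32_t offset) {
  return &g_activations_pool[offset];
}
```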
Currently in both this PR and in the AOT PR, memory pinning is handled outside the TVM compiler. I think this is a fine approach in the short-term, but we would obviously like to unify with TVM’s memory planner as it becomes sophisticated enough to drive these code generators.
> As a first step, we are starting to look at adding an interface to runtime.Module’s to be able to be queried for their workspace requirement to be consumed by the AoT executor initially. I will post a RFC soon. Is this something you guys have already looked at?
@manupa-arm I think we were planning to handle this by enabling GraphPlanMemory to traverse the whole-program TIR post-scheduling, including the generated AOT TIR. This should allow it to see all `tir.allocate` nodes and get a global view of the required memory. I think this would avoid the need to add more compiler-specific machinery to `runtime::Module`, which will help us in the future.
I think this is a separate discussion from this RFC (but it would be great to get everyone’s input on that RFC).
> We expect tensors (at least input/output) be augmented with the quantization information, so that the application can correctly setup their values.
@stoa I’m curious what this is used for specifically–wouldn’t the application already know this e.g. in a hardcoded pre-processing function? Or does this allow the application to implement a generic pre-processing function?
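To illustrate what I mean by a generic pre-processing function, here is a hedged sketch; the `quant_params_t` struct and helper are hypothetical, not an existing TVM or X-Cube AI API, and only show how per-tensor scale/zero-point reported by the model could be consumed without hardcoding them:

```c
/* Hypothetical illustration: if the generated model API reports per-tensor
 * quantization parameters, the application can quantize inputs generically
 * instead of hardcoding scale/zero-point per model. */
#include <stdint.h>

typedef struct {
  float scale;      /* real_value = scale * (quantized_value - zero_point) */
  int32_t zero_point;
} quant_params_t;

/* Generic float -> int8 quantization usable for any model input whose
 * quantization parameters are exposed by the generated model API. */
static inline int8_t quantize_f32_to_i8(float value, const quant_params_t *qp) {
  float scaled = value / qp->scale;
  int32_t q =
      (int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f) + qp->zero_point;
  if (q > 127) q = 127;
  if (q < -128) q = -128;
  return (int8_t)q;
}
```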
> It seems preferable to allocate tensors (not their storage) inside some ELF section, perhaps the `.data` section, not on stack. Usually, an embedded application developers need to size the stack. Having an unknown size bunch of bytes allocated by the AoT generator on stack may be perturbing to the familiar way of doing things. This is a relatively minor point.
I think TVM has a limit on stack-allocated tensors, but we need to ensure it’s set correctly for µC. Likely, we need to configure this as e.g. a PassContext option.
> The `TVMBackendAllocate` implementation should not be part of the AoT.
I agree that tensor memory allocation and AoT are two separate things. We need to discuss this before merging AoT.
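As one possible shape for that separation, here is a minimal sketch of a platform-provided workspace allocator backed by a static pool, kept independent of the AoT-generated code. The pool size, placement, and bump/no-free policy are illustrative choices, while `TVMBackendAllocWorkspace`/`TVMBackendFreeWorkspace` are the existing TVM C runtime hooks that generated operator code can call for scratch space:

```c
/* Sketch of a platform-provided workspace allocator kept separate from the
 * AoT-generated code. Pool size, placement, and the bump/no-free policy
 * are illustrative; TVMBackendAllocWorkspace / TVMBackendFreeWorkspace are
 * the existing TVM C runtime hooks for operator scratch space. */
#include <stddef.h>
#include <stdint.h>

#define WORKSPACE_POOL_SIZE (32 * 1024)

static uint8_t g_workspace_pool[WORKSPACE_POOL_SIZE] __attribute__((aligned(16)));
static size_t g_workspace_used = 0;

void *TVMBackendAllocWorkspace(int device_type, int device_id, uint64_t nbytes,
                               int dtype_code_hint, int dtype_bits_hint) {
  (void)device_type; (void)device_id;
  (void)dtype_code_hint; (void)dtype_bits_hint;
  /* Round up so subsequent allocations stay 16-byte aligned. */
  uint64_t aligned = (nbytes + 15u) & ~(uint64_t)15u;
  if (aligned > WORKSPACE_POOL_SIZE - g_workspace_used) {
    return NULL; /* out of workspace memory */
  }
  void *ptr = &g_workspace_pool[g_workspace_used];
  g_workspace_used += (size_t)aligned;
  return ptr;
}

int TVMBackendFreeWorkspace(int device_type, int device_id, void *ptr) {
  (void)device_type; (void)device_id; (void)ptr;
  /* Bump allocator: individual frees are no-ops; the application resets
   * g_workspace_used between inferences. */
  return 0;
}
```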