Hello Andrew @areusch
In my mind, some setup function is needed to accomplish:
- initializing memory set aside for tensors and parameters
- configuring accelerators, including starting (possibly backgrounded) transfers of any programming/parameters.
I think that the TVM function for this is the factory function (right now, typically mod["default"]), and the X-Cube equivalent is ai_[<model_name>_]create. Does that match your understanding?
That is exactly right.
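To illustrate the correspondence, here is a minimal sketch of the setup sequence we have in mind (the ai_* signatures are simplified stand-ins in the X-Cube style, not the exact X-CUBE-AI prototypes, and `ACTIVATIONS_SIZE`, `network_params`, `model_setup` are placeholders):

```c
/* Sketch of the setup sequence under discussion. The create call plays
 * the role of TVM's factory function (mod["default"]): it binds the
 * parameters and prepares per-model state before any inference runs. */
#include <stddef.h>
#include <stdint.h>

#define ACTIVATIONS_SIZE 1024 /* placeholder size */

typedef void *ai_handle;

/* Simplified stand-ins for the generated, X-Cube-style entry points. */
extern ai_handle ai_network_create(const uint8_t *params, size_t params_size);
extern int ai_network_init(ai_handle h, uint8_t *activations);

extern const uint8_t network_params[]; /* generated weights blob */
extern const size_t network_params_size;

static uint8_t activations[ACTIVATIONS_SIZE]; /* application-owned memory */

int model_setup(void) {
  ai_handle net = ai_network_create(network_params, network_params_size);
  if (net == NULL) return -1;
  /* Accelerator configuration / backgrounded parameter transfers would
   * be kicked off here. */
  return ai_network_init(net, activations);
}
```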
Apologies, I think I was a bit confused before. IIUC, I think this port aims to implement an API aligned with the X-Cube API, at least for now only aiming to enable deployments to STM32–does that also seem right to you? I’m curious whether this API aims to replace the C runtime and Model-based Module Runtime Interface for all targets or if this would just be confined to STM32 for now.
If I am ambitious, I would say replace it for a whole family of embedded targets. Sorry, perhaps I was not clear earlier. We have observed that several embedded tools have converged on such an API:
- X-CUBE-AI, of course
- TensorFlow Lite for Microcontrollers
- NXP eIQ-GLOW AOT NN Compiler
That seems a good argument for aligning the TVM C API in this direction as well.
We probably need to change the naming, perhaps using a tvm_ai_ prefix instead of just ai_, but this is a detail. The important point is that there are a dozen methods common to the above APIs, and that memory management is left to the main application to handle.
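To make the shape of that common surface concrete, here is a rough sketch of what a tvm_ai_-prefixed API could look like (the names and signatures are purely illustrative assumptions, not a finalized proposal). The defining property is that every buffer is supplied and owned by the application:

```c
/* Hypothetical tvm_ai_ API surface, modeled on the common denominator
 * of X-CUBE-AI / TFLM / eIQ-style interfaces. Illustrative only. */
#include <stddef.h>
#include <stdint.h>

typedef void *tvm_ai_handle;

/* Lifecycle: the application provides the activation memory. */
tvm_ai_handle tvm_ai_create(const void *params, size_t params_size);
int tvm_ai_init(tvm_ai_handle h, uint8_t *activations, size_t activations_size);
int tvm_ai_destroy(tvm_ai_handle h);

/* Introspection and I/O: tensors point into application-owned buffers. */
int tvm_ai_get_input(tvm_ai_handle h, int index, void **data, size_t *nbytes);
int tvm_ai_get_output(tvm_ai_handle h, int index, void **data, size_t *nbytes);

/* Execution. */
int tvm_ai_run(tvm_ai_handle h);
```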
I propose to start with the STM32 code emitter now and work together with the TIR-based AoT effort on converging to a common understanding. This will pave the way for us to move to the TIR-based code generator. We can perhaps also contribute to its development.
Then the next questions I have would be around how you’d like to proceed with this going forward. At present, the STM32 generator PR you’ve proposed has several features that are missing from the microTVM compiler (e.g. memory pinning, AOT, etc.). As we implement these features, will it be possible to incorporate them into this generator as well (i.e., to take advantage of compiler-level improvements we might be able to make, such as graph-level optimization)?
This would be the plan. I can imagine a couple of things we can do with the TIR-based AoT that we cannot with our current code emitter.
If so, it would be great to keep the STM32 API semantically similar to the TVM C runtime API, so that we can later invoke TVM C runtime APIs from the STM32 functions. I suspect these are pretty similar, but just want to understand the goals for code-reviewing your PR. One possible scenario is: when we have a TVM AOT runtime and memory pinning available, we could rework ai_create to instantiate the TVM C AOT runtime. It would also be great to use the STM32 API as inspiration to expand the TVM APIs to provide equivalent functionality. Please let me know your thoughts here!
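For instance (purely illustrative; tvm_aot_create below is an assumed future entry point, not an existing TVM API), ai_create could one day reduce to a thin wrapper over the TVM C AOT runtime:

```c
/* Hypothetical future rework: ai_create instantiates the TVM C AOT
 * runtime. tvm_aot_create() does not exist today -- it stands in for
 * whatever initializer the AOT runtime ends up exposing. */
extern void *tvm_aot_create(const void *params, unsigned long params_size);

typedef void *ai_handle;

ai_handle ai_network_create(const void *params, unsigned long params_size) {
  return (ai_handle)tvm_aot_create(params, params_size);
}
```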
This corresponds entirely to our vision. Great!
So my question here is: in the future, would you be open to using a TVM-side implementation of a memory-pool, statically-allocated memory planner? I think it sounds like that’d be okay, but just confirming.
Yes. We will move away from the JSON graph and base the code emission on the TIR-based TVM structures, including the memory planner.
When we do tensor pinning, I think it’s likely I’ll propose to add some tensor_id (note: different from storage_id, as storage_id could contain multiple tensor_id) to TVMBackendAllocWorkspace, and a lookup table could just return a pointer into the pre-allocated memory pool. TVMBackendFreeWorkspace would become a no-op. Will that work for you guys?
That is good. Just keep in mind that these memory pools should remain open to static allocation as a section via a linker script, to static allocation as a table in the main application (.data), and to dynamic allocation via whatever allocator the application may choose.
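Here is a minimal sketch of the kind of generated code this implies (the names carry a Pinned suffix to make clear these are not the current TVMBackendAllocWorkspace/TVMBackendFreeWorkspace signatures; pool size and offsets are placeholders for memory-planner output):

```c
/* Sketch of pinned-tensor allocation. The pool shown here is a plain
 * .data array; it could equally be placed in a dedicated section via
 * the linker script, or handed in at runtime by the application. */
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4096 /* placeholder: computed by the memory planner */
#define NUM_TENSORS 8  /* placeholder: number of pinned tensors */

static uint8_t g_tensor_pool[POOL_SIZE] __attribute__((aligned(16)));

/* tensor_id -> byte offset into the pool, emitted by the planner. */
static const size_t g_tensor_offset[NUM_TENSORS] = { 0 /* generated */ };

void *TVMBackendAllocWorkspacePinned(int device_type, int device_id,
                                     uint32_t tensor_id) {
  (void)device_type;
  (void)device_id;
  return &g_tensor_pool[g_tensor_offset[tensor_id]];
}

int TVMBackendFreeWorkspacePinned(int device_type, int device_id, void *ptr) {
  (void)device_type;
  (void)device_id;
  (void)ptr;
  return 0; /* no-op: lifetimes are resolved statically */
}
```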
- consider removing the need to use PackedFunc looked-up by string name, and instead provide more natural C wrappers around those functions
This is already the case.
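For reference, this is the kind of wrapper we mean (schematic; lookup_packed is a placeholder for the runtime's actual resolution mechanism, not a real TVM call):

```c
/* Schematic C wrapper hiding the string-based PackedFunc lookup. The
 * string lookup happens at most once; subsequent calls go through the
 * cached function pointer. */
#include <stddef.h>

typedef void *tvm_ai_handle;
typedef int (*run_fn)(tvm_ai_handle h);

extern void *lookup_packed(const char *name); /* placeholder resolver */

static run_fn g_cached_run;

int tvm_ai_run(tvm_ai_handle h) {
  if (g_cached_run == NULL) {
    g_cached_run = (run_fn)lookup_packed("run"); /* one-time lookup */
  }
  return g_cached_run(h);
}
```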
- consider creating a mapping from PackedFunc string name to a global symbol name to shortcut this lookup, as they won’t likely be dynamically overridden in embedded applications.
We will add an API method for such a lookup, implementing the mapping.
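Something along these lines (illustrative; network_run/network_setup stand in for whatever symbols the code emitter actually generates):

```c
/* Illustrative name-to-symbol map: the string lookup degenerates to a
 * small compile-time table, and embedded callers that know the symbol
 * can bypass the lookup entirely. */
#include <stddef.h>
#include <string.h>

typedef int (*packed_c_fn)(void *args, int num_args);

extern int network_run(void *args, int num_args);   /* generated symbol */
extern int network_setup(void *args, int num_args); /* generated symbol */

static const struct {
  const char *name;
  packed_c_fn fn;
} g_func_map[] = {
    {"run", network_run},
    {"setup", network_setup},
};

packed_c_fn tvm_ai_lookup(const char *name) {
  for (size_t i = 0; i < sizeof(g_func_map) / sizeof(g_func_map[0]); ++i) {
    if (strcmp(g_func_map[i].name, name) == 0) return g_func_map[i].fn;
  }
  return NULL;
}
```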
Would it be possible to check in a docker container, e.g. tlcpack/ci-stm32, which could run this in our CI? Then we can just make it a first-class example and place it in apps/microtvm/stm32 or a similar sub-directory of microtvm of your choosing.
Yes. Noted.
The Module Library Format seems not fully finalized yet

That’s fine. I will generate the structure as per your RFC proposal (no crt), and we can refine it from there. This is a minor detail.
Actions for us:
Re-submit the PR with this:
- Move to generating Module Library Format (as it is for now).
- Provide the docker and a test application for the sanity CI.
- Move to the Project API on the demo side (structure + microtvm_api_server.py), implementing the Standalone Demo Project Generator based on your PoC.
We will continue the discussion on the C runtime API. How should we involve the AoT people? We can contribute to the development if necessary.
Does this work for you?
Cheers
Arthur