Standalone code generator and C runtime for STM32 bare-metal devices
Background
This RFC aims to collect TVM community feedback on the following subjects:
- Standalone compilation targeting embedded bare-metal platforms
- ML user API for embedded applications
- Integration of TVM with standard embedded development tools and projects
The RFC falls into the micro TVM line of development and complements projects outlined in the µTVM M2 Roadmap, in particular these two:
- AoT, which proposes a standalone code generator for embedded targets and has been outstanding in the TVM community for a while now.
- Project API, a recent RFC proposing a standard “interface layer” between TVM and the generated embedded firmware code.
This RFC has an associated PR implementation, including a demo application that has been tested on a number of ML models with the STM32 Discovery ARM-based development board. The PR also serves as a proof of concept for the concepts outlined in the above AoT RFC.
Objectives
The first objective of this proposal is to move forward with implementing a standalone compilation flow from TVM targeting embedded and bare-metal devices. As stated in the AoT RFC, having to interpret a JSON graph at runtime is a problem in embedded and bare-metal environments:
- The workflow is very hard to implement on a microcontroller, since memory is usually a costly resource in embedded environments, and the JSON file is usually quite large.
- The memory allocation in the current TVM stack is split, with inter-operator memory managed at the JSON/Relay level while the intra-operator memory is managed at the TIR level.
Additionally,
- JSON handling incurs extra processing overhead
- Dynamic library handling incurs extra processing and memory overhead
- Data placement in memory, given a very diversified and specialized set of memory hierarchies, is difficult to handle.
Indeed, the embedded application deployment flow is different from TVM's module deployment via a JSON graph and a dynamically loaded operators library. A typical application deployment in resource-constrained embedded environments is done by downloading a standalone binary executable image onto the target device. From the user perspective, the ML model is embedded inside a larger main application. In such an environment, the resource management (memory, etc.) is handled by this main application.
The issue was first addressed in the AoT RFC, which proposes the generation of a standalone C implementation for ML models and the definition of an associated C runtime API. Our RFC proposal is different from the AoT in two ways:
- Our approach is more lightweight in terms of the engineering and development effort: our code emitter takes the TVM-generated JSON graph as input and sits on top of the TVM module, while the AoT implements a full-blown code generator integrated with the TVM TIR representation. The two approaches may be complementary, as the lightweight code emitter allows quickly and unintrusively putting in place a code generator for a new target.
- We propose a richer embedded ML API drawn from two well-established and robust development frameworks, X-CUBE-AI and TensorFlow Lite for Microcontrollers. This API closely follows current industry trends and will benefit wider TVM adoption.
The AoT is currently a work in progress. In the meantime, we have developed a working implementation of the standalone embedded development flow for the STM32 microcontrollers. We propose to integrate this development into the TVM framework, at least as an intermediate step until the fully functional AoT is implemented and we can put in place an STM32-specific AoT code generator. This will enable:
- Quick access to STM32 development for the TVM community, boosting TVM integration with the STM32 development tools.
- We will probably need to develop not one but a number of standalone code generators. For example, a sequential executor such as the one we generate with this RFC will likely not fit a multi-core target platform, where operators may need to be wrapped into some sort of threading code, or an accelerator-enabled platform, where it may be necessary to generate some communication and synchronization code. Therefore, the lightweight approach will enable quick and early implementation of new code generators for different target platforms.
The memory management issue is not yet fully addressed within the TVM framework. Typically, in an embedded environment, the main application requires full and fine-grained control of memory management. With the AoT, the main application would have limited data placement possibilities, constrained by the implementation of the runtime memory manager. We propose to leave full freedom of memory management to the main application (no TVM-integrated memory manager). This enables standard and familiar memory management techniques, such as using linker scripts. Another existing project that follows this direction is the µTVM M2 Roadmap item to estimate the memory footprint of the graph from TVMC.
Finally, in an embedded application development environment, TVM needs to be integrated with the standard embedded development flows, such as STM32CubeMX. Such frameworks typically include a large set of tools that are outside the scope of TVM (target board HW configuration, etc.). The issue is considered in Project API, which proposes to introduce a new Project API with the main goal of allowing TVM to drive builds on firmware platforms for the purpose of AutoTVM. Our proposed PR implements a number of building blocks that fit well with the Project API framework.
Below, we explain our proposed approach in detail and highlight some differences from the earlier RFC proposals.
Standalone Code Generation
The TVM compiler generates three objects:
- The JSON graph of the ML model
- The C library of the kernels (targeted at the Arm devices for the STM32 platforms)
- The params dictionary
In order to enable standalone code generation that better fits current embedded development practice, we propose the following approach:
- Perform the JSON file processing at compile time, instead of at runtime. This is achieved by implementing a code emitter that, given a TVM Module, generates a standalone C implementation of the graph processing for a given target platform.
- Define a runtime C API that exposes graph processing functions to the main application.
- Leave entirely the memory management and data placement to the main application.
Code Emitter
We propose to build a standalone C implementation of ML models from the TVM Module, instead of processing the JSON graph at runtime. This implementation is generated by the code emitter that sits on top of the TVM Module and is implemented in Python. The code emitter currently targets the STM32 microcontrollers.
The C implementation is exposed to the application via the ai_model_info descriptor of the compiled model:
typedef struct {
  const char * name;
  const char * datetime;
  const char * revision;
  const char * tool_version;
  const char * api_version;
  uint16_t n_nodes;
  uint8_t n_inputs;
  uint8_t n_outputs;
  uint32_t activations_size;
  uint32_t params_size;
  ai_ptr activations;
  ai_tensor ** inputs;
  ai_tensor ** outputs;
  const ai_ptr (*ai_get_params)(void);
  ai_status (*ai_create)(const ai_ptr weights, const ai_ptr activations);
  ai_status (*ai_destroy)();
  ai_status (*ai_run)(ai_tensor *input[], ai_tensor *output[]);
} ai_model_info;
The code emitter generates the C code including:
- Instantiation of all tensors (activations and weights). The tensors' data fields (the data buffer addresses) remain unassigned until runtime.
- A small number of interface functions for model deployment and execution
The code emitter optionally instantiates the built-in 'activations' memory pool (see Memory Management below). In this case, ai_model_info.activations contains the address of the built-in pool, otherwise NULL. Model inputs/outputs data can also be optionally allocated in this memory pool, sharing memory with the model activation buffers.
The emitter generates the following interface functions (a sketch of a generated network.c follows this list):
- ai_get_params: returns the runtime memory address of the params
- ai_create: instantiates a model in device memory
- ai_destroy: removes a model instance from the device memory
- ai_run: executes the model graph, calling operators from the kernels library
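To make this more concrete, a generated network.c could look roughly as follows. This is a hedged sketch only: the actual emitted code is in the PR, and the names used here (the ai_runtime.h header, the network_* identifiers, and the *_SIZE macros) are illustrative assumptions, not the PR's exact output.

#include "ai_runtime.h"   /* header name assumed; provides ai_model_info, ai_tensor, ai_status, AI_STATUS_OK */

#define NETWORK_ACTIVATIONS_SIZE  (16 * 1024)   /* placeholder sizes for the sketch */
#define NETWORK_PARAMS_SIZE       (32 * 1024)
#define NETWORK_N_NODES           12

/* built-in 'activations' pool (optional) */
static uint8_t network_activations[NETWORK_ACTIVATIONS_SIZE];

/* tensors are instantiated by the emitter; their data fields stay NULL until ai_create */
static ai_tensor network_input_0  = { .dltensor = { .data = NULL /* shape, dtype, ... */ } };
static ai_tensor network_output_0 = { .dltensor = { .data = NULL /* shape, dtype, ... */ } };

static ai_tensor *network_inputs[]  = { &network_input_0 };
static ai_tensor *network_outputs[] = { &network_output_0 };

/* emitted interface functions (stub bodies here; the emitter generates the real implementations) */
static const ai_ptr network_get_params(void) { return (ai_ptr)0; }
static ai_status network_create(const ai_ptr weights, const ai_ptr activations) { return AI_STATUS_OK; }
static ai_status network_destroy(void) { return AI_STATUS_OK; }
static ai_status network_run(ai_tensor *input[], ai_tensor *output[]) { return AI_STATUS_OK; }

/* the model descriptor exposed to the main application */
ai_model_info network_model = {
  .name             = "network",
  .n_nodes          = NETWORK_N_NODES,
  .n_inputs         = 1,
  .n_outputs        = 1,
  .activations_size = NETWORK_ACTIVATIONS_SIZE,
  .params_size      = NETWORK_PARAMS_SIZE,
  .activations      = (ai_ptr)network_activations,   /* NULL if no built-in pool is generated */
  .inputs           = network_inputs,
  .outputs          = network_outputs,
  .ai_get_params    = network_get_params,
  .ai_create        = network_create,
  .ai_destroy       = network_destroy,
  .ai_run           = network_run,
};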
Our implementation is fairly similar to the one proposed in the AoT, with the following differences:
- Our ai_model_info model descriptor contains more information compared to the tvm_model_t descriptor from the AoT. The additional information is drawn from our experience with the X-CUBE-AI and TensorFlow Lite for Microcontrollers tools.
- In addition to operators.c (model kernels implementation) and network.c (model graph implementation), we also generate network_data.c containing a table with model parameters (weights). This table is assigned to the 'params' memory pool (see Memory Management below) and, at link time, is allocated an application-specified memory region via the linker script.
An STM32 code emitter for the STM32 MCU-based boards has been implemented and can be seen here: PR. Similar emitters can be quickly created targeting any other platform, for example a multi-core parallel platform.
Memory Management
The ML model memory is managed via memory pools. Model activations are placed into the 'activations' pool, and model params are placed into the 'params' pool. The 'activations' memory pool can be set up by the main application or built in with the model at model generation time. The 'params' memory pool is set up at model generation time. Statically set up pools are allocated memory at link time via the application-specified linker script. The 'activations' memory pool can also be dynamically allocated at runtime by the main application on the heap.
The application manages its memory allocation via several mechanisms:
- The TVM compiler communicates the number of activations and params tensors and their buffer assignment via the 'storage_id' JSON graph attribute.
- The code emitter assigns the application data, 'activations' and 'params' pools to dedicated ELF sections (except for dynamically allocated data).
- The linker performs the placement of ELF sections based on the application-specified linker script. An arbitrary target platform memory hierarchy (FLASH, RAM, external, internal, etc.) is thus supported without TVM having explicit knowledge of it (see the sketch after this list).
- The main application manages any static or dynamic runtime memory allocation that may be required. For example, it may be necessary that two models share their 'activations' pools, or that two instances of the same model have separate input and output buffers, etc.
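As an illustration of this link-time placement, the pools can simply be put into dedicated sections that the application's linker script maps to the desired memory regions. The sketch below assumes GNU-style section attributes and section names of our own choosing; the actual section names used by the PR may differ.

#include <stdint.h>

#define ACTIVATIONS_POOL_SIZE  (64 * 1024)   /* placeholder size */

/* 'activations' pool: the linker script can map .nn_activations to internal RAM,
 * external RAM, a DMA-capable region, etc. */
static uint8_t activations_pool[ACTIVATIONS_POOL_SIZE]
    __attribute__((aligned(8), section(".nn_activations")));

/* 'params' pool (the weights table normally emitted in network_data.c):
 * the linker script can keep .nn_params in FLASH or relocate it to RAM. */
const uint8_t params_pool[] __attribute__((section(".nn_params"))) = {
  0x00, 0x01, 0x02, 0x03,   /* ... weight bytes ... */
};

TVM itself needs no knowledge of the board's memory map; only the application's linker script does.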
The Runtime C API
In a typical embedded application use-case, an ML model is managed under the control of the main application, more precisely:
- the model is placed in memory (activations, weights, heap)
- the model is given inputs
- the model is run
- the outputs are recovered by the main application for further processing
We propose a slim runtime API for developing standalone embedded ML applications, drawn from our experience with the X-CUBE-AI and TensorFlow Lite for Microcontrollers tools. The objectives are:
- Efficient implementation in terms of performance and a minimal memory footprint.
- Memory management under the control of the main application. For example, any runtime memory allocation can be avoided by statically placing data in appropriate memory regions at link time. This enables easy experimentation with data placement, and flexibility.
- The possibility to build multi-model applications combining separately compiled models. These models can optionally share their activation and/or inputs/outputs memory.
- The possibility to include multiple instantiations of the same model in a single application.
- Enable a generic main application with all model-specific information available from the model implementation.
Our slim runtime API provides access to the TVM generated model implementation via a small model interface.
First, the ai_model_info descriptor is directly visible from the main application. It holds all information about the model. For example, such information includes the number of model inputs and outputs, the associated tensors, their types and shapes, etc. Details are available from this PR.
Several models can be linked together into a single application, each one with its own model descriptor.
A model descriptor is instantiated into a deployed model instance by calling the function:
ai_status ai_create (ai_model_info * nn, ai_ptr activations, ai_handle *handle);
The function returns a particular instance of a model, which is an opaque handle hiding the implementation details. During the ai_create call, the data fields of the activations and params tensors (their buffer addresses) are set up.
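For instance, a main application could use the built-in pool when the emitter instantiated one and otherwise allocate the 'activations' pool on the heap. A minimal sketch (the descriptor name network_model is an assumption; error handling abbreviated):

extern ai_model_info network_model;   /* emitted model descriptor (name assumed) */

ai_handle handle = NULL;

/* use the built-in 'activations' pool if present, otherwise allocate it on the heap */
ai_ptr activations = network_model.activations;
if (activations == NULL)
  activations = (ai_ptr)malloc(network_model.activations_size);

ai_status err = ai_create(&network_model, activations, &handle);
if (err != AI_STATUS_OK) {
  /* report the error and abort initialization */
}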
The size and memory address of the 'activations' and 'params' pools can be retrieved at runtime with:
uint32_t ai_get_activations_size (ai_handle handle);
ai_ptr ai_get_activations (ai_handle handle);
uint32_t ai_get_params_size (ai_handle handle);
const ai_ptr ai_get_params (ai_handle handle);
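For example, the application can report the memory budget of a freshly created instance (a minimal sketch, assuming ai_ptr is a pointer type):

printf("activations: %u bytes at %p\n",
       (unsigned)ai_get_activations_size(handle), (void *)ai_get_activations(handle));
printf("params:      %u bytes at %p\n",
       (unsigned)ai_get_params_size(handle), (void *)ai_get_params(handle));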
We propose to extend the DLTensor with additional quantization information:
typedef struct {
  /*!
   * \brief The TVM tensor.
   */
  DLTensor dltensor;
  /*!
   * \brief The quantization info, if quantized
   */
  ai_quantization_info * quant;
} ai_tensor;
The quantization information is needed by the main application for processing model inputs and outputs. There may be one additional use - debugging/monitoring the intermediate activations, but it is still unclear how useful this can be.
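As an illustration, input/output (de)quantization on the application side could look like the sketch below. Note that the exact layout of ai_quantization_info is not spelled out in this RFC; per-tensor scale and zero_point fields are assumed here purely for illustration.

#include <math.h>
#include <stdint.h>

/* Hypothetical: 'scale' and 'zero_point' fields are assumed, not taken from the PR. */
static int8_t quantize_input(float value, const ai_quantization_info *q)
{
  return (int8_t)lroundf(value / q->scale + (float)q->zero_point);
}

static float dequantize_output(int8_t value, const ai_quantization_info *q)
{
  return ((float)value - (float)q->zero_point) * q->scale;
}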
The main application can query a model instance for various pieces of information (a usage sketch follows this list), such as:
int32_t ai_get_input_size (ai_handle handle);
int32_t ai_get_output_size (ai_handle handle);
ai_tensor * ai_get_input (ai_handle handle, int32_t index);
ai_tensor * ai_get_output (ai_handle handle, int32_t index);
etc.
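For example, a generic main application can walk over all inputs of an instance and allocate any buffer the emitter left unassigned, reusing the get_dltensor and get_tensor_size helpers that appear in the example below (a sketch):

for (int32_t i = 0; i < ai_get_input_size(handle); i++) {
  ai_tensor *t = ai_get_input(handle, i);
  DLTensor *dl = get_dltensor(t);
  if (dl->data == NULL) {                     /* not allocated in a built-in pool */
    dl->data = (ai_ptr)malloc(get_tensor_size(t));
  }
}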
The ai_run function executes the TVM model graph, mimicking the GraphRuntime execution:
ai_status ai_run (ai_handle handle);
For the current STM32 target, this is a simple sequential single processor execution that calls each model kernel one at a time.
All API functions return an ai_status value and set the TVMLastError in case of a problem. This can be retrieved by the main application via:
const char * ai_get_error (ai_handle handle);
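A minimal error-handling sketch:

ai_status err = ai_run(handle);
if (err != AI_STATUS_OK) {
  /* report the last error recorded by the runtime */
  printf("ai_run failed: %s\n", ai_get_error(handle));
}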
The above runtime API is more explicit than the minimalist runtime C API proposed by the AoT, which consists mainly of two functions:
// Helper function to initialize a DLTensor
DLTensor TVMInitializeDLTensor(void *data, DLDataType* dtype, DLContext* ctx, int64_t* shape, int64_t num_dim);
// Helper function to run the `run_func` within the generated library network.o.
tvm_crt_error_t TVMRuntime_Run(tvm_model_t *model, DLTensor *inputs, int num_inputs, DLTensor *outputs, int num_outputs);
We make several observations:
- Full information about the compiled model is not available.
- Some useful functionality is missing, for example, the input/output quantization information.
- The memory allocator is not under the main application's control. In the embedded development flow this is a critical point: the memory management is typically handled by the main application.
- The inputs/outputs buffers cannot be shared with the activations storage, which can be important for reducing the memory footprint of small models.
In both RFCs, the model implementation is exposed to the main application via a slim API layer. However, this RFC's API is richer, giving more flexibility, in particular for memory management. Another minor difference is that we do not create or manage model tensors; they are built in with the model implementation. However, the API provides the main application with functions for accessing these tensors and managing their storage.
Example
ai_handle handle;   /* instance of the model */
ai_ptr data_in;     /* reference to the input buffer */
ai_ptr data_out;    /* reference to the output buffer */

int ai_init(void)
{
  /* AI associated configuration */
  ...
  /* discover an AI model from the current application */
  ai_model_info *nn = ...
  /*
   * ai_create calls the model-specific create function.
   */
  ai_status err = ai_create(nn, AI_MODEL_activations(nn), &handle);
  if (err != AI_STATUS_OK) {
    ...
  }
  /* handle is globally set, if no error */

  /*
   * Allocate input/output tensors
   */
  /* sanity IO number check */
  if (ai_get_input_size(handle) != 1 ||
      ai_get_output_size(handle) != 1)
    return -1;

  DLTensor *dl_tensor;

  ai_tensor *input_tensor = ai_get_input(handle, 0);
  dl_tensor = get_dltensor(input_tensor);
  /* built-in allocated tensor? */
  if (dl_tensor->data == NULL) {
    uint32_t bytes = get_tensor_size(input_tensor);
    dl_tensor->data = (ai_ptr)malloc(bytes);
  }
  data_in = dl_tensor->data;

  ai_tensor *output_tensor = ai_get_output(handle, 0);
  dl_tensor = get_dltensor(output_tensor);
  if (dl_tensor->data == NULL) {
    uint32_t bytes = get_tensor_size(output_tensor);
    dl_tensor->data = (ai_ptr)malloc(bytes);
  }
  data_out = dl_tensor->data;

  return 0;
}

void ai_deinit(void)
{
  /* release the allocated resources (if necessary) */
  ...
  /* deallocate the model instance */
  ai_status err = ai_destroy(handle);
  if (err != AI_STATUS_OK) {
    ...
  }
}

int main(void)
{
  /* MCU configuration */
  ...
  /* Model init */
  ai_init();

  /* Main process loop */
  while (cond) {
    /* 1 - Acquire, pre-process and fill the input buffer */
    acquire_and_pre_process_data(data_in);

    /* 2 - Call the inference engine */
    ai_status err = ai_run(handle);
    if (err != AI_STATUS_OK) {
      ...
    }

    /* 3 - Post-process the predictions */
    post_process(data_out);
  }

  ai_deinit();
}
Relation to the Project API RFC
This RFC has two components:
- The STM32 code emitter and its associated runtime support described above
- The STM32 demo application
The first component, the STM32 code emitter and its runtime, belongs to the compiler system (TVM) rather than to a separate standalone project. The code emitter takes a TVM Module and generates a C implementation of the model graph. It is tightly coupled to the TVM code base. The code emitter also depends on a particular runtime support, similarly to a C compiler, e.g. gcc relying on the gcc runtime libraries. Preferably, the objective here would be to have a generic runtime API that fits different target platforms and deployment scenarios, while the implementation would be target-specific (similar to the GraphRuntime). However, we can imagine a variety of deployment scenarios and execution models, which may require different runtime APIs. This point is still to be clarified.
The second component, the STM32 demo application, fits well with the Project API proposal, roughly following the ‘Standalone Demo Project Generator’ flow. It may be considered as implementing two of the Project API building blocks:
- A project template
- A transport layer
The demo application can be eventually integrated with the Project API, as well as within the upcoming AutoTuning infrastructure.
Conclusion
In this RFC we outlined a proposal for standalone code generation for ML models in embedded and bare-metal development environments. A PR targeting the STM32 microcontrollers is also available. The proposal falls in line with developments already underway in the TVM community:
- AoT code generation: We propose a complementary, more lightweight approach. C code for the model is generated, enabling a standard embedded development flow. We expose more model information to the main application compared to the AoT. Our lightweight approach can be used to quickly develop standalone code generators for new targets.
- Embedded Runtime C API: We propose a richer application API compared to the AoT, based on our experience with an industrial embedded development environment.
- Project Integration: We propose an STM32 demo application that has been tested on a number of ML models with the STM32 Discovery ARM-based development board. We propose to contribute several building blocks that can be integrated with the framework from the Project API.
Please share your thoughts/feedback!