Standalone code generator and C runtime for STM32 bare-metal devices
Background
This RFC aims to collect TVM community feedback on the following subjects:
- Standalone compilation targeting embedded bare-metal platforms
- ML user API for embedded applications
- Integration of TVM with standard embedded development tools and projects
The RFC falls into the micro TVM line of development and complements projects outlined in the µTVM M2 Roadmap, in particular these two:
- AoT, which proposes a standalone code generator for embedded targets and has been outstanding in the TVM community for a while now.
- Project API, a recent RFC proposing a standard “interface layer” between TVM and the generated embedded firmware code.
This RFC has an associated PR implementation, including a demo application that has been tested on a number of ML models with the STM32 Discovery ARM-based development board. The PR also serves as a proof of concept for the concepts outlined in the above AoT RFC.
Objectives
The first objective of this proposal is to move forward with implementing a standalone compilation flow from TVM targeting embedded and bare-metal devices. As stated in the AoT RFC, having to interpret a JSON graph at runtime is a problem in embedded and bare-metal environments:
- The workflow is very hard to implement on a microcontroller, since memory is usually a costly resource in embedded environments, and the JSON file is usually quite large.
- The memory allocation in the current TVM stack is split, with inter-operator memory managed at the JSON/Relay level while the intra-operator memory is managed at the TIR level.
Additionally,
- JSON handling incurs extra processing overhead
- Dynamic library handling incurs extra processing and memory overhead
- Data placement in memory, given a very diversified and specialized set of memory hierarchies, is difficult to handle.
Indeed, the embedded application deployment flow is different from TVM's module deployment via a JSON graph and a dynamically loaded operators library. A typical application deployment in resource-constrained embedded environments is done by downloading a standalone binary executable image onto the target device. From the user perspective, the ML model is embedded inside a larger main application. In such an environment, the resource management (memory, etc.) is handled by this main application.
The issue was first addressed in the AoT RFC, which proposes the generation of a standalone C implementation for ML models and the definition of an associated C runtime API. Our RFC proposal is different from the AoT in two ways:
- Our approach is more lightweight in terms of the engineering and development effort: our code emitter takes the TVM-generated JSON graph as input and sits on top of the TVM module, while the AoT implements a full-blown code generator integrated with the TVM TIR representation. The two approaches may be complementary, as the lightweight code emitter allows quickly and unintrusively putting in place a code generator for a new target.
- We propose a richer embedded ML API drawn from two well-established and robust development frameworks, X-CUBE-AI and TensorFlow Lite for Microcontrollers. This API closely follows current industry trends and will benefit wider TVM adoption.
The AoT is currently a work in progress. In the meantime, we have developed a working implementation of the standalone embedded development flow for the STM32 microcontrollers. We propose to integrate this development into the TVM framework, at least as an intermediate step until the fully functional AoT is implemented and we can put in place an STM32-specific AoT code generator. This will enable:
- Quick access to STM32 development for the TVM community, boosting TVM integration with the STM32 development tools.
- We will probably need to develop not one but a number of standalone code generators. For example, a sequential executor such as the one we generate with this RFC will likely not fit a multi-core target platform, where operators may need to be wrapped into some sort of threading code, or an accelerator-enabled platform, where it may be necessary to generate some communication and synchronization code. Therefore, the lightweight approach will enable quick and early implementation of new code generators for different target platforms.
The memory management issue is not yet fully addressed within the TVM framework. Typically, in an embedded environment, the main application requires full and fine-grained control of memory management. With the AoT, the main application would have limited data placement possibilities, constrained by the implementation of the runtime memory manager. We propose to leave full freedom of memory management to the main application (no TVM-integrated memory manager). This enables standard and familiar memory management techniques, such as using linker scripts. Another existing project that follows this direction is the µTVM M2 Roadmap item to estimate the memory footprint of the graph from TVMC.
Finally, in an embedded application development environment, TVM needs to be integrated with the standard embedded development flows, such as STM32CubeMX. Such frameworks typically include a large set of tools that are outside the scope of TVM (target board HW configuration, etc.). The issue is considered in Project API, which proposes to introduce a new Project API with the main goal of allowing TVM to drive builds on firmware platforms for the purpose of AutoTVM. Our proposed PR implements a number of building blocks that fit well with the Project API framework.
Below, we explain our proposed approach in detail and highlight some differences from the earlier RFC proposals.
Standalone Code Generation
The TVM compiler generates three objects:
- The JSON graph of the ML model
- The C library of the kernels (targeted at the Arm devices for the STM32 platforms)
- The params dictionary
In order to enable standalone code generation that better fits current embedded development practice, we propose the following approach:
- Perform the JSON file processing at compile time, instead of at runtime. This is achieved by implementing a code emitter that, given a TVM Module, generates a standalone C implementation of the graph processing for a given target platform.
- Define a runtime C API that exposes graph processing functions to the main application.
- Leave entirely the memory management and data placement to the main application.
Code Emitter
We propose to build a standalone C implementation of ML models from the TVM Module, instead of processing the JSON graph at runtime. This implementation is generated by the code emitter that sits on top of the TVM Module and is implemented in Python. The code emitter currently targets the STM32 microcontrollers.
The C implementation is exposed to the application via the ai_model_info descriptor of the compiled model:
typedef struct {
  const char * name;
  const char * datetime;
  const char * revision;
  const char * tool_version;
  const char * api_version;
  uint16_t n_nodes;
  uint8_t n_inputs;
  uint8_t n_outputs;
  uint32_t activations_size;
  uint32_t params_size;
  ai_ptr activations;
  ai_tensor ** inputs;
  ai_tensor ** outputs;
  const ai_ptr (*ai_get_params)(void);
  ai_status (*ai_create)(const ai_ptr weights, const ai_ptr activations);
  ai_status (*ai_destroy)();
  ai_status (*ai_run)(ai_tensor *input[], ai_tensor *output[]);
} ai_model_info;
The code emitter generates the C code including:
- Instantiation of all tensors (activations and weights). The tensors' data fields (the data buffer addresses) remain unassigned until runtime.
- A small number of interface functions for model deployment and execution
The code emitter optionally instantiates the built-in 'activations' memory pool (see Memory Management below). In this case, ai_model_info.activations contains the address of the built-in pool, otherwise NULL. Model inputs/outputs data can also be optionally allocated in this memory pool, sharing memory with the model activation buffers.
The emitter generates the following interface functions (a sketch of a generated network.c follows this list):
- ai_get_params: returns the runtime memory address of the params
- ai_create: instantiates a model in device memory
- ai_destroy: removes a model instance from the device memory
- ai_run: executes the model graph, calling operators from the kernels library
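To make this more concrete, a generated network.c could look roughly as follows. This is a hedged sketch only: the actual emitted code is in the PR, and the names used here (the ai_runtime.h header, the network_* identifiers, and the *_SIZE macros) are illustrative assumptions, not the PR's exact output.

#include "ai_runtime.h"   /* header name assumed; provides ai_model_info, ai_tensor, ai_status, AI_STATUS_OK */

#define NETWORK_ACTIVATIONS_SIZE  (16 * 1024)   /* placeholder sizes for the sketch */
#define NETWORK_PARAMS_SIZE       (32 * 1024)
#define NETWORK_N_NODES           12

/* built-in 'activations' pool (optional) */
static uint8_t network_activations[NETWORK_ACTIVATIONS_SIZE];

/* tensors are instantiated by the emitter; their data fields stay NULL until ai_create */
static ai_tensor network_input_0  = { .dltensor = { .data = NULL /* shape, dtype, ... */ } };
static ai_tensor network_output_0 = { .dltensor = { .data = NULL /* shape, dtype, ... */ } };

static ai_tensor *network_inputs[]  = { &network_input_0 };
static ai_tensor *network_outputs[] = { &network_output_0 };

/* emitted interface functions (stub bodies here; the emitter generates the real implementations) */
static const ai_ptr network_get_params(void) { return (ai_ptr)0; }
static ai_status network_create(const ai_ptr weights, const ai_ptr activations) { return AI_STATUS_OK; }
static ai_status network_destroy(void) { return AI_STATUS_OK; }
static ai_status network_run(ai_tensor *input[], ai_tensor *output[]) { return AI_STATUS_OK; }

/* the model descriptor exposed to the main application */
ai_model_info network_model = {
  .name             = "network",
  .n_nodes          = NETWORK_N_NODES,
  .n_inputs         = 1,
  .n_outputs        = 1,
  .activations_size = NETWORK_ACTIVATIONS_SIZE,
  .params_size      = NETWORK_PARAMS_SIZE,
  .activations      = (ai_ptr)network_activations,   /* NULL if no built-in pool is generated */
  .inputs           = network_inputs,
  .outputs          = network_outputs,
  .ai_get_params    = network_get_params,
  .ai_create        = network_create,
  .ai_destroy       = network_destroy,
  .ai_run           = network_run,
};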
Our implementation is fairly similar to the one proposed in the AoT, with the following differences:
- Our ai_model_info model descriptor contains more information compared to the tvm_model_t descriptor from the AoT. The additional information is drawn from our experience with the X-CUBE-AI and TensorFlow Lite for Microcontrollers tools.
- In addition to operators.c (model kernels implementation) and network.c (model graph implementation), we also generate network_data.c containing a table with model parameters (weights). This table is assigned to the 'params' memory pool (see Memory Management below) and, at link time, is allocated an application-specified memory region via the linker script.
An STM32 code emitter for the STM32 MCU-based boards has been implemented and can be seen here: PR. Similar emitters can be quickly created targeting any other platform, for example a multi-core parallel platform.
Memory Management
The ML model memory is managed via memory pools. Model activations are placed into the 'activations' pool, and model params are placed into the 'params' pool. The 'activations' memory pool can be set up by the main application or built in with the model at model generation time. The 'params' memory pool is set up at model generation time. Statically set up pools are allocated memory at link time via the application-specified linker script. The 'activations' memory pool can also be dynamically allocated at runtime by the main application on the heap.
The application manages its memory allocation via several mechanisms:
- The TVM compiler communicates the number of activations and params tensors and their buffer assignment via the 'storage_id' JSON graph attribute.
- The code emitter assigns the application data, 'activations' and 'params' pools to dedicated ELF sections (except for dynamically allocated data).
- The linker performs the placement of ELF sections based on the application-specified linker script. An arbitrary target platform memory hierarchy (FLASH, RAM, external, internal, etc.) is thus supported without TVM having explicit knowledge of it (see the sketch after this list).
- The main application manages any static or dynamic runtime memory allocation that may be required. For example, it may be necessary that two models share their 'activations' pools, or that two instances of the same model have separate input and output buffers, etc.
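As an illustration of this link-time placement, the pools can simply be put into dedicated sections that the application's linker script maps to the desired memory regions. The sketch below assumes GNU-style section attributes and section names of our own choosing; the actual section names used by the PR may differ.

#include <stdint.h>

#define ACTIVATIONS_POOL_SIZE  (64 * 1024)   /* placeholder size */

/* 'activations' pool: the linker script can map .nn_activations to internal RAM,
 * external RAM, a DMA-capable region, etc. */
static uint8_t activations_pool[ACTIVATIONS_POOL_SIZE]
    __attribute__((aligned(8), section(".nn_activations")));

/* 'params' pool (the weights table normally emitted in network_data.c):
 * the linker script can keep .nn_params in FLASH or relocate it to RAM. */
const uint8_t params_pool[] __attribute__((section(".nn_params"))) = {
  0x00, 0x01, 0x02, 0x03,   /* ... weight bytes ... */
};

TVM itself needs no knowledge of the board's memory map; only the application's linker script does.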
The Runtime C API
In a typical embedded application use-case, an ML model is managed under the control of the main application, more precisely:
- the model is placed in memory (activations, weights, heap)
- the model is given inputs
- the model is run
- the outputs are recovered by the main application for further processing
We propose a slim runtime API for developing standalone embedded ML applications, drawn from our experience with the X-CUBE-AI and TensorFlow Lite for Microcontrollers tools. The objectives are:
- Efficient implementation in terms of performance and a minimal memory footprint.
- Memory management under the control of the main application. For example, any runtime memory allocation can be avoided by statically placing data in appropriate memory regions at link time. This enables easy experimentation with data placement, and flexibility.
- The possibility to build multi-model applications combining separately compiled models. These models can optionally share their activation and/or inputs/outputs memory.
- The possibility to include multiple instantiations of the same model in a single application.
- Enable a generic main application with all model-specific information available from the model implementation.
Our slim runtime API provides access to the TVM generated model implementation via a small model interface.
First, the ai_model_info descriptor is directly visible from the main application. It holds all information about the model. For example, such information includes the number of model inputs and outputs, the associated tensors, their types and shapes, etc. Details are available from this PR.
Several models can be linked together into a single application, each one with its own model descriptor.
A model descriptor is instantiated into a deployed model instance by calling the function:
ai_status ai_create (ai_model_info * nn, ai_ptr activations, ai_handle *handle);
The function returns a particular instance of a model, which is an opaque handle hiding the implementation details. During the ai_create call, the data fields of the activations and params tensors (their buffer addresses) are set up.
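For instance, a main application could use the built-in pool when the emitter instantiated one and otherwise allocate the 'activations' pool on the heap. A minimal sketch (the descriptor name network_model is an assumption; error handling abbreviated):

extern ai_model_info network_model;   /* emitted model descriptor (name assumed) */

ai_handle handle = NULL;

/* use the built-in 'activations' pool if present, otherwise allocate it on the heap */
ai_ptr activations = network_model.activations;
if (activations == NULL)
  activations = (ai_ptr)malloc(network_model.activations_size);

ai_status err = ai_create(&network_model, activations, &handle);
if (err != AI_STATUS_OK) {
  /* report the error and abort initialization */
}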
The size and memory address of the 'activations' and 'params' pools can be retrieved at runtime with:
uint32_t ai_get_activations_size (ai_handle handle);
ai_ptr ai_get_activations (ai_handle handle);
uint32_t ai_get_params_size (ai_handle handle);
const ai_ptr ai_get_params (ai_handle handle);
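For example, the application can report the memory budget of a freshly created instance (a minimal sketch, assuming ai_ptr is a pointer type):

printf("activations: %u bytes at %p\n",
       (unsigned)ai_get_activations_size(handle), (void *)ai_get_activations(handle));
printf("params:      %u bytes at %p\n",
       (unsigned)ai_get_params_size(handle), (void *)ai_get_params(handle));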
We propose to extend the DLTensor with additional quantization information:
typedef struct {
  /*!
   * \brief The TVM tensor.
   */
  DLTensor dltensor;
  /*!
   * \brief The quantization info, if quantized
   */
  ai_quantization_info * quant;
} ai_tensor;
The quantization information is needed by the main application for processing model inputs and outputs. There may be one additional use - debugging/monitoring the intermediate activations, but it is still unclear how useful this can be.
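As an illustration, input/output (de)quantization on the application side could look like the sketch below. Note that the exact layout of ai_quantization_info is not spelled out in this RFC; per-tensor scale and zero_point fields are assumed here purely for illustration.

#include <math.h>
#include <stdint.h>

/* Hypothetical: 'scale' and 'zero_point' fields are assumed, not taken from the PR. */
static int8_t quantize_input(float value, const ai_quantization_info *q)
{
  return (int8_t)lroundf(value / q->scale + (float)q->zero_point);
}

static float dequantize_output(int8_t value, const ai_quantization_info *q)
{
  return ((float)value - (float)q->zero_point) * q->scale;
}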
The main application can query a model instance for various pieces of information (a usage sketch follows this list), such as:
int32_t ai_get_input_size (ai_handle handle);
int32_t ai_get_output_size (ai_handle handle);
ai_tensor * ai_get_input (ai_handle handle, int32_t index);
ai_tensor * ai_get_output (ai_handle handle, int32_t index);
etc.
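For example, a generic main application can walk over all inputs of an instance and allocate any buffer the emitter left unassigned, reusing the get_dltensor and get_tensor_size helpers that appear in the example below (a sketch):

for (int32_t i = 0; i < ai_get_input_size(handle); i++) {
  ai_tensor *t = ai_get_input(handle, i);
  DLTensor *dl = get_dltensor(t);
  if (dl->data == NULL) {                     /* not allocated in a built-in pool */
    dl->data = (ai_ptr)malloc(get_tensor_size(t));
  }
}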
The ai_run function executes the TVM model graph, mimicking the GraphRuntime execution:
ai_status ai_run (ai_handle handle);
For the current STM32 target, this is a simple sequential single processor execution that calls each model kernel one at a time.
All API functions return an ai_status value and set the TVMLastError in case of a problem. This can be retrieved by the main application via:
const char * ai_get_error (ai_handle handle);
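A minimal error-handling sketch:

ai_status err = ai_run(handle);
if (err != AI_STATUS_OK) {
  /* report the last error recorded by the runtime */
  printf("ai_run failed: %s\n", ai_get_error(handle));
}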
The above runtime API is more explicit than the minimalist runtime C API proposed by the AoT, which consists mainly of two functions:
// Helper function to initialize a DLTensor
DLTensor TVMInitializeDLTensor(void *data, DLDataType* dtype, DLContext* ctx, int64_t* shape, int64_t num_dim);
// Helper function to run the `run_func` within the generated library network.o.
tvm_crt_error_t TVMRuntime_Run(tvm_model_t *model, DLTensor *inputs, int num_inputs, DLTensor *outputs, int num_outputs);
We make several observations:
- Full information about the compiled model is not available.
- Some useful functionality is missing, for example, the input/output quantization information.
- The memory allocator is not under the main application's control. In the embedded development flow this is a critical point: the memory management is typically handled by the main application.
- The inputs/outputs buffers cannot be shared with the activations storage, which can be important for reducing the memory footprint of small models.
In both RFCs, the model implementation is exposed to the main application via a slim API layer. However, this RFC's API is richer, giving more flexibility, in particular for memory management. Another minor difference is that we do not create or manage model tensors; they are built in with the model implementation. However, the API provides the main application with functions for accessing these tensors and managing their storage.
Example
ai_handle handle;   /* instance of the model */
ai_ptr data_in;     /* reference to the input buffer */
ai_ptr data_out;    /* reference to the output buffer */

int ai_init(void)
{
  /* AI associated configuration */
  ...
  /* discover an AI model from the current application */
  ai_model_info *nn = ...
  /*
   * ai_create calls the model-specific create function.
   */
  ai_status err = ai_create(nn, AI_MODEL_activations(nn), &handle);
  if (err != AI_STATUS_OK) {
    ...
  }
  /* handle is globally set, if no error */

  /*
   * Allocate input/output tensors
   */
  /* sanity IO number check */
  if (ai_get_input_size(handle) != 1 ||
      ai_get_output_size(handle) != 1)
    return -1;

  DLTensor *dl_tensor;

  ai_tensor *input_tensor = ai_get_input(handle, 0);
  dl_tensor = get_dltensor(input_tensor);
  /* built-in allocated tensor? */
  if (dl_tensor->data == NULL) {
    uint32_t bytes = get_tensor_size(input_tensor);
    dl_tensor->data = (ai_ptr)malloc(bytes);
  }
  data_in = dl_tensor->data;

  ai_tensor *output_tensor = ai_get_output(handle, 0);
  dl_tensor = get_dltensor(output_tensor);
  if (dl_tensor->data == NULL) {
    uint32_t bytes = get_tensor_size(output_tensor);
    dl_tensor->data = (ai_ptr)malloc(bytes);
  }
  data_out = dl_tensor->data;

  return 0;
}

void ai_deinit(void)
{
  /* release the allocated resources (if necessary) */
  ...
  /* deallocate the model instance */
  ai_status err = ai_destroy(handle);
  if (err != AI_STATUS_OK) {
    ...
  }
}

int main(void)
{
  /* MCU configuration */
  ...
  /* Model init */
  ai_init();

  /* Main process loop */
  while (cond) {
    /* 1 - Acquire, pre-process and fill the input buffer */
    acquire_and_pre_process_data(data_in);

    /* 2 - Call the inference engine */
    ai_status err = ai_run(handle);
    if (err != AI_STATUS_OK) {
      ...
    }

    /* 3 - Post-process the predictions */
    post_process(data_out);
  }

  ai_deinit();
}
Relation to the Project API RFC
This RFC has two components:
- The STM32 code emitter and its associated runtime support described above
- The STM32 demo application
The first component, the STM32 code emitter and its runtime, belongs to the compiler system (TVM) rather than to a separate standalone project. The code emitter takes a TVM Module and generates a C implementation of the model graph. It is tightly coupled to the TVM code base. The code emitter also depends on a particular runtime support, similarly to a C compiler, e.g. gcc relying on the gcc runtime libraries. Preferably, the objective here would be to have a generic runtime API that fits different target platforms and deployment scenarios, while the implementation would be target-specific (similar to the GraphRuntime). However, we can imagine a variety of deployment scenarios and execution models, which may require different runtime APIs. This point is still to be clarified.
The second component, the STM32 demo application, fits well with the Project API proposal, roughly following the ‘Standalone Demo Project Generator’ flow. It may be considered as implementing two of the Project API building blocks:
- A project template
- A transport layer
The demo application can be eventually integrated with the Project API, as well as within the upcoming AutoTuning infrastructure.
Conclusion
In this RFC we outlined a proposal for standalone code generation for ML models in embedded and bare-metal development environments. A PR targeting the STM32 microcontrollers is also available. The proposal falls in line with developments already underway in the TVM community:
- AoT code generation: We propose a complementary, more lightweight approach. C code for the model is generated, enabling a standard embedded development flow. We expose more model information to the main application compared to the AoT. Our lightweight approach can be used to quickly develop standalone code generators for new targets.
- Embedded Runtime C API: We propose a richer application API compared to the AoT, based on our experience with an industrial embedded development environment.
- Project Integration: We propose an STM32 demo application that has been tested on a number of ML models with the STM32 Discovery ARM-based development board. We propose to contribute several building blocks that can be integrated with the framework from the Project API.
Please share your thoughts/feedback!