[RFC] [uTVM] Embedded C Runtime Interface

Mousius · May 11, 2021, 1:11pm

Summary

This RFC outlines a set of additional APIs for the C Runtime to enable direct calling of an AOT micro entrypoint ([RFC] [uTVM] AOT optimisations for Embedded Targets) from a model descriptor which includes some model metadata, this is an alternative to the packed function API when working in embedded environments.

typedef struct {
	...metadata...,
	(TVMMicroEntryPoint*) entrypoint
} TVMModel; // Model descriptor to be used in static linkage
typedef struct {
	...,
	void** workspace;
} TVMContext; // Context configuration for minimal environments

// Execution function to execute a model in a given context
static inline int32_t TVMExecute(const TVMModel* model, void** inputs, void** outputs, TVMContext* context);
// Workspace setup function to assign the workspace to the context
static inline void TVMSetWorkspaces(TVMContext* context, void** workspace);
// Workspace size retrieval
static inline size_t TVMGetWorkspaceSize(const TVMModel* model, size_t workspace_index);

Motivation

As illustrated by @stoa in [RFC] Standalone Code Generation and C Runtime for STM32 bare-metal devices, an embedded specific entrypoint into TVM is desired and in order to access AOT from an embedded environment, it makes sense to provide a stable user facing API so as underlying changes in the output model can be transparent to system integrators. Providing stable interfaces for the facilities of the existing C runtime to an embedded environment provides similar guarantees and ease of use for those not using the packed function signature in TVM. This also provides TVM developers the ability to change the underlying micro runtime as TVM evolves with a stable outward facing interface.

One of the principles of the micro entrypoint is that it is providing a minimal amount of overhead when running in an embedded system, therefore a similarly minimal way to run a simple model is introduced which can be augmented by the wider C Runtime.

Guide-level explanation

This RFC aims to introduce the concepts to call the AOT micro entrypoint from an embedded application, as a starting point this proposal includes:

A model descriptor to give richer information about the model and wrap the micro entrypoint
A model context to store embedded environment information
Initial functions for managing memory workspaces

A user can include these as additional headers to allow a thin and stable interface for the AOT execution entrypoint, instead of having:

user_app.c

extern const TVMModel my_model;
my_model->entrypoint(inputs, outputs, my_context);

And having to understand the calling pattern of the AOT output, they can instead use:

user_app.c

#include "tvm_micro_runtime.h"
extern const TVMModel my_model;
TVMExecute(&my_model, inputs, outputs, &my_context);

This would be achieved by using minimal inline functions to mask the internal structure of TVMModel, such as:

tvm_micro_runtime.h

#include "tvm_micro_backend.h"
static inline int32_t TVMExecute(TVMModel* model, void** inputs, void** outputs, TVMContext* context) {
	return model->entrypoint(inputs, outputs, context);
}

tvm_micro_backend.h

typedef struct {
	...metadata...,
	(TVMMicroEntryPoint*) entrypoint
} TVMModel; // Model descriptor to be used in static linkage
typedef struct {
	...,
	void** workspace;
} TVMContext; // Context configuration for minimal environments

You can see this in two motivating user flows, compiling a model and then augmenting it with application level memory management.

Default Model Compilation

In this flow, the user is using tvmc to generate a model and an associated block of memory is allocated for it:

tvmc my_model.tflite --executor=aot --target=c --no-typed-operators --micro-entrypoint

For this flow, no additional context is required and the user can run the code on their device:

extern const TVMModel my_model;
void* inputs = {my_data};
void* outputs = {output_space};
TVMExecute(&my_model, inputs, outputs, NULL);

This is enabled by the use of of the a TVMModel structure generated by TVM to expose the AOT resources which can be constant and provided by the compiler output with relevant metadata for users to query.

Custom-Workspace Compilation

In this flow, the user is using tvmc to generate a model but specifies the memory available:

tvmc my_model.tflite --executor=aot --target=c --no-typed-operators --micro-entrypoint --with-memory=size=2048;access=rw

For this flow, the additional context is required to allow telling the runtime where the memory exists:

extern const TVMModel my_model;
TVMContext context;

void* inputs = {my_data};
void* outputs = {output_space};

TVMSetWorkspaces(&context, malloc(TVMGetWorkspaceSize(model, 0));
TVMExecute(&my_model, inputs, outputs, context);

This works because of the context which the model runs within, similar to the DLContext object but providing only information not hardcoded into the AOT output for a minimal runtime. By re-using the resource_handle pointer, the embedded context can also be used for operators run using packed functions and normal TVM buffers.

Reference-level explanation

In this RFC, we are primarily concerned with three areas; a model descriptor which the compiler generates, a context which the user can manipulate and an API file which binds the two together.

Model Descriptor

This is a formalisation of the model descriptor found in tvm/runtime/crt/internal/aot_executor/aot_executor.h, which can be used to describe a model via the APIs proposed:

typedef struct {
  uint32_t num_input_tensors;    /** Number of expected input tensors */
  uint32_t num_output_tensors;   /** Number of expected output tensors */
  size_t* workspace_size;         /** Size of workspace required for model to run */
  TVMMicroEntryPoint entrypoint; /** Generated model function, called through tvm_runtime_run */
} TVMModel;

This is the generated fixed model descriptor which users can address by name in the outputted code:

extern const TVMModel my_model;

Additional fields can be added here alongside suitable getters to retrieve information about a model. Notably, if the workspace isn’t specified by the user, it’ll default to being pinned within the generated code rather than being user accessible.

Context

Paired with the model descriptor, this provides any contextual information required to run the model, such as an application driven workspace configuration:

typedef struct {
	void** workspace; /** Pointers to different memory to use as a workspace */
} TVMContext;

Micro Entrypoint Runtime API

A header which can be added to the src/runtime folder alongside c_backend_api.h and c_runtime_api.h to provide the correct overlay to the matching C runtime. Using static inline functions each of the individual calls can be kept minimal and provide abstraction on top of the underlying model:

static inline int32_t TVMExecute(const TVMModel* model, void** inputs, void** outputs, TVMContext* context) {
	return model->entrypoint(inputs, outputs, context);
}
static inline size_t TVMGetWorkspaceSize(const TVMModel* model, size_t workspace_index) {
    return model->workspace_size[workspace_index];
}
static inline void TVMSetWorkspaces(TVMContext* context, void** workspace) {
	context->workspace = workspace;
}

Drawbacks

This starts to build up a minimal interface for interacting with TVM, which deviates from the main dynamic linked approach. It’s important to keep this layer as minimal as possible to allow other parts of TVM to continue doing the heavy lifting.

Combining this with the core C runtime means maintaining support across an incredibly broad range of devices from single core embedded devices to cloud environments and dynamically loading for autotuning.

Rationale and alternatives

Integrating with the current C Runtime gives us a way to assess and move forwards with embedded specific changes but alternatively a different runtime environment could be created entirely, this would mean reinventing every aspect of the runtime and would not leverage as much of the existing work.

Prior art

The setting up of an application workspace for a TVM model was first demonstrated in [RFC] Standalone Code Generation and C Runtime for STM32 bare-metal devices
AOT introduced the concept of a model descriptor in tvm/runtime/crt/internal/aot_executor/aot_executor.h within it’s introductory PR

Unresolved questions

Is this lightweight enough to allow usage of the C Runtime where useful to embedded applications?
Should we use the common C snake case style to better match embedded systems rather than the style used by the C runtime?

Future possibilities

By integrating and evolving the C Runtime API, TVM can be targeted across a broader range of targets than is currently possible with the current API. This section outlines some of the use cases we could extend this into, these are intended to be illustrative and will require their own RFCs.

Re-use of C Runtime APIs

Using the C runtime provides access to standard interfaces such as multithreading, with an RTOS specific implementation

Further Model Metadata

Any additional metadata can be added to the TVMModel structure as a minimal overhead, allowing for extension into a variety of use cases.

static inline int32_t TVMGetTVMVersionMajor(const TVMModel* model)  {
	return model->compiler_version_major;
}

Shared Workspaces

In this flow, the user disables the generation of a default memory block so as to allow for the application to define that memory:

tvmc my_model1.tflite --executor=aot --target=c --no-typed-operators --micro-entrypoint --with-memory=size=2048;access=rw

tvmc my_model2.tflite --executor=aot --target=c --no-typed-operators --micro-entrypoint --with-memory=size=4096;access=rw

And this can be loaded into the context for the executor to get the memory from:

TVMContext my_context;
size_t max_workspace_required = max(TVMGetWorkspaceSize(my_model1, 0), TVMGetWorkspaceSize(my_model2, 0));
TVMSetWorkspace(&my_context, malloc(max_workspace_required));

RTOS Device Integration

The context object can be defined per-platform to allow RTOS specific structures to be passed through to the operators:

struct device* my_accel = device_get_binding("ACC_0");
TVMSetDevice(&my_context, my_accel);

With an associated header-only platform wrapper, here is an example for the Zephyr RTOS:

#include <device.h>

typedef struct {
  void** workspace;
  struct device* device;
} TVMContext;

static inline void TVMSetDevice(TVMContext* context,  struct device* device) { 
   context->device = device;
}

Alongside new device drivers, this can provide an interface for operators to interact with RTOS drivers directly in the C runtime:

void TVMAcceleratorAccelerate(TVMContext* context, int32_t operation) {
	struct device* device = context->device;
	device_specific_rtos_call(device, operation);
}

Parameter Updating

By starting to uncover the alternative pathway into a more static execution environment we can start to provide methods of overwriting aspects of the model such as overwriting existing in-memory parameters:

static inline void TVMSetParameters(TVMContext* context, void** params) {
	context->params = params;
}

Which can then provide the potential for Over-the-Air updates of models to IoT devices.

Mousius · May 11, 2021, 1:12pm

cc: @areusch @giuseros @stoa @manupa-arm @grant-arm

areusch · May 12, 2021, 7:41pm

cc @MJKlaiber

@Mousius thanks for splitting this off into another RFC. I agree implementing a low-overhead embedded interface is super important. A couple thoughts:

At a high level, it would be great to explicitly spell out the entire interface we expect to implement here. I think it might be useful to include an entire main() program (either here or perhaps linked as a branch if it’s long) just to ensure we aren’t leaving anything out.

Runtime vs compile time knowledge

A key question we should tackle here is when model metadata should be available. Basically there are two scenarios:

S1. The user wants to use model metadata in the compilation flow.

S2. The user wants to write functions that make use of model metadata at runtime.

My opinion is we need to support both. So any metadata here e.g. stored in a struct should also be present in some JSON created as part of Model Library Format.

Model Input and Output Allocation

I think it’d be great to illustrate how we expect users to allocate model inputs and outputs. This is kind of there, but it would be great to propose the thing end-to-end. In particular, I’m curious how a user should size the tensors. One such possible sketch is to generate code like:

typedef struct {
    uint8_t input1[1 * 32 * 32 * 3];   // dimensions are examples
    int8_t input2[10 * 5 * 5 * 3];
} tvm_model_input_t;

This allows users with simple memory layout requirements to just declare the struct in the correct memory address space, and fill data as needed. It also serves as documentation-as-code of the required inputs and memory. We could move the buffer sizes to be constants, too. I want to ensure users retain control of all memory allocations, but we should design the API such that the typical case is very easy to use.

Custom-Workspace Compilation

I would take this a step further and ask if we can make the workspace size a #define constant such that the user could allocate the space at compile time. Or whether we expect this to live in the Model Library Format metadata as a means to access it at compile time. For example, instead of:

TVMSetWorkspaces(&context, malloc(TVMGetWorkspaceSize(model, 0));
TVMExecute(&my_model, inputs, outputs, context);

I’d like people to be able to:

uint8_t g_workspace[TVM_MODEL_NAME_WORKSPACE_BYTES];

void main() {
  TVMSetWorkspaces(&context, g_workspace);
}

Finally, is it possible that whatever context is needed to identify the workspace could optionally live in flash? This has some benefits e.g. in simple deployment scenarios when the workspace is allocated as global memory. In this case, it’s not possible to overwrite it with invalid pointers, which is a class of bugs that can be hard to trace down on embedded platforms

Context

Paired with the model descriptor, this provides any contextual information required to run the model, such as an application driven workspace configuration:
typedef struct {
	void** workspace; /** Pointers to different memory to use as a workspace */
} TVMContext;

I’d like to avoid general-purpose structs if possible, at least at this phase of the implementation. While I think it’s likely some top-level glue struct will eventually be a useful entry point for developers (and something is likely going to be needed as resource_handle, I think there are still quite a few things related to e.g. multi-core and accelerator dispatch yet to be decided. Rather than provide a sort of “kitchen sink” struct, I’d like to encourage us to define dedicated places for each orthogonal aspect of computing the I think it’d be great to make progress on the API in this RFC and tackle the accelerator dispatch question in a follow-on.

Generated APIs vs function pointers

When considering how to write user-facing APIs, I think we have a couple of choices:

G1. Generate a function call table e.g. TVMModel and write wrapper functions around it.

G2. Generate a wrapper function with a standard interface (or perhaps a standard templated model interface).

Here, I’m not necessarily proposing to generate a wrapper function with model-specific signatures (though that has been proposed elsewhere). Instead, I am just wondering whether it’s necessary to place the entrypoint function pointer in TVMModel. It seems like we may have some desire to generate model-specific C++ metadata outside of that generated by the AOT codegen, so I wonder if it’s worth it to just build a small codegen dedicated to this user-facing API now. Doing this would also remove the need for “accessor” functions such as TVMGetTVMVersionMajor.

Accelerator binding

If possible, I’d like to defer this to a separate RFC. I think there are lots of questions to be answered there and it’d be necessary to review a lifecycle diagram of the accelerator to do so. I think that would be better placed in a separate RFC.

Mousius · May 14, 2021, 5:04pm

Thanks for your reply @areusch, it’s interesting to read your thoughts. I think one of the core ideas you’re suggesting is a metadata header, rather than the struct? Similar to:

#define TVM_MODEL_my_model_WORKSPACE_0_SIZE
#define TVM_MODEL_my_model_TVM_VERSION_MAJOR 0
#define TVM_MODEL_my_model_TVM_VERSION_MINOR 8

I think we’ll likely want a “model header” for the extern TVMModel my_model eventually anyway, so I’d be interested to hear of use cases that aren’t viable with this as defines rather than in the struct; such use cases aren’t immediately coming to mind.

I’ll put it here as a reply as I think we need to go through the discussion here, I’ve integrated your suggestions around using defines for the size of the buffers required; Let’s use the following snippets for the workflows I described initially, it’s important to note the code generation of the initial workspace could be done by TVM in the first case.

Default Model Compilation

This flow assumes code generation has output a suitably sized workspace in the top level lib0.c or similar, so the user doesn’t specify it, this would be the starting point for someone building an app with the generated output:

#include "tvm_my_model.h" // Metadata and extern definition
#include "tvm_micro_runtime.h" // Runtime interface

uint8_t model_input[TVM_my_model_INPUT_SIZE]; // User defined input buffer

int main() {
    get_features_for_processing(&model_input); // Fill the input buffer
    int8_t* model_output[TVM_my_model_OUTPUT_SIZE]; // Output buffer
    void** inputs = {model_input}; // Single input
    void** outputs = {model_output}; // Single output
    TVMExecute(&tvm_my_model, inputs, outputs, NULL);
}

Custom-Workspace Compilation

This is where the user has passed an argument to tell code generation to not allocate a workspace. This is where it gets more interesting as you now need to specify the workspace:

#include "tvm_my_model.h" // Metadata and extern definition
#include "tvm_micro_runtime.h" // Runtime interface

uint8_t model_input[TVM_my_model_INPUT_SIZE]; // User defined input buffer
uint8_t workspace[TVM_my_model_WORKSPACE_0_SIZE]; // User created workspace of size specified by TVM

int main() {
    TVMContext my_context; // Information for how to run in this environment
    TVMSetWorkspaces(&my_context, {&workspace});

    get_features_for_processing(&model_input); // Fill the input buffer
    int8_t* model_output[TVM_my_model_OUTPUT_SIZE]; // Output buffer
    void** inputs = {model_input}; // Single input
    void** outputs = {model_output}; // Single output
    TVMExecute(&tvm_my_model, inputs, outputs, &my_context);
}

How this is represented in the JSON wasn’t something I was intending to address here but I agree that this should be accessible in the Model Library Format for other toolchains to interact with.

Could you expand on what you mean here? In a default flow I’d expect the workspace to be generated and codegen to pass that to the underlying operators, that means it would exist by default in flash for “Default Model Compilation”.

Everything under Future possibilities is illustrative of the API potential, so I agree to take conversation to a new RFC.

My core motivation here is to provide a stable API for interacting from a user application, so you can run:

TVMExecute(<model>, <inputs>, <outputs>, <context>);

Which should work as we move forwards with TVM and users recompile their models however the model is internally represented. Similarly, however we wish to structure the context is hidden from the user and we can augment it with the use-cases as they arrive, but still be able to run:

TVMSetWorkspaces(<context>, <list of workspaces>);

In such an environment, each piece of orthogonal information can exist on the TVMContext struct as we grow it. Do you have an alternative you can outline?

areusch · May 14, 2021, 9:09pm

@Mousius thanks for your reply! I think we are nearly aligned here.

I think we’ll likely want a “model header” for the extern TVMModel my_model eventually anyway, so I’d be interested to hear of use cases that aren’t viable with this as defines rather than in the struct; such use cases aren’t immediately coming to mind.

I think #define should definitely suffice for anything used in the downstream program. The main other use cases are those which consume this stuff outside of the C compiler (e.g. Python tools which analyze the generated code). Those I think would prefer JSON. I think so long as each parameter in the header is represented in the metadata.json, this concern is addressed. It would be nice to consider a standard way to map metadata.json keys to #define name.

Default Model Compilation

Some points of feedback:

I think we should also consider a use case with multiple model inputs and how a user should identify the order to populate inputs.
Similar suggestion for outputs.
I do think here, inputs and outputs could be a struct which would allow for:

inputs.my_input = model_input

rather than

inputs[TVM_my_model_my_input_INDEX] = model_input
Presumably TVM_my_model_INPUT_SIZE would similarly need to include the name of the input.
TVMExecute’s return code should be checked. It would be great to consider the error case here too (generally, I imagine it’s just printing the error code and/or TVMGetLastError() (see discussion here) and cc @stoa.

Custom Workspace Compilation

I think this seems fine, don’t have anything to add over the other thing here (except see comment on TVMContext below).

Other commentary

areusch:

So any metadata here e.g. stored in a struct should also be present in some JSON created as part of Model Library Format.

How this is represented in the JSON wasn’t something I was intending to address here but I agree that this should be accessible in the Model Library Format for other toolchains to interact with.

Okay, we can iterate on this in future RFCs/PRs.

Could you expand on what you mean here? In a default flow I’d expect the workspace to be generated and codegen to pass that to the underlying operators, that means it would exist by default in flash for “Default Model Compilation”.

Here I just mean the TVMContext struct itself. I’m not sure its necessary to place it on the stack when everything is a global.

My core motivation here is to provide a stable API for interacting from a user application, so you can run:
TVMExecute(<model>, <inputs>, <outputs>, <context>);
Which should work as we move forwards with TVM and users recompile their models however the model is internally represented. Similarly, however we wish to structure the context is hidden from the user and we can augment it with the use-cases as they arrive, but still be able to run:
TVMSetWorkspaces(<context>, <list of workspaces>);
In such an environment, each piece of orthogonal information can exist on the TVMContext struct as we grow it. Do you have an alternative you can outline?

I think that’s a reasonable goal. I’m okay keeping TVMContext so long as we add internal organization, then:

struct TVMContext {
  struct TVMMemoryLayout {
    uint8_t* workspaces;
  } memory_layout;
  struct TVMDeviceMapping {
    tvm_model_name_device_id_t device_id;
    tvm_device_t device_instance;
  } *devices;
};

Mousius · May 20, 2021, 3:18pm

Great to see we’re getting there @areusch , though keen to get feedback from others.

Just to clarify where you’re heading here, I’m assuming the layers it’d go through are something like:

main.c

TVM_my_model_input inputs = {
    .cats = &model_input_cats,
    .dogs = &model_input_dogs
}; // Double input
TVM_my_model_output outputs = {&model_output}; // Single output
TVMExecute(&tvm_my_model, &inputs, &outputs, NULL);

tvm_micro_runtime.h

static inline int32_t TVMExecute(TVMModel* model, void* inputs, void* outputs, TVMContext* context) {
    return model->entrypoint(inputs, outputs, context);
}

generated_entrypoint.c

static int32_t entrypoint(TVMModel_My_model_input* input, TVMModel_My_model_output* output, TVMContext* context) {
    return tvm_my_model_run_func(input->cats, input->dogs, output->output_name, context);
}

Which seems neat enough, and I’m happier with having some types involved at the top level for users. I’m assuming we’d also want to include this in the model header file we’d generate which would make it slightly more than just metadata, but I would prefer that over several headers for a model.

Yip, I didn’t cover error handling in my example but the idea is just to propagate any returns from the AOT run_func back up the chain for the user to collect regardless of how that error is presented internally in TVM? If we need to check the actual error you can then use the other error functions?

Ah, gotcha, I think that’s best left up to the application developer as to where and how they initialise it.

Are you ok with deciding the exact level of internal organisation required at the code review stage for each incremental addition?

Revised Examples

Based on your feedback @areusch, I’ll provide some further examples of what I think we’ve discussed. I’ve used names for the inputs, outputs and workspaces, it’s worth considering whether we want to provide the facility for opting out of that for single input/output models simplicity?

Generated Model Header

Example for illustration:

tvm_my_model.h

// Input metadata
#define TVM_my_model_INPUT_CATS_SIZE 1234
typedef struct {
    uint8_t* cats;
} TVM_my_model_input;

// Output metadata
#define TVM_my_model_OUTPUT_CLASS_SIZE 12
typedef struct {
    int8_t* class;
} TVM_my_model_output;

// Model reference
extern const TVMModel my_model;

Default Model Compilation

#include "tvm_my_model.h" // Metadata, structs and extern definition (generated)
#include "tvm_micro_runtime.h" // Runtime interface (not generated)

uint8_t model_input_cats[TVM_my_model_INPUT_CATS_SIZE]; // User defined input buffer

int main() {
    get_some_features_for_processing(&model_input_cats); // Fill the input buffer

    int8_t model_output[TVM_my_model_OUTPUT_CLASS_SIZE]; // Output buffer
    TVM_my_model_input inputs = {&model_input_cats}; // Single input
    TVM_my_model_output outputs = {&model_output}; // Single output
    TVMExecute(&tvm_my_model, &inputs, &outputs, NULL);
}

Custom-Workspace Compilation

#include "tvm_my_model.h" // Metadata, structs and extern definition (generated)
#include "tvm_micro_runtime.h" // Runtime interface (not generated)

uint8_t model_input_cats[TVM_my_model_INPUT_CATS_SIZE]; // User defined input buffer
uint8_t workspace[TVM_my_model_WORKSPACE_SRAM_SIZE]; // User created workspace of size specified by TVM

int main() {
    TVMContext my_context; // Information for how to run in this environment
    TVMSetWorkspaces(&my_context, {&workspace});

    get_some_features_for_processing(&model_input_cats); // Fill the input buffer

    int8_t model_output[TVM_my_model_OUTPUT_CLASS_SIZE]; // Output buffer
    TVM_my_model_input inputs = {&model_input_cats}; // Single input
    TVM_my_model_output outputs = {&model_output}; // Single output
    TVMExecute(&tvm_my_model, &inputs, &outputs, &my_context);
}

Multiple Input Model

#include "tvm_my_model.h" // Metadata, structs and extern definition (generated)
#include "tvm_micro_runtime.h" // Runtime interface (not generated)

uint8_t model_input_cats[TVM_my_model_INPUT_CATS_SIZE]; // User defined first buffer
uint8_t model_input_dogs[TVM_my_model_INPUT_DOGS_SIZE]; // User defined second buffer

int main() {
    get_some_features_for_processing(&model_input_cats); // Fill the first input buffer
    get_some_more_features_for_processing(&model_input_dogs); // Fill the second input buffer

    int8_t* model_output[TVM_my_model_OUTPUT_CLASS_SIZE]; // Output buffer
    TVM_my_model_input inputs = {
        .cats = &model_input_cats,
        .dogs = &model_input_dogs
    }; // Double input
    TVM_my_model_output outputs = { .class = &model_output }; // Single output
    TVMExecute(&tvm_my_model, &inputs, &outputs, NULL);
}

Mousius · June 18, 2021, 2:30pm

Apologies @areusch, I’ve left this thread alone for a rather long time and wanted to respond to your comment in the memory planning thread to do with generating structs for context. As we discussed on Discord, this inspired me to revisit your previous user-facing API comment:

Looking at this API now, it is definitely feeling like we’d be limiting how user-friendly we can make this API if we use the originally suggested interface. As such, I’ve proposed an implementation in https://github.com/apache/tvm/pull/8280 which uses a standard interface with the four currently known slots (inputs, outputs, memory, context) in a model entrypoint function (currently tvm_default_run as the model is named tvm_default until https://github.com/apache/tvm/pull/8014 lands):

/*!
 * \brief TVM default model input tensors 
 */
struct tvm_default_inputs {
        void* i;
};

/*!
 * \brief TVM default model output tensors 
 */
struct tvm_default_outputs {
        void* output;
};

/*!
 * \brief TVM default model memory blocks 
 */
struct tvm_default_memory {
};

/*!
 * \brief TVM default model device configurations 
 */
struct tvm_default_devices {
};

/*!
 * \brief TVM default model run function 
 * \param inputs Input tensors for the model 
 * \param outputs Output tensors for the model 
 * \param memory Memory blocks for the model to use 
 * \param devices Devices for the model to use 
 */
int tvm_default_run(
        struct tvm_default_inputs* inputs,
        struct tvm_default_outputs* outputs,
        struct tvm_default_memory* memory,
        struct tvm_default_devices* devices
);

The other observation I had from [RFC] Unified Static Memory Planning was that the memory could be specified a bit differently:

S1. Specify each type of memory such as parameters and workspaces with differing structures
S2. Specify each type of memory such as parameters and workspaces as nested structures
S3. Specify the names for these memories only and the user already has the knowledge as to which block they wanted to use

S1 and S2 introduce the difficulty of what do we name each piece and how do we make it easy for users to setup their memory, so for the above example I’ve gone with S3 - it seems to be more straight forward but would like input from @manupa-arm as to whether this will be better or worse

areusch · June 18, 2021, 4:39pm

@Mousius ah I had also forgotten to reply to your last post.

I’ve taken a look at your PR and will post some initial feedback. Overall I think this approach is cleaner than the previous one–it removes the adaptation layer between TVM and C, and by doing so, encourages us to minimize the embedded-facing API.

One question: with the USMP proposal, what should be the input to the TIR main_func? It seems like the ultimately-proposed thing was something like:

int32_t tvmgen_my_model_main_func(void* inputs, void* outputs, void* params1, void* params2, void* workspace1, void* workspace2) {
   tvmgen_my_model_op1(inputs[0], params1[0], &workspace1[1024]);
   tvmgen_my_model_op2(inputs[1], params1[128], &workspace1[1024], outputs[0]);
}

This may mean that we need to translate between tvmgen_my_model_input_t and the array-of-input-data vector inputs when USMP lands. I wrote everything below before considering this; just noting that the offset of struct members isn’t guaranteed, so not sure how interchangeable tvmgen_my_model_input_t and void* inputs is; on the other hand, when LLVM-generated bitcode indexes into a struct, (I believe) it does so by offset; so we need to ensure generated code either a) accepts arrays or b) can predictably generate offsets in the same way we expect downstream c compilers to.

cc @manupa-arm

Regarding the input/output/memory/devices split:

I can see the purpose for input, output, memory at this point, and I expect devices will have one, but right now it’s not super clear
It would be great (in the PR) to document exactly what goes in each generated struct (e.g. in terms of structs or C types).
I wonder if we should defer the devices struct until we have defined a model for devices in µTVM?

The other observation I had from [RFC] Unified Static Memory Planning was that the memory could be specified a bit differently:

S1. Specify each type of memory such as parameters and workspaces with differing structures

S2. Specify each type of memory such as parameters and workspaces as nested structures

S3. Specify the names for these memories only and the user already has the knowledge as to which block they wanted to use

I think the challenge here is deciding how to place parameters. At present, a global symbol is defined for each parameter, and that is used in code to reference the data. Adding an entry in memory would be superfluous here. Additionally, since parameters need to be initialized, the C interface would need to provide e.g. a #define which expands to the C array or C-string initialization of the parameter data.

Finally, since parameter parsing was previously demonstrated to be a bottleneck in the downstream compilation time, I think it’s likely that users would prefer the llvm target to avoid this problem; and with llvm, parameter placement is more difficult (since parameters are simplified during TVM compilation, the user would need to define their placement either universally e.g. --params-section or based on some property of the parameter known in advance to the user e.g. param_placement={"cpu": {"section": "my_param_section"}, "accel": {"section": "accel_params"}}. Such more complex use cases are why I’ve included params in Model Library Format (downstream project generators can then place them as needed).

So in conclusion, I think we should treat memory as just for workspace memory, and leave params as-is for now.

Programmatic inference

One thing with this approach is: suppose there is a case where someone wants to drive inference programmatically. As an example API, consider the existing Module-based Model Runtime Interface:

int32_t tvmgen_my_model_set_input(const char* name, TVMValue value);

This API could be implemented on top of the struct API using generated code as follows:

First let’s get rid of the C-string and presume there is a mapping from string to int: int32_t tvmgen_my_model_set_input(tvm_input_name_t name, TVMValue value);

Next, we could emit code to generate these assignments using e.g. memcpy:

size_t tvmgen_my_model_input_offsets = {&((TVM_my_model_input*) 0)->model_input_cats, &((TVM_my_model_input*) 0)->model_input_dogs};

int32_t tvmgen_my_model_get_input_index(const char* name, int* index) {
  // string lookup
}

int32_t tvmgen_my_model_set_input(tvmgen_my_model_input_index_t index, TVMValue value, void* model_input_struct) {
  TVM_my_model_input* input = model_input_struct;
  memcpy(((uint8_t*) input) + tvmgen_my_model_input_offsets[name], value.v_handle, sizeof(value.v_handle));
}

const tvm_model_descriptor_t tvmgen_my_model_descriptor = {
    .input_struct_size_bytes = sizeof(tvmgen_my_model_input_t),
    .set_input = tvmgen_my_model_set_input,
    .get_input_index = tvmgen_my_model_get_input_index
};

And code could use this like so:

typedef struct {
   void* input_struct;
   void* output_struct;
   void* memory_struct;
} model_state_t;

model_state_t g_model_state = {0, 0, 0};
void tvm_prog_init_model(tvmgen_model_descriptor_t* descriptor) {
    model_state.input_struct = malloc(descriptor->input_struct_size_bytes);
    // ...
}

int32_t tvm_prog_set_input(const char* name, TVMValue value) {
    int index;
    descriptor->get_input_index(name, &index);
    descriptor->set_input(index, value);
}

I’m not so much trying to propose we nail this down exactly here, just thinking aloud how we might do this. I think this approach would work and allow us to continue using the structs–this wouldn’t need to be emitted in all use cases, just when programmatic access was desired. let me know if this seems reasonable to you.

cc @stoa

Replies to previous post

Yip, I didn’t cover error handling in my example but the idea is just to propagate any returns from the AOT run_func back up the chain for the user to collect regardless of how that error is presented internally in TVM? If we need to check the actual error you can then use the other error functions?

That should work–generally, as long as we’re returning the exact int32_t error code up, I think it should be sufficient.

Are you ok with deciding the exact level of internal organisation required at the code review stage for each incremental addition?

(if we are still using TVMContext, not sure now) I think so–I think it will make more sense as we are adding things.

I’ve used names for the inputs, outputs and workspaces, it’s worth considering whether we want to provide the facility for opting out of that for single input/output models simplicity?

I don’t really favor this too much–it’s sort of an API shortcut that is hard to implement in C and, while it does make the getting-started flow a bit easier, it adds potential confusion later on.

I’m assuming we’d also want to include this in the model header file we’d generate which would make it slightly more than just metadata, but I would prefer that over several headers for a model.

I agree generating a single header file makes sense to me. #include a header should not modify the program size.

manupa-arm · June 21, 2021, 2:31pm

Hi @areusch,

The following is bit different from what we had in mind.

areusch:

int32_t tvmgen_my_model_main_func(void* inputs, void* outputs, void* params1, void* params2, void* workspace1, void* workspace2) {
   tvmgen_my_model_op1(inputs[0], params1[0], &workspace1[1024]);
   tvmgen_my_model_op2(inputs[1], params1[128], &workspace1[1024], outputs[0]);
}

Our thoughts are the offsets will be in the operator implementation. First, let me state what we had in mind, using your example :

int32_t tvmgen_my_model_main_func(void* input0, void*  input1, void* output0, void* params1, void* params2, void* workspace1, void* workspace2) {
   tvmgen_my_model_op1(input0, params1, workspace1);
   tvmgen_my_model_op2(input1, params2, workspace1, output0);
}

The operator TIR code will grab the data from the designated offsets when the buffers are passed in. In the USMP RFC, I just corrected a typo (where it had input[0] along with inputs.input0). The idea there is as follows :

static int32_t entrypoint(TVMInputs_my_model* inputs, 
                          TVMOutputs_my_model* outputs,
                           ...){
     return my_model_main(inputs.input0, 
                          outputs.output0,
                          ...);
}

The unpacking of inputs struct to individual tensors could happen here.

I dont think this is the case as TIR main_func (or lowered C/LLVM function does not need arrays as inputs and outputs; they are accessed as individual tensors).

I think C compiler do respect the order in the struct so the offsets should work in the LLVM IR.

areusch:

The other observation I had from [RFC] Unified Static Memory Planning was that the memory could be specified a bit differently:

S1. Specify each type of memory such as parameters and workspaces with differing structures

S2. Specify each type of memory such as parameters and workspaces as nested structures

S3. Specify the names for these memories only and the user already has the knowledge as to which block they wanted to use

I think the challenge here is deciding how to place parameters. At present, a global symbol is defined for each parameter, and that is used in code to reference the data. Adding an entry in memory would be superfluous here. Additionally, since parameters need to be initialized, the C interface would need to provide e.g. a #define which expands to the C array or C-string initialization of the parameter data.

Finally, since parameter parsing was previously demonstrated to be a bottleneck in the downstream compilation time, I think it’s likely that users would prefer the llvm target to avoid this problem; and with llvm, parameter placement is more difficult (since parameters are simplified during TVM compilation, the user would need to define their placement either universally e.g. --params-section or based on some property of the parameter known in advance to the user e.g. param_placement={"cpu": {"section": "my_param_section"}, "accel": {"section": "accel_params"}}. Such more complex use cases are why I’ve included params in Model Library Format (downstream project generators can then place them as needed).

So in conclusion, I think we should treat memory as just for workspace memory, and leave params as-is for now.

Once the USMP lands, all the parameters will be pooled depending on the parameter buffers provided by the user. The parameters will be pooled depending on device accessibility that is passed on with them.

I think going with C, we would still like to expose the parameter pointers and workspace pointers to be provided from the application layer. Having exposed the pointers in the entry_point, we could still provide #defines, as you mention, in a metadata.h to aid initialization.

Yes, going forward, when we want LLVM and pinning of large (though not sure how large we could be in the embedded space) constants we could provide an exception for this route. The only sensible way, I am seeing this is we could generated user-given symbol names (the parameter-buffer name) for parameter buffers in the metadata.o – in a way a linker script could place them accordingly.

I believe @Mousius has more to add here.

Mousius · June 21, 2021, 3:44pm

Thanks @manupa-arm, I think that clears up a lot of things

I included some generated documentation, but most of the values are pointers as of now. We could potentially write:

void* cats; // Pointer to memory containing cats input of shape (4,4)

What do you think @areusch? I haven’t included the generated size macros in my PR which would probably help as well.

I can see some value in not generating it now as we don’t know the structure of it but given we want to drive the models programmatically we’d need an interface consistent between models, otherwise you can’t have a generic function which calls into it such as:

model_desc->run(
  model_state->inputs, 
  model_state->outputs,
  model_state->memory,
  model_state->devices
);

I don’t see this as an issue as it’s just a blank struct right now which can be filled with whichever implementation we require of it? Allowing the user to pick an accelerator and pass it in from an RTOS driver is a clean way of providing it to TVM rather than giving all that information to TVM behind the scenes; something similar to Zephyr’s device_get_binding for whichever device you happen to be targeting for example.

I’d propose that TVM doesn’t get into attempting to define memory sections itself as there’s more than one linker and different compiler support so that’s up to the application developer or RTOS integrations; as @manupa-arm mentions, we can provide consistent symbol names for this but the application developer should know their linker and system configuration better. In the C case it’d be nice not to have users both passing pointers in at the top and having to write a linker script so I would suggest we keep it consistent for now with pointers to the various memories; we can consider the LLVM case separately. Would this be a good case for something like the project API to generate an example app for a known system?

As a side note, I had some early thoughts towards over the air updates for internet of things devices which would require the parameters be customised by the user so it’d be useful to be able to swap that pointer easily rather than assume the parameters are always constant in the flash.

I’m agreed with the approach of illustrating that this is achievable using this API, I’d be saddened if we ruled out such use cases by accident. One alternative I’d put forward is that as the structs are themselves quite minimal, we could generate everything in a single header and leave the packed/unpacked run function available for such extension - I was considering if we should move the entrypoint to being a static function within the header as well?

areusch · July 28, 2021, 7:50pm

@manupa-arm @mousius some follow-up here since I’m reviewing https://github.com/apache/tvm/pull/8280 more now:

Our thoughts are the offsets will be in the operator implementation.

What’s the difference between computing these in the main_func vs in the operator implementations. A disadvantage is that when invoking any operator implementation standalone, you need to have the full workspace allocated. This could make it harder to chase memory problems (e.g. you can’t reduce the amount of memory allocated and delivered to a standalone operator)

areusch:

This may mean that we need to translate between tvmgen_my_model_input_t and the array-of-input-data vector inputs when USMP lands.

I dont think this is the case as TIR main_func (or lowered C/LLVM function does not need arrays as inputs and outputs; they are accessed as individual tensors).

I think the struct TVMInputs_my_model approach is the better user-facing API. I would like to ensure that we can support a use cases which may implement some type of generic AOT driver function. Those use cases may prefer to operate in terms of void* inputs[] and void* outputs[]; I think these actually may be laid out in memory identically to TVMInputs_my_model, and so only a mapping from param_name → index is needed. If that turns out not to hold, we may need to map param_name → byte offset (into TVMInputs_my_model). I think this should be fine, just wanted to describe this use case to ensure you’re ok with it. It should not affect those applications which don’t want to use a generic driver.

I think going with C, we would still like to expose the parameter pointers and workspace pointers to be provided from the application layer. Having exposed the pointers in the entry_point, we could still provide #defines, as you mention, in a metadata.h to aid initialization.

I think this is fine. The exposed thing should include a pointer table, IMO, so that individual parameters could be overridden at runtime if necessary. wdyt?

I’d propose that TVM doesn’t get into attempting to define memory sections itself as there’s more than one linker and different compiler support so that’s up to the application developer or RTOS integrations; as @manupa-arm mentions, we can provide consistent symbol names for this but the application developer should know their linker and system configuration better

Since we are likely preferring to emit LLVM object files, I do think we have to be somewhat concerned about sections and pass a degree of configuration up from LLVM API to the user. Section and alignment seem like reasonable parameters to expose. The use case I want to support is writing a script to generate a project based on an arbitrary MLF using some embedded platform.

areusch:

It would be great (in the PR) to document exactly what goes in each generated struct (e.g. in terms of structs or C types).

I included some generated documentation, but most of the values are pointers as of now. We could potentially write:
void* cats; // Pointer to memory containing cats input of shape (4,4)
What do you think @areusch? I haven’t included the generated size macros in my PR which would probably help as well.

Yeah I think that’s fine–I meant more like struct-level \brief comments (e.g. for inputs: \brief Contains a pointer to the tensor data for each input to my_model)

Mousius · July 29, 2021, 10:18am

I believe this should function as you’re intending and I do believe this is a valid use-case. My only concern is with relying on the structures matching the arrays memory layout but you’ve articulated a further workaround to that if required. If it becomes cumbersome, this is itself a thin layer over the AOT main and we could toggle a pointer array, or such generic use cases can instead use the packed API?

I did envisage something like:

struct tvmgen_my_model_memory {
  void* sram, 
  void* flash,
  void* params
};

Where-as I believe you want to fine grain the parameters as well:

struct tvmgen_my_model_memory {
  void* sram, 
  void* flash,
  void* param0,
  void* param1
};

I don’t see an issue with breaking down the parameters further so we can set them for individual parts of the model. This could be useful for partial OTA updating

I think you’d still require a linker script to move the relevant sections to the right memory locations, so it should be possible to generate a linker script to instead move specific symbols and enforce alignment?

I’ve put an example output in the PR, it has struct-level brief such as this, so hopefully that’s good

stoa · September 13, 2021, 3:45pm

Hello, I have an impression that we do not allow the inputs/outputs to live and share the memory with the internal activations. This may be important for small models. Correct me if I am wrong here ?

areusch · September 13, 2021, 5:29pm

@stoa inputs cannot but theoretically outputs could be shared so long as this doesn’t enlarge the size of the output tensors.