Implementing AOT in TVM

Motivation

In its current state, the TVM compilation flow produces two (or optionally three) artifacts:

  1. The library containing the operators
  2. The parameters of the network (i.e., the weights of the model). This has been made optional with the --link-params option to the host target.
  3. A json file describing the control flow of the network. i.e., which operators should be run.

Generally speaking, the TVM runtime consists of an interpreter that reads the json file (3) and - optionally - the weights (2) and calls the operators contained in the library (1). This mechanism is described in the following picture:

While the params.bin is optional (i.e., we can omit it if we provide a --link-params flag to the compilation, see this RFC), the json file network.json is mandatory.

To be clear, there is currently no flow that allows the user to avoid providing the json file to the runtime.

This is a problem for two main reasons:

  • This workflow is very hard to implement on a micro-controller, since memory is usually a costly resource in embedded environments, and the json file is usually quite large.
  • We have a split memory allocation in the current TVM stack, where inter-operator memory is managed at the json/relay level while intra-operator memory is managed at the TIR level.

We at Arm are working on an AOT flow to get rid of the Json, and to transform the above graph into the following single-artifact scenario:

The user can optionally specify the name of the network, so that we can have a network_resnet.o, network_mobilenet.o, etc… For this RFC we will refer to a generic network.o (as shown in the picture).

The idea in the above image is that the network.o will expose a runner function which will take care of calling the different operators in the same library. We will temporarily call this function run_func, but naming is something that we will need to define.

The aim of this RFC is to provide a source of discussion on different topics that can help us through the development of this feature. The main topics of interest are:

  • Code generation (i.e., IRModule + runtime::module generation)
  • Runtime API

Code generation

The aim of code generation is to go from a Relay graph to a runtime::module containing the control execution of the graph. We split this process in two different parts:

  • runtime::Module generation
  • runtime::Module bundling

TIR code generation

In recent times TIR has been augmented with runtime functionalities (e.g., the possibility to return a value), which makes it ready to handle runtime code such as creating NDArrays and NDShapes, calling functions, returning values, etc.

This solution provides several benefits:

  • We would be reusing a lot of TIR functionalities (less code duplication)
  • We can set the foundations to implement a global memory planner, thus reducing the memory footprint of the network (which is quite valuable for microcontrollers)

The signature of the run_func generated by the TIR code generator would be the same as that of any packed function:

int32_t run_func(void* args, void* arg_type_ids, int32_t num_args, void* out_ret_value, void* out_ret_tcode, void* resource_handle);
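
To make the signature easier to read, here is an illustrative (non-normative) sketch of what the void* parameters carry, following the usual TVM packed-call convention:

#include <tvm/runtime/c_runtime_api.h>

// Illustrative only: spelled with explicit types, the packed signature above
// reads roughly as follows. The args array holds one TVMValue per input/output
// tensor and arg_type_ids holds kTVMDLTensorHandle for each of them.
int32_t run_func(TVMValue* args, int* arg_type_ids, int32_t num_args,
                 TVMValue* out_ret_value, int* out_ret_tcode,
                 void* resource_handle);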

In the following sub-sections we highlight some details about the code generation process.

Unpacked calls

Our code generator would issue tir.extern calls, manually packing/unpacking the arguments for the different operators contained in the library (very similar to what happens in the lower_builtin pass). In this way, we are de facto bypassing the function registry.
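
As a purely illustrative sketch (the operator name and argument layout below are made up), the generated run_func would reference the operator symbols directly and pack their arguments by hand, instead of looking them up by name through the registry:

#include <tvm/runtime/c_runtime_api.h>

// Operator symbol living in the same generated library (hypothetical name).
extern int32_t fused_layout_transform_2(void* args, void* arg_type_ids, int32_t num_args,
                                        void* out_ret_value, void* out_ret_tcode,
                                        void* resource_handle);

static int32_t call_op_directly(DLTensor* in, DLTensor* out) {
  // Manual packing of the arguments, similar to what the lower_builtin pass does.
  TVMValue values[2] = {{.v_handle = in}, {.v_handle = out}};
  int type_codes[2] = {kTVMDLTensorHandle, kTVMDLTensorHandle};
  TVMValue ret_value;
  int ret_tcode;
  // Direct extern call: no string lookup in the function registry at runtime.
  return fused_layout_transform_2(values, type_codes, 2, &ret_value, &ret_tcode, NULL);
}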

Runner descriptor

While it would be possible to directly expose run_func in the generated library, we would rather wrap this function in a descriptor, i.e., a struct with the following fields:

typedef struct {
  int32_t (*run_func)(void* args, void* arg_type_ids, int32_t num_args, void* out_ret_value, void* out_ret_tcode, void* resource_handle);
  uint32_t num_input_tensors;
  uint32_t num_output_tensors;
} tvm_model_t;

Having run_func wrapped in a descriptor provides several benefits:

  • We can use the fields of the descriptor as a means for the application to check the sanity of the arguments passed to the run_func (a sketch of such a check is shown after this list)
  • This will be the only entry point that needs to be exposed by the library network.o
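
A minimal sketch of the argument check mentioned above, assuming a shim-layer helper (the helper name is illustrative and not part of the proposal):

#include <tvm/runtime/c_runtime_api.h>

// tvm_model_t is the descriptor struct defined above.
int32_t run_with_checks(tvm_model_t* model,
                        DLTensor* inputs, uint32_t num_inputs,
                        DLTensor* outputs, uint32_t num_outputs) {
  if (num_inputs != model->num_input_tensors ||
      num_outputs != model->num_output_tensors) {
    return -1;  // refuse to call run_func with a mismatched argument count
  }
  // ... pack inputs/outputs into TVMValue arrays and invoke model->run_func ...
  return 0;
}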

Name mangling

TVM codegen should not invade the application symbol namespace; instead it should use the “implementation defined” namespace, which in C and C++ like languages (or indeed in ELF symbol land) is any symbol name prefixed with an underscore (_). Further, symbol names should be unique so that multiple compiled models can be statically linked into the same application. This can be achieved with the following changes:

  • The user will specify a name for the network to compile, and the global names will be suffixed with this name
  • The inner operators and the run_func will be declared “static” within the library. In this way we shield them from the outside world and we only expose the tvm_model_t entry point (which will be properly suffixed). A hypothetical sketch of the resulting symbol layout is shown below.
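
For example, for a network compiled with the name "resnet" the symbol layout could look roughly as follows (all names below are made up for illustration; tvm_model_t is the descriptor defined earlier):

#include <stdint.h>

// Operators and the runner are static, so they stay out of the application's
// symbol namespace.
static int32_t fused_layout_transform_2(void* args, void* arg_type_ids, int32_t num_args,
                                        void* out_ret_value, void* out_ret_tcode,
                                        void* resource_handle) { return 0; }

static int32_t run_func_resnet(void* args, void* arg_type_ids, int32_t num_args,
                               void* out_ret_value, void* out_ret_tcode,
                               void* resource_handle) { return 0; }

// The only external symbol is the suffixed descriptor.
tvm_model_t network_resnet = {
    .run_func = run_func_resnet,
    .num_input_tensors = 1,
    .num_output_tensors = 1,
};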

Parameters

For now we will assume that the parameters are linked within the library: in other words the flag --link-params is mandatory with the AOT flow.

Bundling all the modules together

In our design we will store the generated IRModule as an additional field of the LoweredOutput data structure.

struct LoweredOutput {
  std::string graph_json;
  Map<String, IRModule> lowered_funcs;
  IRModule aot_runtime; // The generated module that contains the run_func control code
  Array<tvm::runtime::Module> external_mods;
  std::unordered_map<std::string, std::pair<int, const tvm::runtime::NDArray>> params;
};

We can then build the aot_runtime module and pass it to the bundling step (which today calls CreateMetadataModule):

auto aot_ir_module = getAOTModule();
auto aot_mod = tvm::build(aot_ir_module, target_host, target_host);
ret_.mod = tvm::codegen::CreateAOTModule(ret_.params, ret_.mod, ext_mods, aot_mod, GetTargetHost());

In the above snippet of code, the function CreateAOTModule will take care of adding the run_func definition in the library and will import the other modules (so that run_func will be common to all the modules).

Target host specification

To kick off the AOT flow we propose to add an additional runtime, namely aot, to the list of existing runtimes available in the target host.

The target host to generate an AOT-ready library would look like:

target_host = 'c --runtime=aot --link-params'

Please note that we don’t need to specify --system-lib anymore, since the system library won’t be included in the generated library.

Runtime changes

This section is about how we can expose to the user the content of the generated library network.o.

Our idea is to create an additional aot_runtime folder which would live next to the crt and graph runtime folders. In this way all the other flows will still be available and unchanged, and in the meanwhile we can gradually extend the aot runtime to support different use cases.

Before we move on in this section, let’s clarify the difference between the aot_runtime and the graph_runtime:

  • Graph runtime - is the runtime used to read the json and to call the operators within the library.
  • AOT runtime - this represents a shim layer containing helper functions to carry out the execution of the network

Graph runtime removal

The graph runtime in the current state takes care of:

  1. Initializing the Function Registry
  2. Initializing the memory manager
  3. Reading the json and calling into the functions defined in the generated library

With the AOT flow we get rid of (3), and by issuing unpacked calls we avoid the use of the Function Registry (1). The memory handling can be pushed directly into the AOT runtime.

To be absolutely clear, we won’t need any Graph Runtime within the AOT flow, since its functionality is already provided by the generated library.

AOT runtime

The AOT runtime represents the shim layer provided to the user to invoke the given network compiled in the generated library. The API should include:

  • Memory handling (traditionally part of the Graph Runtime, which we removed).
  • Helpers to create DLTensors
  • Helpers to invoke run_func inside the generated library

We will be developing the AOT runtime as a C API, so that it will be easy to deploy AOT flows on embedded devices.

It would not be extremely hard in the future to add a C++ API.

User API

Let’s try to flesh out what the AOT runtime user API should look like below:

// Helper function to initialize a DLTensor
DLTensor TVMInitializeDLTensor(void *data, DLDataType* dtype, DLContext* ctx, int64_t* shape, int64_t num_dim);
 
// Helper function to run the `run_func` within the generated library network.o.
tvm_crt_error_t TVMRuntime_Run(tvm_model_t *model, DLTensor *inputs, int num_inputs, DLTensor *outputs, int num_outputs);

Internal API

The API to handle memory during the network execution will be mostly internal and not exposed to the user. The idea is to assume those two constants are defined:

#define AOT_MEMORY_NUM_PAGES (1<<10)
#define AOT_MEMORY_PAGE_SIZE_LOG2 12

And use them to instantiate a static memory area (with the defaults above, 1024 pages of 4 KiB each, i.e., a 4 MiB arena). There are ongoing efforts to estimate the memory footprint of the graph directly from TVMC (see the MicroTVM roadmap).

Self contained example

To make things clearer, below is a more detailed example that shows (in a pseudo-C language) how we intend everything to fit together. Please note that the library is supposed to be compiled with target=c --runtime=aot --link-params.

Code generation

In this section let’s have a look at what TVM would generate.

operators.c / lib.c

This contains the operator bodies and the body of _lookup_linked_param.

// lib.c
// Linked param lookup function definition
void _lookup_linked_param(TVMValue *,...) {}
 
 
// Operators definition
void fused_layout_transform_2(TVMValue *,...) {}
void fused_layout_transform_1(TVMValue *,...) {}
void fused_nn_contrib_conv2d_NCHWc_right_shift_cast(TVMValue *,...) {}

network.c

This file contains the declarations of the operators and the definition of run_func.

// network.c
// Linked param lookup function declaration
void _lookup_linked_param(TVMValue *,...);
 
// Operators declaration
void fused_layout_transform_2(TVMValue *,...);
void fused_layout_transform_1(TVMValue *,...);
void fused_nn_contrib_conv2d_NCHWc_right_shift_cast(TVMValue *,...);
 
 
// Main tvm_run_func (generated by TVM), which lives inside the library lib.o (or lib.c)
TVM_DLL int32_t tvm_run_func(TVMValue* values, ...,  void* resource_handle) {
    void* sid_3 = TVMBackendAllocWorkspace(1, 0, 32768, 2, 8);
 
    // Call to the first operator. Note as values[1], the output of the network,
    // is being used as an intermediate tensor by fused_layout_transform_2
    TVMValue tensors_0[2] = { values[0], values[1] };
    (void)fused_layout_transform_2(tensors_0, 2);
 
    // Call to the second operator
    TVMValue p0;
    (void)_lookup_linked_param(2, &p0);
    DLTensor sid_3_tensor = {.data = (void*) sid_3, ...};
    TVMValue tensors_1[3] = {values[1], p0, {.v_handle = &sid_3_tensor}};
    (void)fused_nn_contrib_conv2d_NCHWc_right_shift_cast(tensors_1, 3);
 
    // Call to the third operator
    TVMValue tensors_2[2] = {{.v_handle = &sid_3_tensor}, values[1]};
    (void)fused_layout_transform_1(tensors_2, 2);

    TVMBackendFreeWorkspace(1, 0, sid_3);
    return 0;
}
 
// Entry point wrapper (generated by TVM, also lives inside the library)
tvm_model_t network = {
    .run_func = tvm_run_func,
    .num_input_tensors = 1,
    .num_output_tensors = 1,
};

Memory management

In this section we illustrate how the memory management side of the things will look like.

aot_platform.c

// aot_platform.c
#ifndef AOT_MEMORY_NUM_PAGES
#define AOT_MEMORY_NUM_PAGES (1<<10)
#endif
 
#ifndef AOT_MEMORY_PAGE_SIZE_LOG2
#define AOT_MEMORY_PAGE_SIZE_LOG2 12
#endif
 
static uint8_t page_size_log2 = AOT_MEMORY_PAGE_SIZE_LOG2;
static uint8_t g_aot_memory[AOT_MEMORY_NUM_PAGES * (1 << page_size_log2)];
static MemoryManagerInterface* g_memory_manager;
 
void* TVMBackendAllocWorkspace(int device_type, int device_id, uint64_t nbytes, int dtype_code_hint,
                               int dtype_bits_hint) {
  void* ptr = NULL;
  DLContext ctx = {device_type, device_id};
  g_memory_manager->Allocate(g_memory_manager, nbytes, ctx, &ptr);
  return ptr;
}
 
int TVMBackendFreeWorkspace(int device_type, int device_id, void* ptr) {
  DLContext ctx = {device_type, device_id};
  return g_memory_manager->Free(g_memory_manager, ptr, ctx);
}
 
tvm_crt_error_t MemoryManagerCreate(MemoryManagerInterface** manager, uint8_t* memory_pool,
                                    size_t memory_pool_size_bytes, size_t page_size_bytes_log2) {
   //copied from crt.
}

Shim layer exposed to the user

In this section we describe the shim interface layer used directly by the application.

aot_runtime.c

// aot_runtime.c
tvm_crt_error_t TVMRuntime_Run(tvm_model_t *model, DLTensor *inputs, int num_inputs, DLTensor *outputs, int num_outputs)
{
    MemoryManagerCreate(&g_memory_manager, g_aot_memory, sizeof(g_aot_memory), AOT_MEMORY_PAGE_SIZE_LOG2);
     
    TVMValue tvm_values[num_inputs+num_outputs];
    int i = 0;
    for (; i < num_inputs; i++){
        tvm_values[i].v_handle = &inputs[i];
    }
 
    for (; i < num_inputs + num_outputs; i++){
        tvm_values[i].v_handle = &outputs[i - num_inputs];
    }
 
    return model->run_func(tvm_values, ...);
}

Main application and compilation

In this section we will describe what the end user would write and how the shim would be invoked.

main.c

This file represents the main application written by the user.

// main.c
#include <aot_runtime.h>
extern tvm_model_t network;
 
int main()
{
    DLTensor input = TVMInitializeDLTensor(..);
    DLTensor output = TVMInitializeDLTensor(..);
    DLTensor inputs[1] = {input};
    DLTensor outputs[1] = {output};
    TVMRuntime_Run(&network, inputs, 1, outputs, 1);
    return 0;
}

We can compile everything with a command similar to:

# Compilation
$ $(CC) main.c network.c lib.c aot_runtime.c aot_platform.c -DAOT_MEMORY_NUM_PAGES='(1<<12)'

Conclusions

In this RFC we outlined the different parts of our proposal. These can be categorized in two macro areas:

  • Code generation We decided to generate a run_func function to issue calls into the operators in the library. The function won’t make use of the function registry or of any helpers contained within the library network.o
  • Runtime API We decided to provide a (non-generated) wrapper library to be used by the users in order to call into the main function and create the necessary data structures to be passed to it

Please share your thoughts/feedback!

@areusch @manupa-arm @Leo-arm @MarisaKirisame @monklof @jroesch @slyubomirsky @zhiics @ramana-arm @mjs

I notice you talk entirely about the graph runtime here, but I see no mention of the relay VM. Have you thought about how to include features from the relay vm in AOT (dynamic shape, dynamic control flow)? Also, I see mention in the tvm docs of a relay ahead-of-time compiler. I don’t know if this actually exists, but if it does, how does this AOT approach compare?

Hi @tkonolige,

for now we are not looking at the relay vm. This RFC is mostly an enabler for embedded environments where the json is prohibitive and the memory is a scarce resource.

I am not familiar with the relay vm, so I am not sure about the effort involved in supporting it.

About the relay ahead-of-time compiler, could you show me where it is mentioned in the docs? I had a look at it, and I believe it is cited as future work, so this RFC is actually describing what the doc names the “relay ahead-of-time compiler”.

hi @giuseros,

Thanks for posting this RFC! Implementing AOT runtime will be a great addition to µTVM and for other use cases. Here are some thoughts on the approach so far:

typedef struct {
 int32_t (*run_func)(void* args, void* arg_type_ids, int32_t num_args, void* out_ret_value, void* out_ret_tcode, void* resource_handle);
 uint32_t num_input_tensors;
 uint32_t num_output_tensors;
 int8_t use_accelerator;
} tvm_model_t;

…

  • The boolean use_accelerator can be used in order to populate the resource_handle variable in case we need to pass OS specific resources down to the operators.

Would it be possible to do a first cut omitting accelerator support? I would like to contemplate how to configure accelerator instances, which I think should somewhat match the way we configure GraphRuntime (I.e. supply configuration data per-TVMContext). I also think we should consider how to supply non-NULL resource_handle, if this is needed in your BYOC. I think we may need some more motivating examples, and I’m not convinced a global flag would cut it here. Perhaps best to consider this in a separate RFC? I also have a related RFC I’ll be releasing around compiler output shortly, which may help here.

Please note that we don’t need to specify --system-lib anymore, since the system library won’t be included in the generated library.

It almost seems like this could be orthogonal to AOT–you could create an AOT module with --system-lib, but you don’t have to.

Unpacked calls

Our code generator would issue tir.extern calls, manually packing/unpacking the arguments for the different operators contained in the library (very similar to what happens in the lower_builtin pass). In this way, we are de facto bypassing the function registry.

When only the c or llvm code generator is in use (guaranteed true when BYOC isn’t in use) and the C runtime is used, then the C names of generated functions are controlled by CodegenC. In this case, it’s possible to call them directly with tir.call_extern. When targeting the C++ runtime, it’s a different story:

  • AOT would live in a tree of runtime::Module
  • Each runtime::Module is consulted in a tree DFS to find PackedFunc linked into the library
  • TVMBackendGetFuncFromEnv exists to help with this lookup

The FuncRegistry in the C runtime is meant to replace this tree lookup with a single function table. I think you’re right that it’s more important in the GraphRuntime or RPC case, but considering we would like to also target the C++ runtime, perhaps it would be good to start with tir.call_packed, and we could consider a follow-on to move to tir.call_extern for C runtime use case, if needed?

User API

I like that this runtime looks quite minimal. However, there is a separate Module-based model runtime interface RFC we should consider as well. In particular, this interface splits apart the setup (e.g. memory allocation) and run phases of inference. It would be great to see if we could implement this interface with AOT, either here or with runtime shims; or, whether changes to that interface would make that possible.

I do think in particular that avoiding the need to copy data to SetInput is a good thing, and that may not be contained within that interface. However, some broader changes could be made when implementing it in C, particularly around memory management.

The idea is to assume those two constants are defined:

#define AOT_MEMORY_NUM_PAGES (1<<10)
#define AOT_MEMORY_PAGE_SIZE_LOG2 12

And use them to instantiate a static memory area.

Could you speak a bit more about how you want to handle memory allocation in the initial implementation? Which DLTensor would need to be allocated from within the generated code? Who defines these constants?

Please share your thoughts/feedback!

One other thing:

  • Would the generated TIR AOT module be the runtime::Module instance (either LLVMModule or CSourceModule) returned from tvm.relay.build?

About the relay ahead-of-time compiler, could you show me where is it mentioned in the docs? I had a look at it, and I believe it is cited as future work, so this RFC is actually describing what in the doc is named “relay ahead-of-time compiler”.

This is the old AOT compiler, which lowers (front end) Relay code into C++, though it still calls into TVM’s runtime for operators. Is that what you had in mind?

Hi @areusch ,

Thanks for your comments! Before replying in-line, let me first clarify two things:

  • We are not changing the C runtime or the C++ runtime. We are creating a parallel runtime, namely AOT, which will live in src/runtime/aot. The user will specify --runtime=aot to access this runtime.
  • We are mainly targeting embedded scenarios, for now. Indeed, while for other environments the AOT is nice-to-have, for embedded platforms this is a must-have.

That said, let me reply to your points

Would it be possible to do a first cut omitting accelerator support?

Yes, this is fine. We can omit the boolean value for now, and work on this at a later stage. The main point, as you correctly spotted, is to understand how to populate the resource_handle in the call to the run_func

It almost seems like this could be orthogonal to AOT–you could create an AOT module with --system-lib, but you don’t have to.

Yes, this is correct, but since we are trying not to use packed calls to the functions, I am wondering why we would need to add it to the library. In other words, given we use tir.call_extern, why do you think we need a mapping [string -> function pointer] in the library?

The FuncRegistry in the C runtime is meant to replace this tree lookup with a single function table. I think you’re right that it’s more important in the GraphRuntime or RPC case, but considering we would like to also target the C++ runtime, perhaps it would be good to start with tir.call_packed, and we could consider a follow-on to move to tir.call_extern for C runtime use case, if needed?`

From what I understood, the CodegenC path is used if we specify the c back-end, independently of the runtime. And also, independently of the runtime, all the operators will live in the same library. The only difference, when we specify --runtime=aot, is that we will have an additional function, namely run_func, which contains a series of calls like:

rv = fused_nn_contrib_conv2d_NCHWc_right_shift_cast(subcall_values, subcall_tcodes, 3 , &subcall_ret_value, &subcall_ret_tcode, NULL);

This will compile fine, since fused_nn_contrib_conv2d_NCHWc_right_shift_cast will live in the same translation unit, i.e., lib.o or lib.c (I am trying to avoid the word “module” here to avoid confusion with the TVM modules). To be absolutely clear, let’s consider this code.

lib = tvm.relay.build(mod, target, params=params)
lib.lib.save('lib.o') # lib.lib.save('lib.c') if codegen target is c

If I execute nm lib.o I see that the functions are all there. I understand that in the JSON case we need a way to translate a string from the JSON to a function call in the library, and to achieve that translation (without dlopen) we need a function table embedded in the library. Since we are getting rid of the JSON, I don’t think we need this mapping any more.

About the RPC case, for now the main AOT requirement is deployability. To tune a given board we will stick with the C runtime, at least for now.

I like that this runtime looks quite minimal. However, there is a separate Module-based model runtime interface we should consider as well. In particular, this interface splits apart the setup (e.g. memory allocation) and run phases of inference. It would be great to see if we could implement this interface with AOT, either here or with runtime shims; or, whether changes to that interface would make that possible. I do think in particular that avoiding the need to copy data to SetInput is a good thing, and that may not be contained within that interface. However, some broader changes could be made when implementing it in C, particularly around memory management.

I did read that RFC, and this was my reasoning:

  • We are trying here to implement the basics of AOT. The main part will be in the code generation. About the interface, we thought to propose a very minimal interface within a shim layer so that the user can easily deploy the network on an embedded device.
  • Once we get this right, we can implement more complex interfaces within the aot_runtime.h, and those interfaces can be offered to the user in the form of the Module-based interface or any other interface. The main thing here is to move the control code inside the library, and deliver the minimal API to use it

Could you speak a bit more about how you want to handle memory allocation in the initial implementation? Which DLTensor would need to be allocated from within the generated code? Who defines these constants?

Sure. So for now we will essentially be using the crt memory allocator, but as a copy living inside src/runtime/aot/. This is because the main scope of this RFC is to bring in AoT compilation, and later on we can take further steps to improve/provide “helper” allocators that are better than what is in crt.

So there will be a preallocated, statically initialized buffer (whose size can default to some value, but can be changed manually by the user) and functions like TVMBackendAllocWorkspace will work on that buffer. The constants I mention concern the size of this buffer, and this can be preset or directly provided by the user. At a later date this will need to be removed, as the compiler should automatically produce the total static size of the buffer it needs.

As for the DLTensors:

  • For the intermediate tensors we internally allocate through TVMBackendAllocWorkspace and then we can wrap the allocated memory in DLTensors (in the same spirit as lower_builtin.h); a sketch of this wrapping is shown after this list.
  • For the I/O tensors the user initializes the input/output buffers and wraps them in DLTensors with a call to TVMInitializeDLTensor.
  • For the params, we are linking them in. So we would call _lookup_linked_param (still through an extern call) to get hold of the parameters
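
For the first point, here is a rough sketch of the wrapping (an assumption for illustration, not the actual generated code; shapes, dtypes and the helper name are made up):

#include <tvm/runtime/c_backend_api.h>

// Allocate an intermediate buffer and wrap it in a DLTensor, in the same
// spirit as what the lowering pass emits (illustrative sketch only).
DLTensor wrap_intermediate(int64_t* shape, int ndim) {
  void* sid = TVMBackendAllocWorkspace(/*device_type=*/kDLCPU, /*device_id=*/0,
                                       /*nbytes=*/32768,
                                       /*dtype_code_hint=*/2, /*dtype_bits_hint=*/8);
  DLTensor t;
  t.data = sid;
  t.ctx = (DLContext){kDLCPU, 0};
  t.ndim = ndim;
  t.dtype = (DLDataType){kDLInt, 8, 1};  // e.g., int8 activations
  t.shape = shape;
  t.strides = NULL;                      // compact row-major layout
  t.byte_offset = 0;
  return t;
}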

I modified the strawman image (in the RFC) with a proper self-contained example to show the overall flow. Please, let me know if that explains things more clearly.

Would the generated TIR AOT module be the runtime::Module instance (either LLVMModule or CSourceModule) returned from tvm.relay.build?`

I was thinking to have a separate module AOTModule that will import the different modules within it. This is in the same spirit of the Metadata module. As we use the metadata module to share the Function Registry among the different TVM modules, we will use the AOTModule to share the run_func among different TVM modules

Hi @slyubomirsky ,

Thanks for the pointer. However, this is a parallel approach in which you would generate the json, and use a python script to “translate” it to C.

We also evaluated this approach (which is, correct me if I am wrong, not yet integrated in TVM). It would sort out some issues, like getting rid of the JSON, but would leave the unified memory door closed.

That’s why we think that going Relay->TIR->{C, LLVM} would make the most sense. Not only do we get rid of the Json, but we also open a possible pathway to a unified memory planner, since everything would be just TIR.

Please, let me know what you think, Giuseppe

@giuseros thanks for your reply! I think this approach makes sense to me–I want to clarify a few more things.

First, we have unfortunately overloaded the word “runtime.” There are 2 different families of runtimes:

  • c and c++ runtime – describes the implementation of c_runtime_api.h and c_backend_api.h.
  • graph, vm, aot runtime – describes how the operator functions are invoked in a model. eventually, could be stated similarly to the above as “describes the implementation of the module-based model interface.” should really be called GraphExecutor or something, but that’s another topic.

I am actually going to send an RFC to propose we rename GraphRuntime and family to e.g. GraphExecutor this week.

for AOT runtime I agree we do not need JSON parsing or any of the underlying facilities it brings. However, given it seems like you’re planning to reuse the C-runtime memory allocator and interfaces in include/tvm/crt/platform.h, I think it would be great to continue using --runtime=c in the target string and create an additional flag or other tvm.relay.build() argument. I don’t know that the (graph) runtime specification belongs in the Target string.

The main point, as you correctly spotted, is to understand how to populate the resource_handle in the call to the run_func

Could you say why you need this set? Currently it’s always NULL. I think it would be great to develop a pattern to use it, but right now the most natural pattern is to set it to the TVMModule instance that contains the operator function.

Since we are getting rid of the JSON, I don’t think we need this mapping any more.

A couple of thoughts:

  1. It would be nice to keep the logic for assembling PackedFunc args and handling return values in tir.call_packed. This way if we change the interface, we don’t have to look in too many places.
  2. Mainly I’m trying to make sure that to simplify the compiler, we implement the same conceptual TIR on both C++ and C runtimes. In the C++ runtime, we use PackedFunc as a “calling convention” to avoid needing to effectively hardcode C in various code generators. For instance, when dispatching to a compute library e.g. CUDA, a PackedFunc serves as a sort of adapter glue layer between TVM and CUDA.
  3. In the C++ runtime, not all PackedFunc live in the same runtime::Module. So, we need the string lookup to do a sort of “late-binding.” In the C runtime, you’re right that the primary use case for this late-binding is with the RPC server. Perhaps we should just change CodeGenC and CodeGenLLVM to implement tir.call_packed when targeting C runtime by calling the symbol directly with the PackedFunc API instead of invoking TVMBackendGetFuncFromEnv. Would this address your concerns?

The main thing here is to move the control code inside the library, and deliver the minimal API to use it

Ok, that makes sense.

I modified the strawman image (in the RFC) with a proper self-contained example to show the overall flow. Please, let me know if that explains things more clearly.

Yeah this makes sense. Sounds good to me.

I was thinking to have a separate module AOTModule that will import the different modules within it.

That also makes sense. I think my question was poorly worded before. Just confirming that, similar to MetadataModule, this would be lib, in the return value from graph_json, lib, params = tvm.relay.build()? At present, those things are wrapped in GraphRuntimeFactoryModule, and we’ll need to address that. I have another RFC forthcoming in a week or so to discuss changes there designed to support µTVM and accelerator use cases.

Hi @areusch ,

Thanks for the interesting reply! I am going to be off tomorrow, so I will probably think about your reply over the (long) week-end and get back to you early next week

Thanks, Giuseppe

I agree that going through TIR is a better way and will definitely allow for finer-grained control.

Hi Andrew,

for AOT runtime I agree we do not need JSON parsing or any of the underlying facilities it brings. However, given it seems like you’re planning to reuse the C-runtime memory allocator and interfaces in include/tvm/crt/platform.h, I think it would be great to continue using --runtime=c in the target string and create an additional flag or other tvm.relay.build() argument. I don’t know that the (graph) runtime specification belongs in the Target string.

Thanks for this clarification. Yes, this interface is fine for now. About the implementation we will have aot_runtime.h in a separate src/runtime/aot folder which will #include the crt memory manager from src/runtime/crt, for now. In future we will make a memory manager specific for AOT (possibly code generating information like the required memory to run the network).

Could you say why you need this set? Currently it’s always NULL. I think it would be great to develop a pattern to use it, but right now the most natural pattern is to set it to the TVMModule instance that contains the operator function.

So the short answer is that we don’t have a clear idea yet. But we were hoping to actually develop a pattern to use it, as you suggest. That’s though something I think deserves a separate and more detailed discussion :slight_smile:

  1. It would be nice to keep the logic for assembling PackedFunc args and handling return values in tir.call_packed. This way if we change the interface, we don’t have to look in too many places.
  2. Mainly I’m trying to make sure that to simplify the compiler, we implement the same conceptual TIR on both C++ and C runtimes. In the C++ runtime, we use PackedFunc as a “calling convention” to avoid needing to effectively hardcode C in various code generators. For instance, when dispatching to a compute library e.g. CUDA, a PackedFunc serves as a sort of adapter glue layer between TVM and CUDA.
  3. In the C++ runtime, not all PackedFunc live in the same runtime::Module. So, we need the string lookup to do a sort of “late-binding.” In the C runtime, you’re right that the primary use case for this late-binding is with the RPC server. Perhaps we should just change CodeGenC and CodeGenLLVM to implement tir.call_packed when targeting C runtime by calling the symbol directly with the PackedFunc API instead of invoking TVMBackendGetFuncFromEnv. Would this address your concerns?

Yes, I like this approach. Basically we get rid of the system library in c, but not of the dynamic system library in c++ (where it probably is less of an issue). This means this work could possibly be extended to support c++ runtime in the future.

That also makes sense. I think my question was poorly worded before. Just confirming that, similar to MetadataModule, this would be lib, in the return value from graph_json, lib, params = tvm.relay.build()? At present, those things are wrapped in GraphRuntimeFactoryModule, and we’ll need to address that. I have another RFC forthcoming in a week or so to discuss changes there designed to support µTVM and accelerator use cases.

Yes, this exactly what I meant. I am looking forward to the RFC!

Thanks,

Giuseppe

hi @giuseros,

About the implementation we will have aot_runtime.h in a separate src/runtime/aot folder

Would it be possible to create just a library e.g. src/runtime/crt/aot_executor? This will make things less complicated when the C runtime is distributed with a TVM wheel.

So the short answer is that we don’t have a clear idea yet. But we were hoping to actually develop a pattern to use it, as you suggest. That’s though something I think deserves a separate and more detailed discussion :slight_smile:

Okay that seems reasonable. I think there are definitely some good use cases for resource_handle, but want to make sure the abstraction is at the right level.

Basically we get rid of the system library in c, but not of the dynamic system library in c++ (where it probably is less of an issue). This means this work could possibly be extended to support c++ runtime in the future.

Yeah I think having a few implementations of tir.call_packed may provide more opportunities for future development. cc @tqchen for more thoughts here.

It would be nice to contemplate how we might be able to keep compatibility with --system-lib even if it may be overkill in some situations. I think a small C wrapper that effectively implements a tir.call_packed to instantiate the model could be one way to do this. We also don’t need to settle on this before making a first implementation of AOT in TIR.

Yes, this exactly what I meant. I am looking forward to the RFC!

Great, I’m iterating on this a bit and hope to post it next week.

Hi all, I was finally able to have a first version of the AOT work in a PR upstream.

PR

You can find the PR here: [AOT] Introducing AOT in TVM by giuseros · Pull Request #7785 · apache/tvm · GitHub

At this stage, I gladly accept any feedback on things that can be improved in the PR or on issues I might have overlooked. Please, help me smooth the edges of this work :slight_smile:

Limitations

There are two main limitations of the current work:

  • We didn’t add support for LLVM code generation. This is because we thought it better to agree on the overall picture first, using the c backend as a PoC, and then take care of the LLVM backend
  • We didn’t include support for LetNode in the aot_codegen. Support for the LetNode is in the pipeline and will be added soon

Next steps

Bear in mind that this is only the first step of a journey. We are currently working on different improvements to AOT, in particular:

  • LLVM support LLVM support is currently being worked on and we are almost there
  • Name mangling We are adding name mangling into the picture, i.e., the user should be able to specify a prefix and this prefix should be added to all the global names used in the library. In this way, we will enable the user to build and link more than one network in the same application.
  • DLTensor surgery Since the memory allocation is done statically, we don’t need to carry DLTensors through the generated code, as they expose metadata that is not consumed by the codegen and only increases the size of the binary image to be flashed on the microcontroller
  • Unpack the runner function signature Change the API of the runner function. Indeed, we would like the runner function to not have a packed API signature. This is to avoid instantiating type_ids or forcing a dynamic size of the function stack (all things that don’t add benefits in the embedded space, but take a toll in terms of code size, performance and power)
  • int64_t surgery Using int64_t on embedded devices usually increases register spilling, which means power and performance will be heavily affected. We are removing this datatype in every place it’s being used.
  • Remove param lookup through __lookup_linked_param: in order to make things simple, we are currently reusing the __lookup_linked_param function to access the parameters in the library. However, with AOT we can simply create a TIR builtin that accesses the parameters directly without going through the overhead of a function invocation. This is still with the aim of saving power, performance and space. A hypothetical sketch of the difference is shown after this list.
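
As a purely hypothetical sketch of that last point (all symbol names below are made up for illustration), the generated code could go from a packed lookup call to a direct reference to the linked constant:

#include <stdint.h>

// Parameter blob emitted by --link-params (the symbol name is illustrative).
static const int8_t tvmgen_param_p0[64] = {0};

static const int8_t* get_param_p0(void) {
  // Today: TVMValue p0; _lookup_linked_param(...); then read p0.v_handle.
  // With a direct-access builtin, codegen could simply emit:
  return tvmgen_param_p0;
}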

cc: @ramana-arm @manupa-arm @areusch @mbaret @stoa @mjs

FYI: I will be out for Easter holidays until Tuesday (so I will be replying back to any comments as soon as I come back :slight_smile: )

Hi @giuseros, @manupa-arm,

I wanted to discuss one higher-level topic from the PR here: memory planning. Currently the AOT PR also implements some memory planning in its tvm_backend.c. I think it’d be great to separate that from the AOT PR and continue to use TVMBackendAllocWorkspace, even though it’s less efficient. The main reason for this is that we’re concurrently beginning to work on the Graph Memory Planner and I think it makes sense to handle all of the tensor pinning at that level, and produce some configuration that the executor can consume to decide where to place DLTensor at runtime.

This is fairly complex so we’ll release another RFC at some point in the future. What’re your thoughts here?

-Andrew

Hi @areusch , Just to be clear, we are not doing memory planning in the current AOT :slight_smile:

What you see in tvm_backend.c is a memory allocator. Instead of going through the complex page allocator needed by the graph executor, we thought to implement a simpler one for AOT, which behaves like a stack (with a LIFO policy).

This can be proved to work, because in AoT we allocate the storage identifiers through let statements and the internal operators also use let, so everything is LIFO and the stack works.

This couldn’t work with the graph executor mostly because of the JSON (which was using the same allocator but was not following a LIFO convention)
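
For illustration, here is a minimal sketch of such a LIFO allocator (the arena name, size and alignment are assumptions, not the PR's actual code):

#include <stdint.h>
#include <stddef.h>

#define ARENA_BYTES (64 * 1024)
static uint8_t g_arena[ARENA_BYTES];
static size_t g_top = 0;   // current top of the stack

void* StackAlloc(size_t nbytes) {
  // Align allocations to 8 bytes.
  size_t aligned = (nbytes + 7u) & ~((size_t)7u);
  if (g_top + aligned > ARENA_BYTES) return NULL;
  void* ptr = &g_arena[g_top];
  g_top += aligned;
  return ptr;
}

// Frees are only valid in reverse allocation order (LIFO), which holds because
// storage identifiers and operator workspaces are introduced via nested lets.
void StackFree(void* ptr) {
  g_top = (size_t)((uint8_t*)ptr - g_arena);
}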

As a side note, we are also planning work on a global memory planner, so it would be good to catch up at some point in order to reduce overlap.

Thanks,

Andrew

@giuseros on microTVM, the actual implementation (when using TVMBackendAllocWorkspace) is left up to TVMPlatformMemoryAllocate. Would it be possible to move the lifo impl behind this call? This would make it easier to try the AOT executor in other non-micro use cases.

Agree we should discuss about global memory planning at some point soon.

Thanks everyone for the great discussion so far and the initial AOT PoC. Thanks @giuseros and others for bringing in the first PoC. I finally got time to look into the proposed changes; this is great work.

My main comments so far have to do with interface design and how to make things architecturally consistent.

Specifically, it would be great to think about the general API design and consolidation. In particular, we should de-couple the implementation of the API (AOT vs interpreter based) from the API interface design.

Ideally a user should use a similar compilation path for compiling (except for a different flag), exporting, and then loading an AOT module.

Right now there are a few variants of ways to expose the model generated by AOT:

  • W0: Through the runtime.Module and PackedFunc interface, the executor is a runtime.Module which contains three packed functions (set/get/run); this is in alignment with the Module-based runtime interface mentioned previously
  • W1a: A standardized C API for graph/aot model execution only in C runtime.
  • W1b: A standardized C API runtime that wraps the module-based API (W0) and exposes a higher level API to the user.
  • W2: A separate C API that allows direct invocation of the generated model, specifically for AOT

From W2 => W1 => W0 there are different levels of standardization being involved.

For example, if AOT generates code that obeys the W0 convention, then we can naturally test the result generated by AOT directly through python, and run the code through RPC using the current set of infrastructure. The AOT tutorial can then sit directly inside the µTVM tutorials via python.

W1a and W1b are similar to each other (from the users’ PoV), except that in the case of W1b, W0 is the first-class citizen and the common part, while W1a models things the other way around. Finally, W2 means the developers need to be aware of the backend that is being used.

Given the importance of embedded setting, I think it is useful to have some form of W1(a or b), that allows users to directly have a set of common convention for C runtime. However, such API ideally should not be AOT specific, but instead the “official” way to use all generated results in C API.

I also think it would be useful to always start by thinking about W0 support. Although W0 introduces an indirection (e.g., the run function can be a direct C API instead of a PackedFunc), we already use PackedFunc for the per-operator functions, so using PackedFunc for the general case won’t add too much of an overhead, but would enable the benefits mentioned above.

Would love to get everyone’s take, in terms of (1) engineering feasibility/ overhead of the Ws, (2) preference of the interface choice.

Hi @tqchen,

The main issue here is that we are targeting embedded environments. I am not a deep embedded expert (@mjs, @ramana-arm feel free to chime in), but my understanding is that the runtime API we offer to embedded developers needs to be quite minimal. Basically, we want to save every single byte in order to fit in the limited space embedded devices provide.

So, given that we see AOT as the first step toward tiny devices, we opted for W1a, basically. Our preference, as things move forward, would be to have a tiny, specific runtime interface that embedded developers can use, one that does not rely on large data structures (e.g., TVMValue or DLTensor) and that involves the minimal set of #includes. So basically we are thinking along the lines of W2 for embedded.

While I understand the benefits of a general interface, if we want to be comparable to embedded compilers (e.g., see https://arxiv.org/pdf/2007.10319.pdf) I think the need to abstract such a “tiny” interface is appropriate.

I am not against to future generalizations of the interface (i.e., W0 → W1b), but I think we can defer these to a later stage (also because they seem independent from the PR that is upstream), while focusing on embedded (W1a → W2) for now.