Implementing AOT in TVM

Motivation

In its current state, the TVM compilation flow produces two (or optionally three) artifacts:

  1. The library containing the operators
  2. The parameters of the network (i.e., the weights of the model). This has been made optional with the --link-params option to the host target.
  3. A json file describing the control flow of the network. i.e., which operators should be run.

Generally speaking, the TVM runtime consists of an interpreter that reads the json file (3) and - optionally - the weights (2), and calls the operators contained in the library (1). This mechanism is described in the following picture:

While the params.bin is optional (i.e., we can omit it if we provide a --link-params flag to the compilation, see this RFC), the json file network.json is mandatory.

To be clear, there is currently no flow that allows the user to avoid providing the json file to the runtime.

This is a problem for two main reasons:

  • This workflow is very hard to implement on a micro-controller, since memory is a costly resource in embedded environments and the json file is usually quite large.
  • We have a split memory allocation in the current TVM stack, where inter-operator memory is managed at the json/relay level while intra-operator memory is managed at the TIR level.

We at Arm are working on an AOT flow to get rid of the Json, and to transform the above graph into the following single-artifact scenario:

The user can optionally specify the name of the network, so that we can have a network_resnet.o, network_mobilenet.o, etc… For this RFC we will refer to a generic network.o (as shown in the picture).

The idea in the above image is that the network.o will expose a runner function which will take care of calling the different operators in the same library. We will temporarily call this function run_func, but naming is something that we will need to define.

The aim of this RFC is to provide a source of discussion on different topics that can help us through the development of this feature. The main topics of interest are:

  • Code generation (i.e., IRModule + runtime::Module generation)
  • Runtime API

Code generation

The aim of code generation is to go from a Relay graph to a runtime::Module containing the control execution of the graph. We split this process into two different parts:

  • runtime::Module generation
  • runtime::Module bundling

TIR code generation

In recent times TIR has been augmented with runtime functionalities (e.g., the possibility to return a value), which makes it ready to handle runtime code like creating NDArrays, NDShapes, calling functions, returning values, etc.

This solution provides several benefits:

  • We would be reusing a lot of TIR functionalities (less code duplication)
  • We can set the foundations to implement a global memory planner, thus reducing the memory footprint of the network (which is quite valuable for microcontrollers)

The signature of the run_func generated by the TIR code generator would be the same as that of any packed function:

int32_t run_func(void* args, void* arg_type_ids, int32_t num_args, void* out_ret_value, void* out_ret_tcode, void* resource_handle);

In the following sub-sections we highlight some details about the code generation process.

Unpacked calls

Our code generator would issue tir.extern calls, manually packing/unpacking the arguments for the different operators contained in the library (very similar to what happens in the lower_builtin pass). In this way, we are de facto bypassing the function registry.
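
As a rough sketch (hand-written here, with a made-up operator name fused_add), the kind of call our codegen would emit could look like the following: the TVMValue/type-code arrays are packed manually and the operator symbol is called directly, with no registry lookup in between.

#include <tvm/runtime/c_runtime_api.h>

// Hypothetical operator symbol, with the usual packed-function signature.
int32_t fused_add(TVMValue* args, int* type_codes, int num_args,
                  TVMValue* out_ret_value, int* out_ret_tcode, void* resource_handle);

static int32_t call_fused_add(DLTensor* a, DLTensor* b, DLTensor* out) {
  TVMValue values[3];
  int tcodes[3] = {kTVMDLTensorHandle, kTVMDLTensorHandle, kTVMDLTensorHandle};
  values[0].v_handle = a;
  values[1].v_handle = b;
  values[2].v_handle = out;

  TVMValue ret_value;
  int ret_tcode;
  // Direct call to the symbol: no string lookup through a function registry.
  return fused_add(values, tcodes, 3, &ret_value, &ret_tcode, NULL);
}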

Runner descriptor

While it would be possible to directly expose run_func in the generated library, we would instead wrap this function in a descriptor, i.e., a struct with the following fields:

typedef struct {
  int32_t (*run_func)(void* args, void* arg_type_ids, int32_t num_args, void* out_ret_value, void* out_ret_tcode, void* resource_handle);
  uint32_t num_input_tensors;
  uint32_t num_output_tensors;
} tvm_model_t;

Wrapping run_func in a descriptor provides several benefits:

  • We can use the fields of the descriptor as a means for the application to sanity-check the arguments passed to run_func (see the sketch after this list)
  • This will be the only entry point that needs to be exposed by the library network.o
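
For example (a sketch only, not part of the proposed API; it assumes the tvm_model_t definition above), the shim layer could use these fields to reject calls with a mismatched number of tensors before invoking run_func:

// Sketch: sanity-checking the arguments against the descriptor before calling run_func.
static int32_t run_checked(tvm_model_t* model,
                           DLTensor* inputs, int num_inputs,
                           DLTensor* outputs, int num_outputs) {
  if ((uint32_t)num_inputs != model->num_input_tensors ||
      (uint32_t)num_outputs != model->num_output_tensors) {
    return -1;  // mismatched argument count
  }
  /* ... pack inputs/outputs into TVMValue arrays and call model->run_func ... */
  return 0;
}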

Name mangling

TVM codegen should not invade the application symbol namespace and should use the “implementation defined” namespace, which in C- and C++-like languages (or indeed in ELF symbol land) is any symbol name prefixed with a _. Furthermore, symbol names should be unique, so that multiple compiled models can be statically linked into the same application. This can be achieved with the following changes:

  • The user will specify a name for the network to compile, and the global names will be suffixed with this name
  • The inner operators and the run_func will be declared “static” within the library. This shields them from the outside world, and we only expose the tvm_model_t entry point (which will be properly suffixed), as sketched below.
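
As a sketch (symbol names are made up; the exact suffixing scheme is still to be defined), the generated library for a network named resnet could look like:

// Operators and the runner are static: they are not visible outside the library.
static int32_t fused_conv2d(TVMValue* args, int* type_codes, int num_args,
                            TVMValue* ret, int* ret_tcode, void* handle) { /* ... */ return 0; }

static int32_t tvm_run_func_resnet(TVMValue* args, int* type_codes, int num_args,
                                   TVMValue* ret, int* ret_tcode, void* handle) { /* ... */ return 0; }

// The only externally visible symbol is the suffixed descriptor.
tvm_model_t network_resnet = {
    .run_func = tvm_run_func_resnet,
    .num_input_tensors = 1,
    .num_output_tensors = 1,
};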

Parameters

For now we will assume that the parameters are linked within the library: in other words the flag --link-params is mandatory with the AOT flow.

Bundling all the modules together

In our design we will store the generated IRModule as an additional field of the LoweredOutput data structure.

struct LoweredOutput {
  std::string graph_json;
  Map<String, IRModule> lowered_funcs;
  IRModule aot_runtime; // The generated module containing the run_func control code
  Array<tvm::runtime::Module> external_mods;
  std::unordered_map<std::string, std::pair<int, const tvm::runtime::NDArray>> params;
};

We can then pass the aot_runtime module to a new CreateAOTModule function (analogous to CreateMetadataModule):

auto aot_ir_module = getAOTModule();
auto aot_mod = tvm::build(aot_ir_module, target_host, target_host);
ret_.mod = tvm::codegen::CreateAOTModule(ret_.params, ret_.mod, ext_mods, aot_mod, GetTargetHost());

In the above snippet of code, the function CreateAOTModule will take care of adding the run_func definition in the library and will import the other modules (so that run_func will be common to all the modules).

Target host specification

To kick off the AOT flow we propose to add an additional runtime, namely aot, to the list of existing runtimes available in the target host.

The target host to generate an AOT-ready library would look like:

target_host = 'c --runtime=aot --link-params'

Please note that we don’t need to specify --system-lib anymore, since the system library won’t be included in the generated library.

Runtime changes

This section describes how we can expose the content of the generated library network.o to the user.

Our idea is to create an additional aot_runtime folder which would live next to the crt and graph runtime folders. In this way all the other flows will remain available and unchanged, and in the meantime we can gradually extend the aot runtime to support different use cases.

Before we move on in this section, let’s clarify the difference between the aot_runtime and the graph_runtime:

  • Graph runtime - is the runtime used to read the json and to call the operators within the library.
  • AOT runtime - this represents a shim layer containing helper functions to carry out the execution of the network.

Graph runtime removal

The graph runtime in the current state takes care of:

  1. Initializing the Function Registry
  2. Initializing the memory manager
  3. Reading the json and calling into the functions defined in the generated library

With the AOT flow we get rid of (3), and by issuing unpacked calls we avoid the use of the Function Registry (1). The memory handling (2) can be pushed directly into the aot runtime.

To be absolutely clear, we won’t need any Graph Runtime within the aot flow, since its functionality is already provided by the generated library.

AOT runtime

The AOT runtime represents the shim layer provided to the user to invoke the given network compiled in the generated library. The API should include:

  • Memory handling (traditionally part of the Graph Runtime, which we removed).
  • Helpers to create DLTensors
  • Helpers to invoke run_func inside the generated library

We will be developing the AOT runtime as a C API, so that it will be easy to deploy AOT flows on embedded devices.

It would not be extremely hard in the future to add a C++ API.

User API

Let’s try to flesh out what the aot runtime user API should look like:

// Helper function to initialize a DLTensor
DLTensor TVMInitializeDLTensor(void *data, DLDataType* dtype, DLContext* ctx, int64_t* shape, int64_t num_dim);
 
// Helper function to run the `run_func` within the generated library network.o.
tvm_crt_error_t TVMRuntime_Run(tvm_model_t *model, DLTensor *inputs, int num_inputs, DLTensor *outputs, int num_outputs);
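
For illustration only (the shape, dtype and buffer are made up, and aot_runtime.h is the proposed header), initializing an input tensor with the helper could look like:

#include <aot_runtime.h>

static uint8_t input_data[1 * 3 * 224 * 224];
static int64_t input_shape[4] = {1, 3, 224, 224};

static DLTensor make_input(void) {
  DLDataType dtype = {kDLUInt, 8, 1};   // uint8
  DLContext ctx = {kDLCPU, 0};          // CPU, device 0
  return TVMInitializeDLTensor(input_data, &dtype, &ctx, input_shape, 4);
}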

Internal API

The API to handle memory during the network execution will be mostly internal and not exposed to the user. The idea is to assume those two constants are defined:

#define AOT_MEMORY_NUM_PAGES (1<<10)
#define AOT_MEMORY_PAGE_SIZE_LOG2 12

And use them to instantiate a static memory area (with the defaults above, (1<<10) pages of (1<<12) bytes each, i.e., 4 MiB). There are ongoing efforts to estimate the memory footprint of the graph directly from TVMC (see the MicroTVM roadmap).

Self contained example

To make things clearer, below is a more detailed example that shows (in pseudo-C) how we intend everything to fit together. Please note that the library is supposed to be compiled with target=c --runtime=aot --link-params.

Code generation

In this section let’s have a look at what TVM would generate.

operators.c / lib.c

This contains the operator bodies and the body of _lookup_linked_param.

// lib.c
// Linked param lookup function definition
void _lookup_linked_param(TVMValue *,...) {}
 
 
// Operators definition
void fused_layout_transform_2(TVMValue *,...) {}
void fused_layout_transform_1(TVMValue *,...) {}
void fused_nn_contrib_conv2d_NCHWc_right_shift_cast(TVMValue *,...) {}

network.c

This file contains the declarations of the operators and the definition of run_func.

// network.c
// Linked param lookup function declaration
void _lookup_linked_param(TVMValue *,...);
 
// Operators declaration
void fused_layout_transform_2(TVMValue *,...);
void fused_layout_transform_1(TVMValue *,...);
void fused_nn_contrib_conv2d_NCHWc_right_shift_cast(TVMValue *,...);
 
 
// Main tvm_run_func (generated by TVM, which lives inside the library lib.o (or lib.c))
TVM_DLL int32_t tvm_run_func(TVMValue* values, ...,  void* resource_handle) {
    void* sid_3 = TVMBackendAllocWorkspace(1, 0, 32768, 2, 8);
 
    // Call to the first operator. Note that values[1], the output of the network,
    // is being used as an intermediate tensor by fused_layout_transform_2
    TVMValue tensors_0[2] = { values[0], values[1] };
    (void)fused_layout_transform_2(tensors_0, 2);
 
    // Call to the second operator
    TVMValue p0;
    (void)_lookup_linked_param(2, &p0);
    DLTensor sid_3_tensor = {.data = (void*) sid_3, ...};
    TVMValue tensors_1[3] = {values[1], p0, {.v_handle = &sid_3_tensor}};
    (void)fused_nn_contrib_conv2d_NCHWc_right_shift_cast(tensors_1, 3);
 
    // Call to the third operator
    TVMValue tensors_2[2] = {{.v_handle = &sid_3_tensor}, values[1]};
    (void)fused_layout_transform_1(tensors_2, 2);
 
    return 0;
}
 
// Entry point wrapper (generated by TVM, also lives inside the library)
tvm_model_t network = {
    .run_func = tvm_run_func,
    .num_input_tensors = 1,
    .num_output_tensors = 1,
};

Memory management

In this section we illustrate what the memory management side of things will look like.

aot_platform.c

// aot_platform.c
#ifndef AOT_MEMORY_NUM_PAGES
#define AOT_MEMORY_NUM_PAGES (1<<10)
#endif
 
#ifndef AOT_MEMORY_PAGE_SIZE_LOG2
#define AOT_MEMORY_PAGE_SIZE_LOG2 12
#endif
 
static uint8_t page_size_log2 = AOT_MEMORY_PAGE_SIZE_LOG2;
static uint8_t g_aot_memory[AOT_MEMORY_NUM_PAGES * (1 << page_size_log2)];
static MemoryManagerInterface* g_memory_manager;
 
void* TVMBackendAllocWorkspace(int device_type, int device_id, uint64_t nbytes, int dtype_code_hint,
                               int dtype_bits_hint) {
  void* ptr = 0;
  DLContext ctx = {device_type, device_id};
  g_memory_manager->Allocate(g_memory_manager, nbytes, ctx, &ptr);
  return ptr;
}
 
int TVMBackendFreeWorkspace(int device_type, int device_id, void* ptr) {
  DLContext ctx = {device_type, device_id};
  return g_memory_manager->Free(g_memory_manager, ptr, ctx);
}
 
tvm_crt_error_t MemoryManagerCreate(MemoryManagerInterface** manager, uint8_t* memory_pool,
                                    size_t memory_pool_size_bytes, size_t page_size_bytes_log2) {
   // copied from crt
}

Shim layer exposed to the user

In this section we describe the shim interface layer used directly by the application.

aot_runtime.c

// aot_runtime.c
tvm_crt_error_t TVMRuntime_Run(tvm_model_t *model, DLTensor *inputs, int num_inputs, DLTensor *outputs, int num_outputs)
{
    MemoryManagerCreate(&g_memory_manager, g_aot_memory, sizeof(g_aot_memory), AOT_MEMORY_PAGE_SIZE_LOG2);
     
    TVMValue tvm_values[num_inputs + num_outputs];
    int i = 0;
    for (; i < num_inputs; i++){
        tvm_values[i].v_handle = &inputs[i];
    }
 
    for (; i < num_inputs + num_outputs; i++){
        tvm_values[i].v_handle = &outputs[i - num_inputs];
    }
 
    model->run_func(tvm_values, ...);
    return kTvmErrorNoError;
}

Main application and compilation

In this section we will describe what the end user would write and how the shim would be invoked.

main.c

This file represents the main application written by the user.

// main.c
#include <aot_runtime.h>
extern tvm_model_t network;
 
int main()
{
    DLTensor input = TVMInitializeDLTensor(..);
    DLTensor output = TVMInitializeDLTensor(..);
    DLTensor inputs[1] = {input};
    DLTensor outputs[1] = {output};
    TVMRuntime_Run(&network, inputs, 1, outputs, 1);
}

We can compile everything with a command similar to:

# Compilation
$ $(CC) main.c lib.c network.c aot_runtime.c aot_platform.c -DAOT_MEMORY_NUM_PAGES="(1<<12)"

Conclusions

In this RFC we outlined the different parts of our proposal. These can be grouped into two macro-areas:

  • Code generation We decided to generate a run_func function to issue calls into the operators in the library. The function won’t make use of the function registry or of any helper contained within the library network.o.
  • Runtime API We decided to provide a wrapper library (not generated) to be used by users to call into the main function and to create the necessary data structures to be passed to it.

Please share your thoughts/feedbacks!


@areusch @manupa-arm @Leo-arm @MarisaKirisame @monklof @jroesch @slyubomirsky @zhiics @ramana-arm @mjs

I notice you talk entirely about the graph runtime here, but I see no mention of the relay VM. Have you thought about how to include features from the relay vm in AOT (dynamic shape, dynamic control flow)? Also, I see mention in the tvm docs of a relay ahead-of-time compiler. I don’t know if this actually exists, but if it does, how does this AOT approach compare?


Hi @tkonolige,

for now we are not looking at the relay vm. This RFC is mostly an enabler for embedded environments where the json is prohibitive and the memory is a scarce resource.

I am not familiar with the relay vm, so I am not sure about the effort involved in supporting it.

About the relay ahead-of-time compiler, could you show me where it is mentioned in the docs? I had a look, and I believe it is cited as future work, so this RFC is actually describing what the doc names the “relay ahead-of-time compiler”.

hi @giuseros,

Thanks for posting this RFC! Implementing AOT runtime will be a great addition to µTVM and for other use cases. Here are some thoughts on the approach so far:

typedef struct {
  int32_t (*run_func)(void* args, void* arg_type_ids, int32_t num_args, void* out_ret_value, void* out_ret_tcode, void* resource_handle);
  uint32_t num_input_tensors;
  uint32_t num_output_tensors;
  int8_t use_accelerator;
} tvm_model_t;

…

  • The boolean use_accelerator can be used in order to populate the resource_handle variable in case we need to pass OS specific resources down to the operators.

Would it be possible to do a first cut omitting accelerator support? I would like to contemplate how to configure accelerator instances, which I think should somewhat match the way we configure GraphRuntime (I.e. supply configuration data per-TVMContext). I also think we should consider how to supply non-NULL resource_handle, if this is needed in your BYOC. I think we may need some more motivating examples, and I’m not convinced a global flag would cut it here. Perhaps best to consider this in a separate RFC? I also have a related RFC I’ll be releasing around compiler output shortly, which may help here.

Please note that we don’t need to specify --system-lib anymore, since the system library won’t be included in the generated library.

It almost seems like this could be orthogonal to AOT–you could create an AOT module with --system-lib, but you don’t have to.

Unpacked calls

Our code generator would issue tir.extern calls, manually packing/unpacking the arguments for the different operators contained in the library (very similar to what happens in the lower_builtin pass). In this way, we are de facto bypassing the function registry.

When only the c or llvm code generator is in use (guaranteed true when BYOC isn’t in use) and the C runtime is used, then the C names of generated functions are controlled by CodegenC. In this case, it’s possible to call them directly with tir.call_extern. When targeting the C++ runtime, it’s a different story:

  • AOT would live in a tree of runtime::Module
  • Each runtime::Module is consulted in a tree DFS to find PackedFunc linked into the library
  • TVMBackendGetFuncFromEnv exists to help with this lookup

The FuncRegistry in the C runtime is meant to replace this tree lookup with a single function table. I think you’re right that it’s more important in the GraphRuntime or RPC case, but considering we would like to also target the C++ runtime, perhaps it would be good to start with tir.call_packed, and we could consider a follow-on to move to tir.call_extern for C runtime use case, if needed?
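
As a rough sketch (the operator name is made up, and the arguments are assumed to have been packed already), the late-bound lookup through the environment looks like this today:

#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/c_backend_api.h>

// Sketch: the "late-binding" path used when the operator's location is not known
// statically; the name is looked up in the module tree at runtime.
static void call_via_env(void* resource_handle, TVMValue* values, int* tcodes) {
  TVMValue ret_value;
  int ret_tcode;
  TVMFunctionHandle f;
  TVMBackendGetFuncFromEnv(resource_handle, "fused_conv2d", &f);
  TVMFuncCall(f, values, tcodes, 3, &ret_value, &ret_tcode);
}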

User API

I like that this runtime looks quite minimal. However, there is a separate Module-based model runtime interface RFC we should consider as well. In particular, this interface splits apart the setup (e.g. memory allocation) and run phases of inference. It would be great to see if we could implement this interface with AOT, either here or with runtime shims; or, whether changes to that interface would make that possible.

I do think in particular that avoiding the need to copy data to SetInput is a good thing, and that may not be contained within that interface. However, some broader changes could be made when implementing it in C, particularly around memory management.

The idea is to assume those two constants are defined:

#define AOT_MEMORY_NUM_PAGES (1<<10)
#define AOT_MEMORY_PAGE_SIZE_LOG2 12

And use them to instantiate a static memory area.

Could you speak a bit more about how you want to handle memory allocation in the initial implementation? Which DLTensor would need to be allocated from within the generated code? Who defines these constants?

Please share your thoughts/feedbacks!

One other thing:

  • Would the generated TIR AOT module be the runtime::Module instance (either LLVMModule or CSourceModule) returned from tvm.relay.build?

About the relay ahead-of-time compiler, could you show me where is it mentioned in the docs? I had a look at it, and I believe it is cited as future work, so this RFC is actually describing what in the doc is named “relay ahead-of-time compiler”.

This is the old AOT compiler, which lowers (front end) Relay code into C++, though it still calls into TVM’s runtime for operators. Is that what you had in mind?

Hi @areusch ,

Thanks for your comments! Before replying in-line, let me first clarify two things:

  • We are not changing the C runtime or the C++ runtime. We are creating a parallel runtime, namely AOT, which will live in src/runtime/aot. The user will specify --runtime=aot to access this runtime.
  • We are mainly targeting embedded scenarios, for now. Indeed, while for other environments the AOT is nice-to-have, for embedded platforms this is a must-have.

That said, let me reply to your points

Would it be possible to do a first cut omitting accelerator support?

Yes, this is fine. We can omit the boolean value for now, and work on this at a later stage. The main point, as you correctly spotted, is to understand how to populate the resource_handle in the call to the run_func

It almost seems like this could be orthogonal to AOT–you could create an AOT module with --system-lib, but you don’t have to.

Yes, this is correct, but since we are trying not to use packed calls to the functions, I am wondering why we would need to add it to the library. In other words, given we use tir.call_extern, why do you think we need a mapping [string -> function pointer] in the library?

The FuncRegistry in the C runtime is meant to replace this tree lookup with a single function table. I think you’re right that it’s more important in the GraphRuntime or RPC case, but considering we would like to also target the C++ runtime, perhaps it would be good to start with tir.call_packed, and we could consider a follow-on to move to tir.call_extern for C runtime use case, if needed?`

From what I understood, the CodegenC path is used if we specify the c back-end, independently of the runtime. Also, independently of the runtime, all the operators will live in the same library. The only difference, when we specify --runtime=aot, is that we will have an additional function, namely run_func, which contains a series of calls like:

rv = fused_nn_contrib_conv2d_NCHWc_right_shift_cast(subcall_values, subcall_tcodes, 3 , &subcall_ret_value, &subcall_ret_tcode, NULL);

This will compile fine, since fused_nn_contrib_conv2d_NCHWc_right_shift_cast will live in the same translation unit, i.e., lib.o or lib.c (I am trying to avoid the word “module” here to avoid confusion with the TVM modules). To be absolutely clear, let’s consider this code:

lib = tvm.relay.build(mod, target, params=params)
lib.lib.save('lib.o') # lib.lib.save('lib.c') if codegen target is c

If I execute nm lib.o I see that the functions are all there. I understand that in the JSON case we need a way to translate a string from the JSON to a function call in the library, and to achieve that translation (without dlopen) we need a function table embedded in the library. Since we are getting rid of the JSON, I don’t think we need this mapping any more.

About the RPC case, for now the main AOT requirement is deployability. To tune a given board we will stick with the C runtime, at least for now.

I like that this runtime looks quite minimal. However, there is a separate Module-based model runtime interface we should consider as well. In particular, this interface splits apart the setup (e.g. memory allocation) and run phases of inference. It would be great to see if we could implement this interface with AOT, either here or with runtime shims; or, whether changes to that interface would make that possible. I do think in particular that avoiding the need to copy data to SetInput is a good thing, and that may not be contained within that interface. However, some broader changes could be made when implementing it in C, particularly around memory management.

I did read that RFC, and this was my reasoning:

  • We are trying here to implement the basics of AOT. The main part will be in the code generation. As for the interface, we thought to propose a very minimal interface within a shim layer so that the user can easily deploy the network on an embedded device.
  • Once we get this right, we can implement more complex interfaces within aot_runtime.h, and those interfaces can be offered to the user in the form of the Module-based interface or any other interface. The main thing here is to move the control code inside the library and deliver the minimal API to use it.

Could you speak a bit more about how you want to handle memory allocation in the initial implementation? Which DLTensor would need to be allocated from within the generated code? Who defines these constants?

Sure. For now we will essentially be using the crt memory allocator, but as a copy living inside src/runtime/aot/. This is because the main aim of this RFC is to bring in AoT compilation; later on we can take further steps to improve or provide “helper” allocators that are better than what is in crt.

So there will be a preallocated, statically initialized buffer (whose size can default to some value, but can be changed manually by the user), and functions like TVMBackendAllocWorkspace will work on that buffer. The constants I mention concern the size of this buffer, and they can be preset or provided directly by the user. At a later date this will need to be removed, as the compiler should automatically compute the exact static size of the buffer it needs.

As for the DLTensors:

  • For the intermediate tensors we internally allocate through TVMBackendAllocWorkspace and then wrap the allocated memory in DLTensors (in the same spirit as lower_builtin.h); see the sketch after this list.
  • For the I/O tensors the user initializes the input/output buffers and wraps them in DLTensors with a call to TVMInitializeDLTensor.
  • For the params, we are linking them in. So we would call _lookup_linked_param (still through an extern call) to get hold of the parameters.
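
As a rough sketch of the first two bullets (the shape and dtype are made up), wrapping a workspace allocation in a DLTensor could look like:

#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/c_backend_api.h>

// Sketch: allocate an intermediate buffer from the AOT memory arena and wrap it
// in a DLTensor so it can be passed to the next operator call.
static DLTensor wrap_intermediate(int64_t* shape /* 4 dims, static storage */) {
  void* sid = TVMBackendAllocWorkspace(kDLCPU, 0,
                                       1 * 8 * 32 * 32 * sizeof(int8_t), 0, 8);
  DLTensor t;
  t.data = sid;
  t.ctx = (DLContext){kDLCPU, 0};
  t.ndim = 4;
  t.dtype = (DLDataType){kDLInt, 8, 1};
  t.shape = shape;
  t.strides = NULL;
  t.byte_offset = 0;
  return t;
}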

I modified the strawman image (in the RFC) with a proper self-contained example to show the overall flow. Please, let me know if that explains things more clearly.

Would the generated TIR AOT module be the runtime::Module instance (either LLVMModule or CSourceModule) returned from tvm.relay.build?`

I was thinking to have a separate module AOTModule that will import the different modules within it. This is in the same spirit of the Metadata module. As we use the metadata module to share the Function Registry among the different TVM modules, we will use the AOTModule to share the run_func among different TVM modules

Hi @slyubomirsky ,

Thanks for the pointer. However, this is a parallel approach in which you would generate the json and use a python script to “translate” it to C.

We also evaluated this approach (which is, correct me if I am wrong, not yet integrated in TVM). It would sort out some issues, like getting rid of the JSON, but would leave the door to unified memory planning closed.

That’s why we think that going Relay->TIR->{C, LLVM} would make the most sense. Not only do we get rid of the Json, but we also open a possible pathway to a unified memory planner, since everything would be just TIR.

Please, let me know what you think, Giuseppe

@giuseros thanks for your reply! I think this approach makes sense to me–I want to clarify a few more things.

First, we have unfortunately overloaded the word “runtime.” There are 2 different families of runtimes:

  • c and c++ runtime – describes the implementation of c_runtime_api.h and c_backend_api.h.
  • graph, vm, aot runtime – describes how the operator functions are invoked in a model. Eventually, this could be stated similarly to the above as “describes the implementation of the module-based model interface.” It should really be called GraphExecutor or something, but that’s another topic.

I am actually going to send an RFC to propose we rename GraphRuntime and family to e.g. GraphExecutor this week.

for AOT runtime I agree we do not need JSON parsing or any of the underlying facilities it brings. However, given it seems like you’re planning to reuse the C-runtime memory allocator and interfaces in include/tvm/crt/platform.h, I think it would be great to continue using --runtime=c in the target string and create an additional flag or other tvm.relay.build() argument. I don’t know that the (graph) runtime specification belongs in the Target string.

The main point, as you correctly spotted, is to understand how to populate the resource_handle in the call to the run_func

Could you say why you need this set? Currently it’s always NULL. I think it would be great to develop a pattern to use it, but right now the most natural pattern is to set it to the TVMModule instance that contains the operator function.

Since we are getting rid of the JSON, I don’t think we need this mapping any more.

A couple of thoughts:

  1. It would be nice to keep the logic for assembling PackedFunc args and handling return values in tir.call_packed. This way if we change the interface, we don’t have to look in too many places.
  2. Mainly I’m trying to make sure that to simplify the compiler, we implement the same conceptual TIR on both C++ and C runtimes. In the C++ runtime, we use PackedFunc as a “calling convention” to avoid needing to effectively hardcode C in various code generators. For instance, when dispatching to a compute library e.g. CUDA, a PackedFunc serves as a sort of adapter glue layer between TVM and CUDA.
  3. In the C++ runtime, not all PackedFunc live in the same runtime::Module. So, we need the string lookup to do a sort of “late-binding.” In the C runtime, you’re right that the primary use case for this late-binding is with the RPC server. Perhaps we should just change CodeGenC and CodeGenLLVM to implement tir.call_packed when targeting C runtime by calling the symbol directly with the PackedFunc API instead of invoking TVMBackendGetFuncFromEnv. Would this address your concerns?
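
As a rough sketch of the difference (the operator name is made up, and the argument packing is assumed to have happened already), the two lowerings of a tir.call_packed would look roughly like this:

#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/c_backend_api.h>

// Hypothetical operator symbol with the packed-function signature.
int32_t fused_conv2d(TVMValue*, int*, int, TVMValue*, int*, void*);

static void call_both_ways(void* resource_handle, TVMValue* values, int* tcodes) {
  TVMValue ret_value;
  int ret_tcode;

  // (a) Today's lowering for the C++ runtime: late-bound lookup by name.
  TVMFunctionHandle f;
  TVMBackendGetFuncFromEnv(resource_handle, "fused_conv2d", &f);
  TVMFuncCall(f, values, tcodes, 3, &ret_value, &ret_tcode);

  // (b) Proposed lowering for the C runtime: call the symbol directly,
  //     keeping the same calling convention.
  fused_conv2d(values, tcodes, 3, &ret_value, &ret_tcode, NULL);
}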

The main thing here is to move the control code inside the library, and deliver the minimal API to use it

Ok, that makes sense.

I modified the strawman image (in the RFC) with a proper self-contained example to show the overall flow. Please, let me know if that explains things more clearly.

Yeah this makes sense. Sounds good to me.

I was thinking to have a separate module AOTModule that will import the different modules within it.

That also makes sense. I think my question was poorly worded before. Just confirming that, similar to MetadataModule, this would be lib, in the return value from graph_json, lib, params = tvm.relay.build()? At present, those things are wrapped in GraphRuntimeFactoryModule, and we’ll need to address that. I have another RFC forthcoming in a week or so to discuss changes there designed to support µTVM and accelerator use cases.

Hi @areusch ,

Thanks for the interesting reply! I am going to be off tomorrow, so I will probably think about your reply over the (long) weekend and get back to you early next week.

Thanks, Giuseppe

I agree that going through TIR is a better way and will definitely allow for finer-grained control.

Hi Andrew,

for AOT runtime I agree we do not need JSON parsing or any of the underlying facilities it brings. However, given it seems like you’re planning to reuse the C-runtime memory allocator and interfaces in include/tvm/crt/platform.h, I think it would be great to continue using --runtime=c in the target string and create an additional flag or other tvm.relay.build() argument. I don’t know that the (graph) runtime specification belongs in the Target string.

Thanks for this clarification. Yes, this interface is fine for now. As for the implementation, we will have aot_runtime.h in a separate src/runtime/aot folder which will #include the crt memory manager from src/runtime/crt, for now. In the future we will make a memory manager specific to AOT (possibly code-generating information like the required memory to run the network).

Could you say why you need this set? Currently it’s always NULL. I think it would be great to develop a pattern to use it, but right now the most natural pattern is to set it to the TVMModule instance that contains the operator function.

So the short answer is that we don’t have a clear idea yet. But we were hoping to actually develop a pattern to use it, as you suggest. That’s something, though, that I think deserves a separate and more detailed discussion 🙂

  1. It would be nice to keep the logic for assembling PackedFunc args and handling return values in tir.call_packed. This way if we change the interface, we don’t have to look in too many places.
  2. Mainly I’m trying to make sure that to simplify the compiler, we implement the same conceptual TIR on both C++ and C runtimes. In the C++ runtime, we use PackedFunc as a “calling convention” to avoid needing to effectively hardcode C in various code generators. For instance, when dispatching to a compute library e.g. CUDA, a PackedFunc serves as a sort of adapter glue layer between TVM and CUDA.
  3. In the C++ runtime, not all PackedFunc live in the same runtime::Module. So, we need the string lookup to do a sort of “late-binding.” In the C runtime, you’re right that the primary use case for this late-binding is with the RPC server. Perhaps we should just change CodeGenC and CodeGenLLVM to implement tir.call_packed when targeting C runtime by calling the symbol directly with the PackedFunc API instead of invoking TVMBackendGetFuncFromEnv. Would this address your concerns?

Yes, I like this approach. Basically we get rid of the system library in c, but not of the dynamic system library in c++ (where it is probably less of an issue). This means this work could possibly be extended to support the c++ runtime in the future.

That also makes sense. I think my question was poorly worded before. Just confirming that, similar to MetadataModule, this would be lib, in the return value from graph_json, lib, params = tvm.relay.build()? At present, those things are wrapped in GraphRuntimeFactoryModule, and we’ll need to address that. I have another RFC forthcoming in a week or so to discuss changes there designed to support µTVM and accelerator use cases.

Yes, this is exactly what I meant. I am looking forward to the RFC!

Thanks,

Giuseppe