hi @giuseros,
Thanks for posting this RFC! Implementing AOT runtime will be a great addition to µTVM and for other use cases. Here are some thoughts on the approach so far:
```c
typedef struct {
  int32_t (*run_func)(void* args, void* arg_type_ids, int32_t num_args,
                      void* out_ret_value, void* out_ret_tcode,
                      void* resource_handle);
  uint32_t num_input_tensors;
  uint32_t num_output_tensors;
  int8_t use_accelerator;
} tvm_model_t;
```
…
- The boolean `use_accelerator` can be used in order to populate the `resource_handle` variable in case we need to pass OS specific resources down to the operators.
Would it be possible to do a first cut omitting accelerator support? I would like to contemplate how to configure accelerator instances, which I think should somewhat match the way we configure GraphRuntime (i.e. supply configuration data per-TVMContext). I also think we should consider how to supply a non-NULL `resource_handle`, if this is needed in your BYOC. I think we may need some more motivating examples, and I'm not convinced a global flag would cut it here. Perhaps it's best to consider this in a separate RFC? I also have a related RFC I'll be releasing shortly around compiler output, which may help here.
Please note that we don't need to specify `--system-lib` anymore, since the system library won't be included in the generated library.
It almost seems like this could be orthogonal to AOT: you could create an AOT module with `--system-lib`, but you don't have to.
Unpacked calls
Our code generator would issue `tir.extern` calls, manually packing/unpacking the arguments for the different operators contained in the library (very similar to what happens in the `lower_builtin` pass). In this way, we are de facto bypassing the function registry.
When only the `c` or `llvm` code generator is in use (guaranteed true when BYOC isn't in use) and the C runtime is used, then the C names of generated functions are controlled by CodegenC. In this case, it's possible to call them directly with `tir.call_extern`. When targeting the C++ runtime, it's a different story:
- AOT would live in a tree of `runtime::Module`s
- Each `runtime::Module` is consulted in a tree DFS to find PackedFuncs linked into the library; `TVMBackendGetFuncFromEnv` exists to help with this lookup
The FuncRegistry in the C runtime is meant to replace this tree lookup with a single function table. I think you're right that it's more important in the GraphRuntime or RPC case, but considering we would like to also target the C++ runtime, perhaps it would be good to start with `tir.call_packed`, and we could consider a follow-on to move to `tir.call_extern` for the C runtime use case, if needed?
User API
I like that this runtime looks quite minimal. However, there is a separate Module-based model runtime interface RFC we should consider as well. In particular, this interface splits apart the setup (e.g. memory allocation) and run phases of inference. It would be great to see if we could implement this interface with AOT, either here or with runtime shims, or whether changes to that interface would make that possible.
I do think in particular that avoiding the need to copy data to `SetInput` is a good thing, and that may not be contained within that interface. However, some broader changes could be made when implementing it in C, particularly around memory management.
The idea is to assume those two constants are defined:
```c
#define AOT_MEMORY_NUM_PAGES (1 << 10)
#define AOT_MEMORY_PAGE_SIZE_LOG2 12
```
And use them to instantiate a static memory area.
Could you speak a bit more about how you want to handle memory allocation in the initial implementation? Which DLTensors would need to be allocated from within the generated code? Who defines these constants?
Please share your thoughts/feedback!
One other thing:
- Would the generated TIR AOT module be the `runtime::Module` instance (either LLVMModule or CSourceModule) returned from `tvm.relay.build`?