Hi @areusch ,
Thanks for your comments! Before replying in-line, let me first clarify two things:
- We are not changing the C runtime or the C++ runtime. We are creating a parallel runtime, namely AOT, which will live in `src/runtime/aot`. The user will specify `--runtime=aot` to access this runtime.
- We are mainly targeting embedded scenarios, for now. Indeed, while for other environments AOT is nice-to-have, for embedded platforms it is a must-have.
That said, let me reply to your points.
> Would it be possible to do a first cut omitting accelerator support?
Yes, this is fine. We can omit the boolean value for now and work on this at a later stage. The main point, as you correctly spotted, is to understand how to populate the `resource_handle` in the call to the `run_func`.
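For reference, this is (as far as I can tell) the packed C function signature from `include/tvm/runtime/c_backend_api.h`; the `resource_handle` we need to populate is its last argument:

```c
// From include/tvm/runtime/c_backend_api.h: the signature of the functions
// emitted by the codegen. resource_handle is the slot we still have to
// decide how to fill for accelerator support.
typedef int (*TVMBackendPackedCFunc)(TVMValue* args, int* type_codes, int num_args,
                                     TVMValue* out_ret_value, int* out_ret_tcode,
                                     void* resource_handle);
```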
> It almost seems like this could be orthogonal to AOT: you could create an AOT module with `--system-lib`, but you don't have to.
Yes, this is correct, but since we are trying not to use packed calls to the functions, I am wondering why we would need to add it to the library. In other words, given that we use `tir.call_extern`, why do you think we need a mapping [string -> function pointer] in the library?
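To make the difference concrete, here is a rough sketch of what the two lowering strategies produce (an illustrative fragment, not actual codegen output; `mod_node` and the subcall buffers would be set up by the surrounding generated code):

```c
TVMValue subcall_values[3];
int subcall_tcodes[3];
TVMValue subcall_ret_value;
int subcall_ret_tcode;
void* mod_node = NULL;  // provided by the runtime in real generated code

// tir.call_packed: the function is resolved by name at runtime through the
// module's function table -- this is what needs the [string -> pointer] map.
TVMFunctionHandle fh;
TVMBackendGetFuncFromEnv(mod_node, "fused_nn_contrib_conv2d_NCHWc_right_shift_cast", &fh);
TVMFuncCall(fh, subcall_values, subcall_tcodes, 3, &subcall_ret_value, &subcall_ret_tcode);

// tir.call_extern: a direct symbol reference, resolved by the linker at build
// time, so no runtime lookup table is needed.
fused_nn_contrib_conv2d_NCHWc_right_shift_cast(subcall_values, subcall_tcodes, 3,
                                               &subcall_ret_value, &subcall_ret_tcode, NULL);
```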
> The FuncRegistry in the C runtime is meant to replace this tree lookup with a single function table. I think you're right that it's more important in the GraphRuntime or RPC case, but considering we would like to also target the C++ runtime, perhaps it would be good to start with `tir.call_packed`, and we could consider a follow-on to move to `tir.call_extern` for the C runtime use case, if needed?
From what I understood, the `CodegenC` path is used if we specify the `c` back-end, independently of the runtime. Also independently of the runtime, all the operators will live in the same library. The only difference when we specify `--runtime=aot` is that we will have an additional function, namely `run_func`, which contains a series of calls like:
```c
rv = fused_nn_contrib_conv2d_NCHWc_right_shift_cast(subcall_values, subcall_tcodes, 3,
                                                    &subcall_ret_value, &subcall_ret_tcode, NULL);
```
This will compile fine, since `fused_nn_contrib_conv2d_NCHWc_right_shift_cast` will live in the same translation unit, i.e., lib.o or lib.c (I am trying to avoid the word "module" here so as not to create confusion with the TVM modules). To be absolutely clear, let's consider this code:
```python
lib = tvm.relay.build(mod, target, params=params)
lib.lib.save('lib.o')  # lib.lib.save('lib.c') if the codegen target is c
```
If I execute `nm lib.o`, I see that the functions are all there. I understand that in the JSON case we need a way to translate a string from the JSON into a function call in the library, and to achieve that translation (without `dlopen`) we need a function table embedded in the library. Since we are getting rid of the JSON, I don't think we need this mapping any more.
As for the RPC case: the main AOT requirement for now is deployability. To tune a given board we will stick with the C runtime, at least initially.
> I like that this runtime looks quite minimal. However, there is a separate Module-based model runtime interface we should consider as well. In particular, this interface splits apart the setup (e.g. memory allocation) and run phases of inference. It would be great to see if we could implement this interface with AOT, either here or with runtime shims; or, whether changes to that interface would make that possible. I do think in particular that avoiding the need to copy data to SetInput is a good thing, and that may not be contained within that interface. However, some broader changes could be made when implementing it in C, particularly around memory management.
I did read that RFC, and this was my reasoning:
- We are trying here to implement the basics of AOT. The main part will be in the code generation. As for the interface, we thought we would propose a very minimal one within a shim layer, so that the user can easily deploy the network on an embedded device (see the sketch after this list).
- Once we get this right, we can implement more complex interfaces within `aot_runtime.h`, and those interfaces can be offered to the user in the form of the Module-based interface or any other interface. The main thing here is to move the control code inside the library and deliver the minimal API needed to use it.
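Just to give an idea of the scale of the shim, a minimal sketch (every name below is a hypothetical placeholder, not a committed interface):

```c
// aot_runtime.h -- hypothetical minimal shim over the generated run_func.
#include <stddef.h>
#include <stdint.h>
#include <tvm/runtime/c_runtime_api.h>  // for DLTensor

// Point the runtime at a user-provided workspace buffer.
int tvm_aot_init(uint8_t* workspace, size_t workspace_size_bytes);

// Run one inference. Inputs/outputs are caller-owned DLTensors, so no copy
// into a SetInput-style staging area is required.
int tvm_aot_run(DLTensor* inputs, size_t num_inputs,
                DLTensor* outputs, size_t num_outputs);
```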
> Could you speak a bit more about how you want to handle memory allocation in the initial implementation? Which DLTensor would need to be allocated from within the generated code? Who defines these constants?
Sure. For now we will essentially be using the CRT memory allocator, but as a copy living inside `src/runtime/aot/`. This is because the main scope of this RFC is to bring in AoT compilation; later on we can take further steps to improve on it or provide "helper" allocators that are better than what is in the CRT.
So there will be a preallocated, statically initialized buffer (whose size can default to some value, but can be changed manually by the user), and functions like `TVMBackendAllocWorkspace` will work on that buffer. The constants I mention concern the size of this buffer, which can be preset or provided directly by the user. At a later date this will need to go away, as the compiler should automatically compute the total static size of the buffer it needs.
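To illustrate the idea, a toy sketch (the buffer name and the size macro are made up, and a real allocator would track allocations so that `TVMBackendFreeWorkspace` can release them; only the `TVMBackendAllocWorkspace` signature is taken from `c_backend_api.h`):

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical compile-time default; the user can override it at build time.
#ifndef TVM_AOT_WORKSPACE_SIZE_BYTES
#define TVM_AOT_WORKSPACE_SIZE_BYTES (64 * 1024)
#endif

// Statically allocated pool backing all workspace requests.
static uint8_t g_aot_workspace[TVM_AOT_WORKSPACE_SIZE_BYTES];
static size_t g_aot_workspace_used = 0;

// Toy bump allocator standing in for the CRT-derived one (no alignment, no
// free-list), just to show where the generated code's requests would land.
void* TVMBackendAllocWorkspace(int device_type, int device_id, uint64_t nbytes,
                               int dtype_code_hint, int dtype_bits_hint) {
  (void)device_type; (void)device_id; (void)dtype_code_hint; (void)dtype_bits_hint;
  if (g_aot_workspace_used + nbytes > TVM_AOT_WORKSPACE_SIZE_BYTES) return NULL;
  void* ptr = &g_aot_workspace[g_aot_workspace_used];
  g_aot_workspace_used += (size_t)nbytes;
  return ptr;
}
```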
As for the DLTensors:
- For the intermediate tensors, we internally allocate through `TVMBackendAllocWorkspace` and then wrap the allocated memory in DLTensors (in the same spirit as lower_builtin.h).
- For the I/O tensors, the user initializes the input/output buffers and wraps them in DLTensors with a call to `TVMInitializeDLTensor` (a sketch of this case follows below).
- For the params, we are linking them in. So we would call `_lookup_linked_param` (still through an extern call) to get hold of the parameters.
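For the I/O case, a minimal sketch of the wrapping, done by hand against the DLPack struct (the shapes are made up, the field is named `ctx` rather than `device` in older DLPack headers, and `TVMInitializeDLTensor` would essentially be a convenience wrapper around this):

```c
#include <dlpack/dlpack.h>

// User-owned input buffer, e.g. filled from a sensor.
static float input_data[1 * 3 * 224 * 224];
static int64_t input_shape[] = {1, 3, 224, 224};

// Wrap the raw buffer in a DLTensor without copying it.
static DLTensor input = {
    .data = input_data,
    .device = {kDLCPU, 0},
    .ndim = 4,
    .dtype = {kDLFloat, 32, 1},  // float32, one lane
    .shape = input_shape,
    .strides = NULL,             // NULL means compact row-major layout
    .byte_offset = 0,
};
```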
I modified the strawman image (in the RFC) with a proper self-contained example to show the overall flow. Please let me know if that explains things more clearly.
> Would the generated TIR AOT module be the runtime::Module instance (either LLVMModule or CSourceModule) returned from tvm.relay.build?
I was thinking of having a separate module, AOTModule, which will import the different modules within it. This is in the same spirit as the Metadata module: just as we use the metadata module to share the Function Registry among the different TVM modules, we will use the AOTModule to share the `run_func` among the different TVM modules.