Implementing AOT in TVM

@giuseros thanks for your reply! I think this approach makes sense to me. I want to clarify a few more things.

First, we have unfortunately overloaded the word “runtime.” There are 2 different families of runtimes:

  • c and c++ runtime – describes the implementation of c_runtime_api.h and c_backend_api.h.
  • graph, vm, aot runtime – describes how the operator functions are invoked in a model. eventually, could be stated similarly to the above as “describes the implementation of the module-based model interface.” should really be called GraphExecutor or something, but that’s another topic.

I am actually going to send an RFC to propose we rename GraphRuntime and family to e.g. GraphExecutor this week.

For the AOT runtime, I agree we do not need JSON parsing or any of the underlying facilities it brings. However, given it seems like you’re planning to reuse the C-runtime memory allocator and interfaces in include/tvm/crt/platform.h, I think it would be great to continue using --runtime=c in the target string and create an additional flag or other tvm.relay.build() argument. I don’t know that the (graph) runtime specification belongs in the Target string.

The main point, as you correctly spotted, is to understand how to populate the resource_handle in the call to the run_func

Could you say why you need this set? Currently it’s always NULL. I think it would be great to develop a pattern to use it, but right now the most natural pattern is to set it to the TVMModule instance that contains the operator function.

Since we are getting rid of the JSON, I don’t think we need this mapping any more.

A couple of thoughts:

  1. It would be nice to keep the logic for assembling PackedFunc args and handling return values in tir.call_packed. This way if we change the interface, we don’t have to look in too many places.
  2. Mainly, to simplify the compiler, I’m trying to make sure we implement the same conceptual TIR on both the C++ and C runtimes. In the C++ runtime, we use PackedFunc as a “calling convention” to avoid needing to effectively hardcode C in various code generators. For instance, when dispatching to a compute library, e.g. CUDA, a PackedFunc serves as a sort of adapter glue layer between TVM and CUDA.
  3. In the C++ runtime, not all PackedFunc live in the same runtime::Module. So, we need the string lookup to do a sort of “late-binding.” In the C runtime, you’re right that the primary use case for this late-binding is with the RPC server. Perhaps we should just change CodeGenC and CodeGenLLVM to implement tir.call_packed when targeting C runtime by calling the symbol directly with the PackedFunc API instead of invoking TVMBackendGetFuncFromEnv. Would this address your concerns?
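
To make (3) concrete, here is a rough C sketch of the two call styles; fused_add is a hypothetical operator symbol, and error handling is elided:

```c
#include <tvm/runtime/c_backend_api.h>
#include <tvm/runtime/c_runtime_api.h>

/* Today: late-binding. Generated code looks the callee up by name at
 * runtime, which pulls in the string-based function registry. */
static int call_via_env(void* mod_node, TVMValue* args, int* type_codes, int num_args) {
  TVMFunctionHandle fn;
  if (TVMBackendGetFuncFromEnv(mod_node, "fused_add", &fn) != 0) return -1;
  TVMValue ret_val;
  int ret_tcode;
  return TVMFuncCall(fn, args, type_codes, num_args, &ret_val, &ret_tcode);
}

/* Proposed for the C runtime: call the symbol directly, still using the
 * PackedFunc argument encoding, so no lookup table is needed. */
int32_t fused_add(TVMValue* args, int* type_codes, int num_args,
                  TVMValue* out_ret_value, int* out_ret_tcode, void* resource_handle);

static int call_direct(TVMValue* args, int* type_codes, int num_args) {
  TVMValue ret_val;
  int ret_tcode;
  return fused_add(args, type_codes, num_args, &ret_val, &ret_tcode, NULL);
}
```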

The main thing here is to move the control code inside the library, and deliver the minimal API to use it

Ok, that makes sense.

I modified the strawman image (in the RFC) with a proper self-contained example to show the overall flow. Please let me know if that explains things more clearly.

Yeah this makes sense. Sounds good to me.

I was thinking of having a separate module, AOTModule, that will import the different modules within it.

That also makes sense. I think my question was poorly worded before. Just confirming that, similar to MetadataModule, this would be lib in the return value from graph_json, lib, params = tvm.relay.build()? At present, those things are wrapped in GraphRuntimeFactoryModule, and we’ll need to address that. I have another RFC forthcoming in a week or so to discuss changes there designed to support µTVM and accelerator use cases.

Hi @areusch,

Thanks for the interesting reply! I am going to be off tomorrow, so I will probably think about your reply over the (long) weekend and get back to you early next week.

Thanks, Giuseppe

I agree that going through TIR is a better way and will definitely allow for finer-grained control.

Hi Andrew,

For the AOT runtime, I agree we do not need JSON parsing or any of the underlying facilities it brings. However, given it seems like you’re planning to reuse the C-runtime memory allocator and interfaces in include/tvm/crt/platform.h, I think it would be great to continue using --runtime=c in the target string and create an additional flag or other tvm.relay.build() argument. I don’t know that the (graph) runtime specification belongs in the Target string.

Thanks for this clarification. Yes, this interface is fine for now. About the implementation, we will have aot_runtime.h in a separate src/runtime/aot folder, which for now will #include the CRT memory manager from src/runtime/crt. In the future we will make a memory manager specific to AOT (possibly code-generating information such as the memory required to run the network).

Could you say why you need this set? Currently it’s always NULL. I think it would be great to develop a pattern to use it, but right now the most natural pattern is to set it to the TVMModule instance that contains the operator function.

So the short answer is that we don’t have a clear idea yet. But we were hoping to actually develop a pattern to use it, as you suggest. That is, though, something I think deserves a separate and more detailed discussion :slight_smile:

  1. It would be nice to keep the logic for assembling PackedFunc args and handling return values in tir.call_packed. This way if we change the interface, we don’t have to look in too many places.
  2. Mainly, to simplify the compiler, I’m trying to make sure we implement the same conceptual TIR on both the C++ and C runtimes. In the C++ runtime, we use PackedFunc as a “calling convention” to avoid needing to effectively hardcode C in various code generators. For instance, when dispatching to a compute library, e.g. CUDA, a PackedFunc serves as a sort of adapter glue layer between TVM and CUDA.
  3. In the C++ runtime, not all PackedFunc live in the same runtime::Module. So, we need the string lookup to do a sort of “late-binding.” In the C runtime, you’re right that the primary use case for this late-binding is with the RPC server. Perhaps we should just change CodeGenC and CodeGenLLVM to implement tir.call_packed when targeting C runtime by calling the symbol directly with the PackedFunc API instead of invoking TVMBackendGetFuncFromEnv. Would this address your concerns?

Yes, I like this approach. Basically we get rid of the system library in C, but not of the dynamic system library in C++ (where it is probably less of an issue). This means this work could possibly be extended to support the C++ runtime in the future.

That also makes sense. I think my question was poorly worded before. Just confirming that, similar to MetadataModule, this would be lib, in the return value from graph_json, lib, params = tvm.relay.build()? At present, those things are wrapped in GraphRuntimeFactoryModule, and we’ll need to address that. I have another RFC forthcoming in a week or so to discuss changes there designed to support µTVM and accelerator use cases.

Yes, this is exactly what I meant. I am looking forward to the RFC!

Thanks,

Giuseppe

hi @giuseros,

About the implementation, we will have aot_runtime.h in a separate src/runtime/aot folder

Would it be possible to create just a library e.g. src/runtime/crt/aot_executor? This will make things less complicated when the C runtime is distributed with a TVM wheel.

So the short answer is that we don’t have a clear idea yet. But we were hoping to actually develop a pattern to use it, as you suggest. That is, though, something I think deserves a separate and more detailed discussion :slight_smile:

Okay that seems reasonable. I think there are definitely some good use cases for resource_handle, but want to make sure the abstraction is at the right level.

Basically we get rid of the system library in C, but not of the dynamic system library in C++ (where it is probably less of an issue). This means this work could possibly be extended to support the C++ runtime in the future.

Yeah I think having a few implementations of tir.call_packed may provide more opportunities for future development. cc @tqchen for more thoughts here.

It would be nice to contemplate how we might be able to keep compatibility with --system-lib even if it may be overkill in some situations. I think a small C wrapper that effectively implements a tir.call_packed to instantiate the model could be one way to do this. We also don’t need to settle on this before making a first implementation of AOT in TIR.

Yes, this is exactly what I meant. I am looking forward to the RFC!

Great, I’m iterating on this a bit and hope to post it next week.

Hi all, I was finally able to get a first version of the AOT work into a PR upstream.

PR

You can find the PR here: [AOT] Introducing AOT in TVM by giuseros · Pull Request #7785 · apache/tvm · GitHub

At this stage, I gladly accept any feedback on things that can be improved in the PR or on issues I might have overlooked. Please help me smooth the edges of this work :slight_smile:

Limitations

There are two main limitations of the current work:

  • We didn’t add support for LLVM code generation. This is because we thought it better to agree on the overall picture first, using the C backend as a proof of concept, and then take care of the LLVM backend.
  • We didn’t include support for LetNode in the aot_codegen. Support for the LetNode is in the pipeline and will be added soon.

Next steps

Bear in mind that this is only the first step of a journey. We are currently working on different improvements to AOT, in particular:

  • LLVM support: this is currently being worked on, and we are almost there.
  • Name mangling: we are adding name mangling into the picture, i.e., the user should be able to specify a prefix that is added to all the global names used in the library. In this way, we will enable the user to build and link more than one network into the same application.
  • DLTensor surgery: since memory allocation is done statically, we don’t need to carry DLTensor through the generated code; it exposes metadata that is not consumed by the codegen and that increases the size of the binary image to be flashed on the microcontroller.
  • Unpack the runner function signature: change the API of the runner function so that it does not have a packed signature. This avoids instantiating type_ids or forcing a dynamic size of the function stack (things that add no benefit in the embedded space, but take a toll in terms of code size, performance, and power); see the sketch after this list.
  • int64_t surgery: using int64_t on embedded devices usually results in register spilling, which means power and performance are heavily affected. We are removing this datatype everywhere it’s being used.
  • Remove param lookup through __lookup_linked_param: to keep things simple, we are currently reusing the __lookup_linked_param function to access the parameters in the library. However, with AOT we can simply create a TIR builtin that accesses the parameters directly, without the overhead of a function invocation. This is still with the aim of saving power, performance, and space.
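
To illustrate the runner-signature item above, a comparison of the two signatures (tvmgen_default_run and its plain-pointer argument list are hypothetical, not the PR’s actual API):

```c
#include <stdint.h>
#include <tvm/runtime/c_runtime_api.h> /* TVMValue */

/* Today: the runner follows the packed calling convention, so every call
 * must materialize TVMValue/type-code arrays and size its frame dynamically. */
int32_t tvmgen_default_run_packed(TVMValue* args, int* type_codes, int num_args,
                                  TVMValue* out_ret_value, int* out_ret_tcode,
                                  void* resource_handle);

/* Goal: an unpacked signature with no TVMValue or type_ids and a fixed-size
 * stack frame, taking raw input/output buffers directly. */
int32_t tvmgen_default_run(void* input, void* output);
```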

cc: @ramana-arm @manupa-arm @areusch @mbaret @stoa @mjs


FYI: I will be out for Easter holidays until Tuesday (so I will be replying back to any comments as soon as I come back :slight_smile: )

Hi @giuseros, @manupa-arm,

I wanted to discuss one higher-level topic from the PR here: memory planning. Currently the AOT PR also implements some memory planning in its tvm_backend.c. I think it’d be great to separate that from the AOT PR and continue to use TVMBackendAllocWorkspace, even though it’s less efficient. The main reason for this is that we’re concurrently beginning to work on the Graph Memory Planner and I think it makes sense to handle all of the tensor pinning at that level, and produce some configuration that the executor can consume to decide where to place DLTensor at runtime.

This is fairly complex so we’ll release another RFC at some point in the future. What’re your thoughts here?

-Andrew

Hi @areusch, just to be clear, we are not doing memory planning in the current AOT :slight_smile:

What you see in tvm_backend.c is a memory allocator. Instead of going through the complex page allocator needed by the graph executor, we implemented a simpler one for AOT that behaves like a stack (with a LIFO policy).

This can be shown to work because in AOT we allocate the storage identifiers through let statements, and the internal operators also use let, so everything is LIFO and the stack discipline holds.

This couldn’t work with the graph executor, mostly because of the JSON (which used the same allocator but did not follow a LIFO convention).
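
For illustration, a minimal sketch of such a stack allocator (names and arena size are hypothetical; the real implementation lives in the PR’s tvm_backend.c):

```c
#include <stddef.h>
#include <stdint.h>

/* Bump allocator over a fixed arena. Because AOT emits allocations through
 * nested let bindings, frees always happen in reverse allocation order, so
 * popping the stack pointer is enough. */
static uint8_t g_arena[64 * 1024];
static size_t g_top = 0;

void* stack_alloc(size_t nbytes) {
  nbytes = (nbytes + 7u) & ~(size_t)7u; /* keep 8-byte alignment */
  if (g_top + nbytes > sizeof(g_arena)) return NULL;
  void* ptr = &g_arena[g_top];
  g_top += nbytes;
  return ptr;
}

void stack_free(void* ptr) {
  /* LIFO: ptr must be the most recently allocated live block. */
  g_top = (size_t)((uint8_t*)ptr - g_arena);
}
```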

As a side note, we are also planning work on a global memory planner, so it would be good to catch up at some point in order to reduce overlap.

Thanks,

Andrew

@giuseros on microTVM, the actual implementation (when using TVMBackendAllocWorkspace) is left up to TVMPlatformMemoryAllocate. Would it be possible to move the LIFO implementation behind this call? This would make it easier to try the AOT executor in other non-micro use cases.
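
For instance, the LIFO allocator sketched earlier could sit behind the platform hooks. A sketch, assuming the TVMPlatformMemoryAllocate/TVMPlatformMemoryFree signatures and error codes from the current CRT headers, and assuming the allocator functions are made externally visible:

```c
#include <tvm/runtime/crt/error_codes.h>
#include <tvm/runtime/crt/platform.h>

/* From the earlier sketch. */
extern void* stack_alloc(size_t nbytes);
extern void stack_free(void* ptr);

/* Platform hooks delegating to the LIFO allocator; dev is ignored because
 * everything lives in a single on-chip arena. */
tvm_crt_error_t TVMPlatformMemoryAllocate(size_t num_bytes, DLDevice dev, void** out_ptr) {
  (void)dev;
  *out_ptr = stack_alloc(num_bytes);
  return (*out_ptr != NULL) ? kTvmErrorNoError : kTvmErrorPlatformNoMemory;
}

tvm_crt_error_t TVMPlatformMemoryFree(void* ptr, DLDevice dev) {
  (void)dev;
  stack_free(ptr);
  return kTvmErrorNoError;
}
```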

Agree we should discuss global memory planning at some point soon.

Thanks everyone for the great discussion so far, and thanks @giuseros and others for bringing in the first AOT PoC. I finally got time to look into the proposed changes; this is great work.

My main comments so far have things to do with interface design and how to make things in an architecture consistent way.

Specifically, it would be great to think about the general API design and consolidation. In particular, we should de-couple the implementation of the API (AOT vs. interpreter-based) from the design of the API interface.

Ideally a user should follow a similar path for compiling (except for a different flag), exporting, and then loading an AOT module.

Right now there are a few variants of ways to expose the model generated by AOT:

  • W0: Through the runtime.Module and PackedFunc interface: the executor is a runtime.Module which contains three packed functions (set/get/run). This is in alignment with the module-based runtime interface mentioned previously.
  • W1a: A standardized C API for graph/aot model execution only in the C runtime.
  • W1b: A standardized C API runtime that wraps the module-based API (W0) and exposes a higher-level API to the user.
  • W2: A separate C API that allows direct invocation of the generated model, specifically for AOT.

From W2 => W1 => W0, increasing levels of standardization are involved.

For example, if AOT generates code that obeys the W0 convention, then we can naturally test the result generated by AOT directly through Python and run the code through RPC using the current set of infrastructure. The AOT tutorial can then sit directly inside the µTVM tutorials via Python.

W1a and W1b are similar to each other (from the users’ PoV), except that in the case of W1b, W0 is the first-class citizen and the common part; W1a models things the other way around. Finally, W2 means the developers need to be aware of the backend that is being used.

Given the importance of the embedded setting, I think it is useful to have some form of W1 (a or b) that gives users a common set of conventions for the C runtime. However, such an API ideally should not be AOT-specific, but instead be the “official” way to use all generated results through the C API.

I also think it would be useful to always start by thinking about W0 support. Although W0 introduces an indirection (e.g., the run function could be a direct C API instead of a PackedFunc), we already use PackedFunc for the per-operator functions, so using PackedFunc for the general case won’t add too much overhead, but would enable the benefits mentioned above.
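
For reference, driving a W0-style executor from C stays within the existing c_runtime_api.h surface; roughly (error handling elided, “run” being one of the set/get/run packed functions above):

```c
#include <tvm/runtime/c_runtime_api.h>

/* Sketch: look up the executor's "run" PackedFunc on a module handle and
 * invoke it with no arguments, per the module-based runtime interface. */
int run_model(TVMModuleHandle executor) {
  TVMFunctionHandle run;
  if (TVMModGetFunction(executor, "run", 0, &run) != 0) return -1;
  TVMValue ret_val;
  int ret_tcode;
  return TVMFuncCall(run, NULL, NULL, 0, &ret_val, &ret_tcode);
}
```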

Would love to get everyone’s take, in terms of (1) engineering feasibility/overhead of the Ws, and (2) preference of interface choice.

Hi @tqchen,

The main issue here is that we are targeting embedded environments. I am not a deep embedded expert (@mjs, @ramana-arm feel free to chime in), but my understanding is that the runtime API we offer to embedded developers needs to be quite minimal. Basically, we want to save every single byte in order to fit in the limited space embedded devices provide.

So, given that we see AOT as the first step toward tiny devices, we opted for W1a, basically. Our preference, as things move forward, would be a tiny dedicated runtime interface that embedded developers can use, one that does not rely on large data structures (e.g., TVMValue or DLTensor) and that involves a minimal set of #includes. So basically we are thinking along the lines of W2 for embedded.

While I understand the benefits of a general interface, if we want to be comparable to embedded compilers (e.g., see https://arxiv.org/pdf/2007.10319.pdf) I think abstracting such a “tiny” interface is appropriate.

I am not against future generalizations of the interface (i.e., W0 → W1b), but I think we can defer these to a later stage (also because they seem independent from the PR that is upstream), while focusing on embedded (W1a → W2) for now.

Thanks @giuseros, I agree with what you said about removing overhead for embedded.

In the meantime, it is also good to think about some form of standardization specifically for embedded land that maintains the minimalism while still offering some generality.

For example, some standardization around W1a, which removes the overhead of string lookup but still preserves the CPackedFunc, might be helpful. The CPackedFunc would then be able to serve as a generic way for users to plug in customized operators (because we still need a somewhat type-erased function to remain general). We might also be able to further reduce the overhead if we aggressively perform link-time optimization and inline all the CPackedFunc calls, making them effectively similar to standard calls.

So it would be great if we could work together to come up with a standardization that we can use across the board. Once such standardization happens (e.g., in the form of W1a), we can provide add-on libraries that expose the tiny standard API to the C runtime, so we can invoke the generated code through RPC and then remove such dependencies when it comes to actual deployment.

@giuseros @tqchen

cc @stoa @mjs @ramana-arm @tgall_foo @gromero @aca88 @MJKlaiber

This is definitely a tricky topic because the firmware-facing API implies some part of the implementation. And, the implementation is necessarily going to be different between micro-land and traditional OS-land due to a fundamental difference in the pattern by which accelerators are programmed:

  • On traditional OS, accelerators are lazily programmed at the time GetFunction is called. This allows for a Python interface that is both interactive and quite flexible.
  • In micro land, accelerators are programmed at some time before calling run(), and full control of that time must be given to the application/SDK.

While it may seem a bit premature to jump all the way to the accelerator use case here, I do so only because the closure architecture implied by GetFunction is particularly useful on traditional OS for accelerator Module implementations. GetFunction has become effectively the “load” function for GPU programming on traditional OS, in part because running complex processes such as JIT compilation as part of instantiating a model executor is a common pattern.

By contrast, on a microcontroller, GetFunction is problematic from a memory perspective and in its role as the function typically used to program accelerators. PackedFunc in micro-land are just C functions that run on target_host, even if they do launch compute on an accelerator. If we consider the analogous use case in the C++ runtime, GetFunction itself does nothing: LibraryModule merely implements it as dlsym. So in considering the API for a setting where no JIT programming is done and all functions are implemented on the target_host CPU, it’s not clear that the indirection provided by the runtime.Module interface is a good fit.

The question is then what is the right interface. Here are some thoughts on properties of the “right” interface:

  • Approachable by C firmware engineers. At the end of the day, the interface needs to be usable. It should be clear what each function call implies, and each function call should imply the “expected” thing to a firmware engineer.
  • Designed to the memory constraints of embedded systems. All non-stack-allocated memory should be passed-in rather than dynamically allocated. The application should have full control of non-stack-allocated memory. The API should not imply excessive use of the stack.
  • Compatible with the standard TVM runtime API, where the design allows. While there are differences e.g. the one I outlined above, we should strive in particular to maintain an RPC-compatible API layer. Doing so enables autotuning and performance measurement without the need to write custom firmware. There is evidence of such a system in a couple of other embedded inference APIs, and given that autotuning can result in e.g. a 2x speedup over a random schedule, we can’t ignore the need to support it.

The last point makes it difficult to do an entirely clean-slate design for microTVM. I think option W0 from TQ’s post can’t be implemented with those above properties, so I’ll propose a couple options here and identify how they fall in TQ’s classifications:

  • W1a or W2. Implement two entirely disjoint APIs, one for standalone production inference and one for RPC-based inference

  • W1c. Build a single API with two parts:

    1. a subset meant for standalone inference, implemented with plain C APIs
    2. a superset meant for RPC-driven inference, implementing the Module API

    This is like W1b in that the C APIs implemented in 1 will match those from the Module-based interface, but we will invert the wrapping scheme (e.g. define an object-oriented interface, where the objects are wrapped in Module and functions are wrapped in PackedFunc when the RPC server is in use).

Given the maintenance burden involved, I prefer to try to make some form of W1 work. So in the rest of this post, I’ll work through the existing API and identify the parts I think we need to re-examine on microTVM.

Inventorying the C++ module-load process

Towards that last point, let’s examine the various parts of the TVM model inference on traditional OS so we can understand which pieces are RPC-dependent:

  1. tvm.runtime.load_module: Copies model runtime from disk to RAM, and performs a “load” procedure for each module.
    • For target_host-code (e.g. code produced by llvm and c backends), this amounts to dlopen and instantiating a LibraryModule to wrap that.
    • For other code, invokes a “loader” function to instantiate a Module from a BLOB.
  2. TVMModGetFunction("model_name"): Return a PackedFunc that creates a GraphExecutor for “model_name”
  3. model_name_pf(): e.g. call the previously-returned function. Instantiate GraphExecutor for “model_name,” implying:
    • Loading of the executor configuration (e.g. graph_json)
    • Allocating memory for input, intermediate, and output tensors
    • Invoking GetFunction() for each implemented operator, which performs accelerator-specific load procedures as discussed above.
    • Looking up parameters linked into the shared library.
  4. GraphExecutor#SetInput: Copy tensor data from a CPU-bound tensor to a tensor possibly located in accelerator memory.
  5. GraphExecutor#Run: Launch inference and wait for completion.
  6. GraphExecutor#GetOutput: Return TVMArray (e.g. DLTensor) pointing to output activation n, possibly located in accelerator memory.

Let’s now see which steps impact usage over RPC, and whether those APIs are friendly to micro constraints (e.g. can be kept in a microTVM standalone inference application) or not. The RPC-dependent pieces are steps 3-6 here (step 2 is handled by PackedFunc runtime.SystemLib() over RPC).

I think that, from an RPC perspective, steps 4-6 are fairly uncontroversial, because the RPC layer is involved with memory management and outside of that, steps 4-6 are merely function calls. On the memory management point, the RPC layer requires either a way to get a DLTensor handle or that the client allow the RPC server to create one through some form of memory dynamism. The former can be implemented under the memory constraints mentioned before, and the latter can be accommodated by the microTVM RPC server without impacting standalone inference.

So let’s now consider step 3, which actually does have some impact on standalone inference. Piece by piece:

  • Loading of the executor configuration: bad for the GraphExecutor (JSON parsing implies dynamic memory allocation). Not an issue with AOT.
  • Allocating memory for input, intermediate, and output tensors: the API must be expanded to allow the application to do this. New functionality will need to be introduced to microTVM RPC server to provide for this (likely, the microTVM RPC server needs to accept the same parameters as the Executor API, and forward those along when the API is invoked).
  • Invoking GetFunction() for each operator library: requires excessive dynamic memory (the returned closure implies refcounting), and doesn’t buy us much because most operators are implemented by jumping the target_host CPU to the implementing PackedFunc. In the current TVM API, this piece allows for accelerator programming. Some replacement provision needs to be made here.

From this, I think we can see that the Executor initialization API needs to be reworked on microTVM. I would broaden this to include runtime initialization, because:

  • It’s all too easy to bring in hardware considerations at any point in this process:
    • RAM banks may need to be turned on or brought out of retention a) at system startup, b) between inferences.
    • Accelerator programming will be part of initialization on some systems.
  • Often to hide e.g. startup latency, applications will want to handle hardware initialization at very early parts of the boot phase, so defining an API that requires waiting for e.g. said RAM banks to be available before starting other initialization could preclude some application- or SoC-specific init pattern.

Function Calling Convention

A key barrier to adopting W1b/c is that RPC requires the use of the PackedFunc calling convention while a firmware-facing C API is both more efficient and friendlier to developers using the standard C calling convention. Here are some thoughts towards unifying the two:

  • To start with, we have an invariant: we need to be able to call into operator implementations over RPC to implement autotuning and RPC-driven execution. So, when used with the RPC server, there must be at least some PackedFunc wrapper for each operator implementation.

  • The primary benefits of PackedFunc in the C++ runtime are:

    • it’s compatible with the RPC layer
    • it provides a standard calling convention, allowing the implementation to use any programming language. Since the C++ runtime directly invokes PF to offload operators to accelerators, the standard calling convention is particularly helpful.
    • functions can be “monkey-patched” at runtime if needed.

    In a standalone micro inference, none of these concerns apply. I would say that the PackedFunc calling convention doesn’t offer much benefit to implemented operator functions.

  • Given this, a natural next question is: is it possible to translate PackedFunc into two pieces:

    1. An internal piece which uses standard C datatypes and calling convention
    2. A PackedFunc wrapper for said internal piece, which could be included only when compiling with RPC server.

    There are some examples with C++ PackedFunc of API styles that may be hard to translate. The most impactful example I can think of is the way that DLDevices are unpacked from GraphExecutor() PackedFunc args in a variadic fashion.

    Aside from this, it seems fairly straightforward to do, and may improve optimization in the downstream compiler.

It seems, then, that it should be possible to implement some type of “unpacked” calling convention when targeting the C runtime. To do so:

  1. define a name mangling scheme to translate PackedFunc names to C function names
  2. Update codegen to produce the inner “unpacked” func
  3. Add a flag to control generation of the PackedFunc wrappers.
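
A sketch of what steps 1-3 could produce for one operator (the tvmgen_mynet_ mangling scheme and the add kernel are hypothetical):

```c
#include <tvm/runtime/c_runtime_api.h>

/* Steps 1-2: mangled, inner "unpacked" function using plain C types. */
int32_t tvmgen_mynet_fused_add(float* a, float* b, float* out, int32_t n) {
  for (int32_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
  return 0;
}

/* Step 3: PackedFunc wrapper around the inner function, emitted only when
 * the wrapper-generation flag (e.g. for RPC use) is set. */
int32_t tvmgen_mynet_fused_add_packed(TVMValue* args, int* type_codes, int num_args,
                                      TVMValue* out_ret_value, int* out_ret_tcode,
                                      void* resource_handle) {
  (void)type_codes; (void)num_args;
  (void)out_ret_value; (void)out_ret_tcode; (void)resource_handle;
  DLTensor* a = (DLTensor*)args[0].v_handle;
  DLTensor* b = (DLTensor*)args[1].v_handle;
  DLTensor* out = (DLTensor*)args[2].v_handle;
  return tvmgen_mynet_fused_add((float*)a->data, (float*)b->data,
                                (float*)out->data, (int32_t)a->shape[0]);
}
```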

Reworking the Initialization APIs

There are three core areas of concern in reworking the initialization APIs:

  • C0. The existing runtime contains some pieces which are undesirable in a standalone inference application:
    • PackedFunc lookup tables (bloated, complex; in standalone inference, function call is a solved problem in micro-land)
    • Pieces of the runtime intended to support the RPC server (e.g. TVMFuncGetGlobal, TVMAPIGetLastError, RPCTimeEvaluator, etc)
    • Some NDArray functions (e.g. NDArray_Load, etc).
  • C1. How should we supply backing memory for tensors (input, intermediate, output) to executor instances?
  • C2. How, if at all, should the executor be involved with initialization (e.g. either initializing hardware, or providing software hooks, both at runtime startup and just before inference)?

C0 can be addressed by splitting src/runtime/crt/common into two pieces:

  1. crt_backend_api.c and the things it requires (except TVMBackendGetFuncFromEnv, see below). TVMBackend functions may be called from generated code; therefore, of all the API pieces, this one should absolutely belong with the standalone deployment subset.
  2. the rest, which can go with the RPC superset

C1: In a W1b unified API world, concern C1 is more closely tied to GraphPlanMemory. However, at present, only GraphExecutor consumes the output of GraphPlanMemory. In a micro world, the application must consume that output. The core thing we need to do to bridge the gap between an internally-consumed format which requires dynamic memory and a micro-friendly API is to make the output of GraphPlanMemory a data structure that makes sense for the application to consume. This would give the application control over the intermediate and output tensors, and require future changes to the memory planner to be cognizant of application requirements via unit tests.
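
To illustrate what “a data structure that makes sense for the application to consume” could look like, here is an entirely hypothetical descriptor the planner might emit:

```c
#include <stddef.h>

/* Hypothetical: one entry per pinned storage block produced by
 * GraphPlanMemory, letting the application statically place its own buffers. */
typedef struct {
  const char* name;  /* storage id or tensor name */
  size_t size_bytes; /* required backing size */
  size_t alignment;  /* required alignment */
} TVMMemoryPlanEntry;

typedef struct {
  size_t num_entries;
  const TVMMemoryPlanEntry* entries;
} TVMMemoryPlan;
```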

Additionally towards C1, we should implement SetInputZeroCopy from the C++ GraphExecutor, and should probably actually just replace SetInput with that as the standard way to set an input tensor. This gives the application control over the input tensor.

C2. This one needs some input from the community. Here are some possible ways I could envision the executor interacting with the SoC during “initialization,” “pre-inference,” and “post-inference:”

  1. powering or bringing RAM in/out of retention for parameter/input loading.
  2. provide some signal to any hardware involved before starting a computation and after it’s finished.
  3. providing a designated place for hardware vendors to put code that brings accelerators between e.g. reset → active → sleeping → active states.
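
For example (hook names invented here purely to seed the discussion, not an existing TVM API):

```c
/* Hypothetical application/SoC-defined hooks the executor could invoke. */
void TVMPlatformBeforeInit(void);      /* 1: e.g. power RAM banks for parameter loading */
void TVMPlatformBeforeInference(void); /* 2, 3: e.g. bring accelerators to the active state */
void TVMPlatformAfterInference(void);  /* 2, 3: e.g. return accelerators to sleep */
```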

Summary

  • I prefer W1c: implementing a small standalone inference-focused API and wrapping that in Module to allow AOT to be driven over RPC when needed.
  • As part of this: splitting the existing src/runtime/crt/common into a standalone piece (which includes the TVMBackend APIs plus any needed to support this standalone piece) and an RPC piece (which includes the Module infrastructure).
  • The initialization APIs need to be reworked to allow for application-defined management of the Tensor memory, and some consideration for e.g. init hooks for deeper hardware integration should be provided.
  • Ultimately, this should result in a compact C-style API for standalone inference as proposed both here and in the STM32 port.

Would love to get everyone’s thoughts on this assessment and the suggested path forward! It’s possible this should split into its own RFC, so we can do that if people feel that would be more appropriate.

Thanks @areusch, I agree some form of W1c is great. I still think it would be beneficial to dissect and discuss the following factors for implementing function calls:

  • F0: PackedFunc (bad for µTVM land)
  • F1: CPackedFunc (directly calls into the symbol, but still uses the TVMValue and type-code encoding)
  • F2: Normal unpacked function per the C API

I agree that F0 should not be mandatory in embedded land, so we don’t have to do string lookups. I still think we should standardize on F1 if possible, as it still provides a common standard for type-erased functions (e.g. a developer can use it to hand-wire customized operators without the framework noticing the particular signature of the function).

Assuming we do link-time optimization, and the compiler inlines the function and turns heap assignments into register loads, the end effect of F1 could get close to F2.

From a codegen perspective, it seems like we shouldn’t need to choose between F1 and F2: these can just be different types of tir.call_* nodes in the TIR. A rewrite pass should be able to detect whether the target function is codegen’d by a TVM generator which supports unpacked calls, and rewrite the TIR (and set function attributes) to reflect that. Viewed like this, F2 just becomes a further potential optimization of what we have in F1.

The main question in my mind is how we should expose the APIs. The standalone, firmware-facing API could either:

  • be implemented by the AOT codegen directly, if it supports it
  • be defined by a wrapper

In the case that we want to broaden the firmware-facing API beyond something that can be placed behind runtime.Module (e.g. something that may return a user-defined datatype, such as get_info), we will need a wrapper implementation. So, it seems pretty inevitable that we start with a wrapper, and then potentially remove the wrapping where we can.

One complication comes if we want to implement an API prefix e.g. ai_<model_name>_create. We may need a wrapper template in this case.

Assuming we do link-time optimization, and the compiler inlines the function and turns heap assignments into register loads, the end effect of F1 could get close to F2.

One thing I have learned is not to depend on the compiler to do anything :slight_smile:. It’s great if it can optimize this for us, but we may find a compiler that implements this correctly but which doesn’t fit all possible targets. So, I’d prefer to be as explicit as possible in codegen.

I think the external interface of the AOT module should follow the CPackedFunc format, since this is the interface currently used for all other externally visible functions. There would be an entry point function with a predefined name, and the order and meaning of its parameters could be established in a way analogous to how it currently works with tvm.build.

So would this suggestion then be compatible with including a small possibly-templated shim layer to translate between CPackedFunc and first-class C datatypes (e.g. int, float, DLTensor)? My feeling is that invoking CPackedFunc directly from firmware is burdensome, but perhaps not a big deal if handled by a shim layer.
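
A sketch of such a shim (the tvm_mynet_run entry point, the ai_mynet_run wrapper name, and the single-input/single-output shape are all hypothetical):

```c
#include <tvm/runtime/c_runtime_api.h>

/* CPackedFunc-style entry point generated by AOT (hypothetical name). */
int32_t tvm_mynet_run(TVMValue* args, int* type_codes, int num_args,
                      TVMValue* out_ret_value, int* out_ret_tcode,
                      void* resource_handle);

/* Firmware-facing shim: first-class C types in, CPackedFunc underneath. */
int32_t ai_mynet_run(DLTensor* input, DLTensor* output) {
  TVMValue args[2];
  int type_codes[2] = {kTVMDLTensorHandle, kTVMDLTensorHandle};
  args[0].v_handle = input;
  args[1].v_handle = output;
  TVMValue ret_val;
  int ret_tcode;
  return tvm_mynet_run(args, type_codes, 2, &ret_val, &ret_tcode, NULL);
}
```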

When you talk about firmware, are you thinking about firmware calling the graph runner? If such calls are infrequent, CPackedFunc should not be that much of a burden. I’d like to stick with CPackedFunc, because we already use it.

Now, having said that, let me first describe how I view the execution model for AOT, because I’m not sure if we have the same ideas in mind.

Long story short:

  1. Tight coupling of the runner function with the operator functions.
  2. Limited set of functions exported from a module.
  3. Use assistance from targets to generate cross-target calls.

When thinking about AOT, I specifically have inference in mind, i.e. execution of a graph with a predefined set of parameters, where the inputs (activations) will vary from run to run. In that scenario, the graph executing function will be a part of the generated module. For the moment I’m assuming no accelerators.

Here we only have one entry point to the model: the runner function. The operator functions are no longer accessible from outside of the module. Because of that, the calling conventions used there ultimately don’t matter[1]. This may be a consideration for the codegen, though, since we want to make it possible to inline operator functions into the runner function (what I mean specifically is that we should make it reasonably easy for the compiler (TVM, LLVM, etc.) to see through the function calls). The runner function, however, would then still follow some established API, and for this I propose CPackedFunc.

With accelerators, we would have a device module, except this one would have several externally visible functions. Functions not visible outside of this module (i.e. callable only from inside of it) would not have any prescribed calling convention[1].

By the way, this all follows the shared library model, where certain functions are “exported”, i.e. callable from outside of it, while the rest are “internal”. The exported functions should follow a known convention, while the internal are unrestricted (at least from the point of view of a proposal like this).

If the runner function were external to the module, then all operator functions would need to be exported from it, which would come with a performance penalty.

The remaining part is cross-target function calls. I’m going to assume that there are no cycles between targets with respect to function calls (i.e. if A calls B, then B cannot call A; similarly no “A calls B, B calls C, C calls A”, and so on). Here is where things get complicated, because we don’t want to use the GetFunction method. I think we will need to let each target implement the exact call sequence: we do that now using GetFunction at runtime; instead, we’d need each target to apply its own codegen to generate the appropriate call sequence.

[1] We could still have some predefined convention, but it would only be a convention of convenience. This would make it possible to change it in the future without breaking things for users.

I know that there is already a prototype of it, but I think we should really just define the set of functions that an AOT runtime should implement, and then let each target implement its own. The lower-level things are, the more hardware-specific they get.