@giuseros @tqchen
cc @stoa @mjs @ramana-arm @tgall_foo @gromero @aca88 @MJKlaiber
This is definitely a tricky topic, because the firmware-facing API implies some part of the implementation. And the implementation is necessarily going to be different between micro-land and traditional OS-land due to a fundamental difference in the pattern by which accelerators are programmed:
- On traditional OS, accelerators are lazily programmed at the time GetFunction is called. This allows for a Python interface that is both interactive and quite flexible.
- In micro land, accelerators are programmed at some time before calling run(), and full control of that time must be given to the application/SDK.
While it may seem a bit premature to jump all the way to the accelerator use case here, I do so only because the closure architecture implied by GetFunction is particularly useful on traditional OS for accelerator Module implementations. GetFunction has effectively become the "load" function for GPU programming on traditional OS, in part because performing complex processes such as JIT compilation while instantiating a model executor is a common pattern.
By contrast, on a microcontroller, GetFunction is problematic from a memory perspective and in its role as the function typically used to program accelerators. PackedFunc in micro-land are just C functions that run on target_host, even if they do launch compute on an accelerator. If we were to consider the analogous use case in the C++ runtime, GetFunction itself does nothing here: LibraryModule merely implements GetFunction as dlsym. So, in considering the API for a setting where no JIT programming is done and all functions are implemented on the target_host CPU, it's not clear that the indirection provided by the runtime.Module interface is a good fit.
The question is then what is the right interface. Here are some thoughts on properties of the “right” interface:
- Approachable by C firmware engineers. At the end of the day, the interface needs to be usable. It should be clear what each function call implies, and each function call should imply the “expected” thing to a firmware engineer.
- Designed to the memory constraints of embedded systems. All non-stack-allocated memory should be passed-in rather than dynamically allocated. The application should have full control of non-stack-allocated memory. The API should not imply excessive use of the stack.
- Compatible with the standard TVM runtime API, where the design allows. While there are differences, e.g. the one I outlined above, we should strive in particular to maintain an RPC-compatible API layer. Doing so enables autotuning and performance measurement without the need to write custom firmware. There is evidence of such a system in a couple of other embedded inference APIs, and given that autotuning can result in e.g. a 2x speedup over a random schedule, we can't ignore the need to support it.
The last point makes it difficult to do an entirely clean-slate design for microTVM. I don't think option W0 from TQ's post can be implemented with the properties above, so I'll propose a couple of options here and identify how they fall within TQ's classifications:
- W1a or W2. Implement two entirely disjoint APIs, one for standalone production inference and one for RPC-based inference.
- W1c. Build a single API with two parts:
  - a subset meant for standalone inference, implemented with plain C APIs
  - a superset meant for RPC-driven inference, implementing the Module API
This is like W1b in that the C APIs implemented in the standalone subset will match those from the Module-based interface, but we invert the wrapping scheme (e.g. define an object-oriented C interface, where the objects are wrapped in Module and the functions are wrapped in PackedFunc only when the RPC server is in use). A rough sketch of this inversion follows below.
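To make the inversion concrete, here is a minimal sketch, assuming a hypothetical object-oriented C API. None of these names exist today; TVMMicroExecutor and its functions are placeholders for whatever the standalone API ends up being:

```c
#include <stdint.h>
#include <stddef.h>
#include <dlpack/dlpack.h>  // DLTensor

// Hypothetical standalone-inference API: plain C, object-oriented, with
// application-provided memory. All names here are illustrative only.
typedef struct TVMMicroExecutor TVMMicroExecutor;

int32_t TVMMicroExecutor_Create(void* workspace, size_t workspace_bytes,
                                TVMMicroExecutor** out_exec);
int32_t TVMMicroExecutor_SetInput(TVMMicroExecutor* exec, int index, DLTensor* tensor);
int32_t TVMMicroExecutor_Run(TVMMicroExecutor* exec);

// When the RPC server is compiled in, each of these functions would additionally
// be wrapped in a PackedFunc and exposed through a Module, inverting the C++
// runtime approach where Module/PackedFunc is the primary interface.
```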
Given the maintenance burden involved, I prefer to try to make some form of W1 work. So in the rest of this post, I’ll work through the existing API and identify the parts I think we need to re-examine on microTVM.
Inventorying the C++ module-load process
Towards that last point, let’s examine the various parts of the TVM model inference on traditional OS so we can understand which pieces are RPC-dependent:
1. tvm.runtime.load_module: Copies the model runtime from disk to RAM, and performs a "load" procedure for each module.
   - For target_host code (e.g. code produced by the llvm and c backends), this amounts to dlopen plus instantiating a LibraryModule to wrap it.
   - For other code, it invokes a "loader" function to instantiate a Module from a BLOB.
2. TVMModGetFunction("model_name"): Returns a PackedFunc that creates a GraphExecutor for "model_name".
3. model_name_pf(): i.e. calling the previously-returned function. This instantiates a GraphExecutor for "model_name," implying:
   - Loading of the executor configuration (e.g. graph_json)
   - Allocating memory for input, intermediate, and output tensors
   - Invoking GetFunction() for each implemented operator, which performs accelerator-specific load procedures as discussed above
   - Looking up parameters linked into the shared library
4. GraphExecutor#SetInput: Copies tensor data from a CPU-bound tensor to a tensor possibly located in accelerator memory.
5. GraphExecutor#Run: Launches inference and waits for completion.
6. GraphExecutor#GetOutput: Returns a TVMArray (i.e. a DLTensor) pointing to output activation n, possibly located in accelerator memory.
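For concreteness, here is a minimal sketch of what steps 1-2 look like when driven through the TVM C runtime API on a traditional OS; error handling is omitted and "deploy.so"/"model_name" are placeholders:

```c
#include <tvm/runtime/c_runtime_api.h>

void load_model_on_traditional_os(void) {
  // Step 1: load the compiled model library (dlopen + LibraryModule underneath).
  TVMModuleHandle mod;
  TVMModLoadFromFile("deploy.so", "so", &mod);

  // Step 2: look up the factory PackedFunc for "model_name".
  TVMFunctionHandle model_name_pf;
  TVMModGetFunction(mod, "model_name", /*query_imports=*/0, &model_name_pf);

  // Step 3 would call model_name_pf (passing DLDevice arguments), which
  // instantiates the GraphExecutor: it loads the executor configuration,
  // allocates tensor memory, and invokes GetFunction() for each operator.
  (void)model_name_pf;
}
```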
Let's now see which steps impact usage over RPC, and whether those APIs are friendly to micro constraints (e.g. can be kept in a microTVM standalone inference application) or not. The RPC-dependent pieces are steps 3-6 here (step 2 is handled by the PackedFunc runtime.SystemLib() over RPC).
I think that, from an RPC perspective, steps 4-6 are fairly uncontroversial, because the RPC layer is involved with memory management and outside of that, steps 4-6 are merely function calls. On the memory management point, the RPC layer requires either a way to get a DLTensor handle or that the client allow the RPC server to create one through some form of memory dynamism. The former can be implemented under the memory constraints mentioned before, and the latter can be accommodated by the microTVM RPC server without impacting standalone inference.
So let’s now consider step 3, which actually does have some impact on standalone inference. Piece by piece:
- Loading of the executor configuration: bad for the GraphExecutor (JSON parsing implies dynamic memory allocation). Not an issue with AOT.
- Allocating memory for input, intermediate, and output tensors: the API must be expanded to allow the application to do this. New functionality will need to be introduced to microTVM RPC server to provide for this (likely, the microTVM RPC server needs to accept the same parameters as the Executor API, and forward those along when the API is invoked).
- Invoking GetFunction() for each operator library: requires excessive dynamic memory (the returned closure implies refcounting), and doesn't buy us much because most operators are implemented by jumping the target_host CPU to the implemented PackedFunc. In the current TVM API, this piece allows for accelerator programming, so some replacement provision needs to be made here.
From this, I think we can see that the Executor initialization API needs to be reworked on microTVM. I would broaden this to include runtime initialization, because:
- It’s all too easy to bring in hardware considerations at any point in this process:
- RAM banks may need to be turned on or brought out of retention a) at system startup or b) between inferences.
- Accelerator programming will be part of initialization on some systems.
- Often, to hide e.g. startup latency, applications will want to handle hardware initialization very early in the boot phase, so defining an API that requires e.g. said RAM banks to be available before other initialization starts could preclude some application- or SoC-specific init patterns.
Function Calling Convention
A key barrier to adopting W1b/c is that RPC requires the use of the PackedFunc calling convention, while a firmware-facing C API that uses the standard C calling convention is both more efficient and friendlier to developers. Here are some thoughts towards unifying the two:
- To start with, we have an invariant: we need to be able to call into operator implementations over RPC to implement autotuning and RPC-driven execution. So, when used with the RPC server, there must be at least some PackedFunc wrapper for each operator implementation.
- The primary benefits of PackedFunc in the C++ runtime are:
  - it's compatible with the RPC layer
  - it provides a standard calling convention, allowing the implementation to use any programming language. Since the C++ runtime directly invokes PackedFunc to offload operators to accelerators, the standard calling convention is particularly helpful.
  - functions can be "monkey-patched" at runtime if needed.
In a standalone micro inference, none of these concerns apply. I would say that the PackedFunc calling convention doesn't offer much benefit to implemented operator functions.
- Given this, a natural next question is: is it possible to split PackedFunc into two pieces:
  - an internal piece which uses standard C datatypes and the standard C calling convention
  - a PackedFunc wrapper for said internal piece, which could be included only when compiling with the RPC server
There are some examples of C++ PackedFunc API styles that may be hard to translate. The most impactful example I can think of is the way that DLDevice are unpacked from GraphExecutor() PackedFunc args in a variadic fashion.
Aside from this, it seems fairly straightforward to do, and may improve optimization in the downstream compiler.
It seems, then, that it should be possible to implement some type of "unpacked" calling convention when targeting the C runtime. To do so:
- define a name-mangling scheme to translate PackedFunc names into C function names
- update codegen to produce the inner "unpacked" func
- add a flag to control generation of the PackedFunc wrappers
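As an illustration only (the operator name, mangling scheme, and wrapper below are hypothetical, not the exact generated code), the two pieces for a single operator could look roughly like this:

```c
#include <stdint.h>
#include <tvm/runtime/c_runtime_api.h>  // TVMValue, DLTensor

// Inner "unpacked" function: plain C types and the C calling convention,
// callable directly from firmware with no PackedFunc machinery.
int32_t tvmgen_default_fused_conv2d(float* input, float* weight, float* output);

// PackedFunc wrapper, generated only when the RPC server / Module infrastructure
// is compiled in. It unpacks the TVMValue args and forwards to the inner function.
int32_t tvmgen_default_fused_conv2d_packed(TVMValue* args, int* type_codes,
                                           int num_args, TVMValue* out_ret_value,
                                           int* out_ret_tcode, void* resource_handle) {
  DLTensor* input = (DLTensor*)args[0].v_handle;
  DLTensor* weight = (DLTensor*)args[1].v_handle;
  DLTensor* output = (DLTensor*)args[2].v_handle;
  return tvmgen_default_fused_conv2d((float*)input->data, (float*)weight->data,
                                     (float*)output->data);
}
```

The `_packed` suffix and the `tvmgen_default_` prefix are placeholders; the point is only that the wrapper is a thin, optional shim over an ordinary C function.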
Reworking the Initialization APIs
There are three core areas of concern in reworking the initialization APIs:
- C0. The existing runtime contains some pieces which are undesirable in a standalone inference application:
- PackedFunc lookup tables (bloated, complex; in standalone inference, function call is a solved problem in micro-land)
- Pieces of the runtime intended to support the RPC server (e.g. TVMFuncGetGlobal, TVMAPIGetLastError, RPCTimeEvaluator, etc)
- Some NDArray functions (e.g. NDArray_Load, etc).
- C1. How should we supply backing memory for tensors (input, intermediate, output) to executor instances?
- C2. How, if at all, should the executor be involved with initialization (e.g. either initializing hardware, or providing software hooks, both at runtime startup and just before inference)?
C0 can be addressed by splitting common into two pieces:
- crt_backend_api.c and the things it requires (except TVMBackendGetFuncFromEnv, see below). TVMBackend functions may be called from generated code, so of all the API pieces, this one should absolutely belong with the standalone deployment subset.
- the rest, which can go with the RPC superset
C1: In a W1b unified API world, concern C1 is more closely tied to GraphPlanMemory. However, at present, only GraphExecutor consumes the output of GraphPlanMemory. In a micro world, the application must consume that output. The core thing we need to do to bridge the gap between an internally-consumed format which requires dynamic memory and a micro-friendly API is to make the output of GraphPlanMemory a data structure that makes sense for the application to consume. This would give the application control over the intermediate and output tensors, and require future changes to the memory planner to be cognizant of application requirements via unit tests.
Additionally towards C1, we should implement SetInputZeroCopy from the C++ GraphExecutor, and probably just replace SetInput with it as the standard way to set an input tensor. This gives the application control over the input tensor.
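To make the C1 direction concrete, here is one hypothetical shape the memory-planner output could take (struct and symbol names are purely illustrative):

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical, code-generated description of the memory a model needs. The
// memory planner emits one entry per pool rather than allocating anything itself.
typedef struct {
  size_t size_bytes;       // required pool size
  size_t alignment_bytes;  // required alignment
} TVMModelMemoryPool;

extern const TVMModelMemoryPool tvmgen_default_memory_pools[];
extern const size_t tvmgen_default_num_memory_pools;

// The application then owns the backing memory, e.g. by statically allocating it
// (sizes would come from generated constants in practice) and passing pointers
// to the executor at init time.
static uint8_t g_pool0[16 * 1024] __attribute__((aligned(16)));
```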
C2. This one needs some input from the community. Here are some possible ways I could envision the executor interacting with the SoC during “initialization,” “pre-inference,” and “post-inference:”
- powering or bringing RAM in/out of retention for parameter/input loading.
- providing some signal to any hardware involved before starting a computation and after it's finished.
- providing hardware vendors a designated place to put code that brings accelerators between e.g. reset → active → sleeping → active states.
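As one possible shape for this (the function names are hypothetical, just to make the discussion concrete), the runtime could rely on a small set of application-implemented hooks:

```c
#include <stdint.h>

// Hypothetical hooks implemented by the application/BSP. The executor would call
// them at well-defined points so the SoC can manage power, RAM retention, and
// accelerator state without the runtime hard-coding any of it.
int32_t TVMPlatformRuntimeInit(void);      // once, at (or shortly after) boot
int32_t TVMPlatformBeforeInference(void);  // e.g. bring an accelerator reset -> active
int32_t TVMPlatformAfterInference(void);   // e.g. return the accelerator to sleep
```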
Summary
- I prefer W1c: implementing a small standalone inference-focused API and wrapping that in Module to allow AOT to be driven over RPC when needed.
- As part of this: splitting the existing src/runtime/crt/common into a standalone piece (which includes the TVMBackend APIs plus anything else needed to support this standalone piece) and an rpc piece (which includes the Module infrastructure).
- The initialization APIs need to be reworked to allow for application-defined management of the Tensor memory, and some consideration for e.g. init hooks for deeper hardware integration should be provided.
- Ultimately, this should result in a compact C-style API for standalone inference as proposed both here and in the STM32 port.
Would love to get everyone’s thoughts on this assessment and the suggested path forward! It’s possible this should split into its own RFC, so we can do that if people feel that would be more appropriate.