Implementing AOT in TVM

tqchen · April 13, 2021, 8:22pm

Thanks @giuseros, I agree with what you said about removing overheads for embedded.

In the meantime, it is also good to think about some form of standardization specifically for embedded land that maintains the minimalism while still offering some generality.

For example, some standardization around W1a, which removes the overhead of string lookup but still preserves the CPackedFunc, might be helpful. The CPackedFunc would then be able to serve as a generic way for users to plug in customized operators (because we still need a somewhat type-erased function to remain general). We might also be able to further reduce the overhead if we aggressively perform link-time optimization and inline all the CPackedFunc calls, making the resulting code effectively similar to standard calls.

So it would be great if we could work together to come up with such a standardization that we can use everywhere. Once such standardization happens (e.g. in the form of W1a), we can provide add-on libraries that expose the tiny standard API to the C runtime so we can invoke the generated code through RPC, and then remove such dependencies when it comes to actual deployment.
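To make the W1a idea concrete, here is a minimal sketch in C. `SketchValue` is a simplified stand-in for `TVMValue`, and `add_one_packed`, the registry, and the type code are hypothetical names invented for illustration. The point is only that the type-erased signature is kept while the string lookup disappears.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for TVMValue (assumption: the real union has more members). */
typedef union {
  int64_t v_int64;
  double v_float64;
  void* v_handle;
} SketchValue;

enum { kSketchInt = 0 }; /* hypothetical type code */

/* Type-erased signature in the style of a CPackedFunc. */
typedef int (*SketchPackedCFunc)(SketchValue* args, int* type_codes, int num_args,
                                 SketchValue* out_ret_value, int* out_ret_tcode,
                                 void* resource_handle);

/* Hypothetical operator: returns args[0] + 1. */
int add_one_packed(SketchValue* args, int* type_codes, int num_args,
                   SketchValue* out_ret_value, int* out_ret_tcode,
                   void* resource_handle) {
  (void)resource_handle;
  if (num_args != 1 || type_codes[0] != kSketchInt) return -1;
  out_ret_value->v_int64 = args[0].v_int64 + 1;
  *out_ret_tcode = kSketchInt;
  return 0;
}

/* String-keyed table: the part that W1a would remove from the deployed image. */
typedef struct {
  const char* name;
  SketchPackedCFunc fn;
} SketchRegistryEntry;

static const SketchRegistryEntry kSketchRegistry[] = {{"add_one", add_one_packed}};

SketchPackedCFunc sketch_lookup_by_name(const char* name) {
  for (size_t i = 0; i < sizeof(kSketchRegistry) / sizeof(kSketchRegistry[0]); ++i) {
    if (strcmp(kSketchRegistry[i].name, name) == 0) return kSketchRegistry[i].fn;
  }
  return NULL;
}

/* W1a style: the caller links against the symbol and calls it directly. */
int64_t sketch_call_add_one(int64_t x) {
  SketchValue arg, ret;
  int tcode = kSketchInt, ret_tcode = 0;
  arg.v_int64 = x;
  ret.v_int64 = 0;
  int rc = add_one_packed(&arg, &tcode, 1, &ret, &ret_tcode, NULL);
  assert(rc == 0);
  (void)rc;
  return ret.v_int64;
}
```

In this sketch the registry and `sketch_lookup_by_name` would live only in an RPC add-on library, while a deployed image would contain just the direct call.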

areusch · April 14, 2021, 7:30pm

@giuseros @tqchen

cc @stoa @mjs @ramana-arm @tgall_foo @gromero @aca88 @MJKlaiber

This is definitely a tricky topic because the firmware-facing API implies some part of the implementation. And the implementation is necessarily going to be different between micro-land and traditional OS-land due to a fundamental difference in the pattern by which accelerators are programmed:

On traditional OS, accelerators are lazily programmed at the time GetFunction is called. This allows for a Python interface that is both interactive and quite flexible.
In micro land, accelerators are programmed at some time before calling run(), and full control of that time must be given to the application/SDK.

While it may seem a bit premature to jump all the way to the accelerator use case here, I do so only because the closure architecture implied by GetFunction is particularly useful on traditional OS for accelerator Module implementation. GetFunction has become effectively the “load” function for GPU programming on traditional OS, in part because performing complex processes such as JIT compilation as part of instantiating a model executor is a common pattern.

By contrast, on a microcontroller, GetFunction is problematic from a memory perspective and in its role as the function typically used to program accelerators. PackedFunc in micro-land are just C functions that run on target_host, even if they do launch compute on an accelerator. If we were to consider the analogous use case in the C++ runtime, GetFunction itself does nothing here–LibraryModule merely implements GetFunction as dlsym. So in considering the API for a setting where no JIT programming is done and all functions are implemented on the target_host CPU, it’s not clear that the indirection provided by runtime.Module interface is a good fit.

The question is then what is the right interface. Here are some thoughts on properties of the “right” interface:

Approachable by C firmware engineers. At the end of the day, the interface needs to be usable. It should be clear what each function call implies, and each function call should imply the “expected” thing to a firmware engineer.
Designed to the memory constraints of embedded systems. All non-stack-allocated memory should be passed-in rather than dynamically allocated. The application should have full control of non-stack-allocated memory. The API should not imply excessive use of the stack.
Compatible with the standard TVM runtime API, where the design allows. While there are differences e.g. the one I outlined above, we should strive in particular to maintain an RPC-compatible API layer. Doing so enables autotuning and performance measurement without the need to write custom firmware. There is evidence of such a system in a couple of other embedded inference APIs, and given that autotuning can result in e.g. a 2x speedup over a random schedule, we can’t ignore the need to support it.
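As a minimal sketch of the second property (all non-stack memory passed in by the application), the following C fragment binds an application-owned buffer to an executor and carves tensor storage out of it. Every name here (`SketchExecutor`, `sketch_executor_init`, `sketch_executor_alloc`) is hypothetical, and the workspace base address is assumed to be suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical executor state; none of these names are TVM APIs. */
typedef struct {
  uint8_t* workspace;     /* application-owned memory */
  size_t workspace_bytes; /* size of that memory */
  size_t used_bytes;      /* bump-pointer offset */
} SketchExecutor;

/* Bind application-owned memory to the executor; nothing is allocated here. */
int sketch_executor_init(SketchExecutor* exec, uint8_t* workspace, size_t workspace_bytes) {
  if (exec == NULL || workspace == NULL) return -1;
  exec->workspace = workspace;
  exec->workspace_bytes = workspace_bytes;
  exec->used_bytes = 0;
  return 0;
}

/* Carve a buffer out of the workspace at an 8-byte-aligned offset.
 * Returns NULL when the request does not fit; there is no fallback to malloc. */
void* sketch_executor_alloc(SketchExecutor* exec, size_t nbytes) {
  assert(exec != NULL);
  size_t aligned = (exec->used_bytes + 7u) & ~(size_t)7u;
  if (aligned > exec->workspace_bytes) return NULL;
  if (nbytes > exec->workspace_bytes - aligned) return NULL;
  exec->used_bytes = aligned + nbytes;
  return exec->workspace + aligned;
}
```

The application decides where the arena lives (a static array, a particular RAM bank, and so on), and an out-of-memory condition becomes an ordinary return value.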

The last point makes it difficult to do an entirely clean-slate design for microTVM. I think option W0 from TQ’s post can’t be implemented with those above properties, so I’ll propose a couple options here and identify how they fall in TQ’s classifications:

W1a or W2. Implement two entirely disjoint APIs, one for standalone production inference and one for RPC-based inference
W1c. Build a single API with two parts:
1. a subset meant for standalone inference, implemented with plain C APIs
2. a superset meant for RPC-driven inference, implementing the Module API
This is like W1b in that the C APIs implemented in 1 will match those from the Module-based interface, but we will invert the wrapping scheme (e.g. define an object-oriented interface, where the objects are wrapped in Module and functions are wrapped in PackedFunc when the RPC server is in use).

Given the maintenance burden involved, I prefer to try to make some form of W1 work. So in the rest of this post, I’ll work through the existing API and identify the parts I think we need to re-examine on microTVM.

Inventorying the C++ module-load process

Towards that last point, let’s examine the various parts of the TVM model inference on traditional OS so we can understand which pieces are RPC-dependent:

1. tvm.runtime.load_module: Copies model runtime from disk to RAM, and performs a “load” procedure for each module.
- For target_host-code (e.g. code produced by llvm and c backends), this amounts to dlopen and instantiating a LibraryModule to wrap that.
- For other code, invokes a “loader” function to instantiate a Module from a BLOB.
2. TVMModGetFunction("model_name"): Return a PackedFunc that creates a GraphExecutor for “model_name”
3. model_name_pf(): i.e. call the previously-returned function. Instantiate GraphExecutor for “model_name,” implying:
- Loading of the executor configuration (e.g. graph_json)
- Allocating memory for input, intermediate, and output tensors
- Invoking GetFunction() for each implemented operator, which performs accelerator-specific load procedures as discussed above.
- Looking up parameters linked into the shared library.
4. GraphExecutor#SetInput: Copy tensor data from a CPU-bound tensor to a tensor possibly located in accelerator memory.
5. GraphExecutor#Run: Launch inference and wait for completion.
6. GraphExecutor#GetOutput: Return TVMArray (e.g. DLTensor) pointing to output activation n, possibly located in accelerator memory.

Let’s now see which steps impact usage over RPC, and whether those APIs are friendly to micro constraints (e.g. can be kept in a microTVM standalone inference application) or not. The RPC-dependent pieces are steps 3-6 here (step 2 is handled by PackedFunc runtime.SystemLib() over RPC).

I think that, from an RPC perspective, steps 4-6 are fairly uncontroversial, because the RPC layer is involved with memory management and outside of that, steps 4-6 are merely function calls. On the memory management point, the RPC layer requires either a way to get a DLTensor handle or that the client allow the RPC server to create one through some form of memory dynamism. The former can be implemented under the memory constraints mentioned before, and the latter can be accommodated by the microTVM RPC server without impacting standalone inference.

So let’s now consider step 3, which actually does have some impact on standalone inference. Piece by piece:

Loading of the executor configuration: bad for the GraphExecutor (JSON parsing implies dynamic memory allocation). Not an issue with AOT.
Allocating memory for input, intermediate, and output tensors: the API must be expanded to allow the application to do this. New functionality will need to be introduced to microTVM RPC server to provide for this (likely, the microTVM RPC server needs to accept the same parameters as the Executor API, and forward those along when the API is invoked).
Invoking GetFunction() for each operator library: requires excessive dynamic memory (the returned closure implies refcounting), and doesn’t buy us much because most operators are implemented by jumping the target_host CPU to the implementing PackedFunc. In the current TVM API, this piece allows for accelerator programming. Some replacement provision needs to be made here.

From this, I think we can see that the Executor initialization API needs to be reworked on microTVM. I would broaden this to include runtime initialization, because:

It’s all too easy to bring in hardware considerations at any point in this process:
- RAM banks may need to be turned on or brought out of retention a) at system startup b) between inferences.
- Accelerator programming will be part of initialization on some systems.
Often to hide e.g. startup latency, applications will want to handle hardware initialization at very early parts of the boot phase, so defining an API that requires waiting for e.g. said RAM banks to be available before starting other initialization could preclude some application- or SoC-specific init pattern.

Function Calling Convention

A key barrier to adopting W1b/c is that RPC requires the use of the PackedFunc calling convention while a firmware-facing C API is both more efficient and friendlier to developers using the standard C calling convention. Here are some thoughts towards unifying the two:

To start with, we have an invariant: we need to be able to call into operator implementations over RPC to implement autotuning and RPC-driven execution. So, when used with the RPC server, there must be at least some PackedFunc wrapper for each operator implementation.
The primary benefits of PackedFunc in the C++ runtime are:
- it’s compatible with the RPC layer
- it provides a standard calling convention, allowing the implementation to use any programming language. Since the C++ runtime directly invokes PF to offload operators to accelerators, the standard calling convention is particularly helpful.
- functions can be “monkey-patched” at runtime if needed.
In a standalone micro inference, none of these concerns apply. I would say that the PackedFunc calling convention doesn’t offer much benefit to implemented operator functions.
Given this, a natural next question is: is it possible to translate PackedFunc into two pieces:
1. An internal piece which uses standard C datatypes and calling convention
2. A PackedFunc wrapper for said internal piece, which could be included only when compiling with RPC server.
There are some examples with C++ PackedFunc of API styles that may be hard to translate. The most impactful example I can think of is the way that DLDevice are unpacked from GraphExecutor() PackedFunc args in a variadic fashion.

Aside from this, it seems fairly straightforward to do, and may improve optimization in the downstream compiler.

It seems like then, it should be possible to implement some type of “unpacked” calling convention when targeting the C runtime. To do so:

Define a name mangling scheme to translate PackedFunc names to C function names
Update codegen to produce the inner “unpacked” func
Add a flag to control generation of the PackedFunc wrappers.
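The three steps above can be sketched as follows. This is an illustration under assumptions, not the actual codegen output: `SketchValue` is a reduced stand-in for `TVMValue`, the `sketch_unpacked_` / `sketch_packed_` prefixes are an invented mangling scheme, and `SKETCH_WITH_PACKED_WRAPPERS` is a hypothetical flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for TVMValue and two type codes (assumptions for illustration). */
typedef union { int64_t v_int64; double v_float64; void* v_handle; } SketchValue;
enum { kSketchInt = 0, kSketchHandle = 3 };

/* Hypothetical mangling scheme: PackedFunc name "fused_add" -> two C symbols. */
#define SKETCH_UNPACKED_NAME(name) sketch_unpacked_##name
#define SKETCH_PACKED_NAME(name) sketch_packed_##name

/* Inner "unpacked" function: standard C types and calling convention. */
int SKETCH_UNPACKED_NAME(fused_add)(const float* a, const float* b, float* out, int32_t n) {
  assert(a != NULL && b != NULL && out != NULL);
  for (int32_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
  return 0;
}

/* Hypothetical flag: wrappers are generated only when the RPC server is built. */
#ifndef SKETCH_WITH_PACKED_WRAPPERS
#define SKETCH_WITH_PACKED_WRAPPERS 1
#endif

#if SKETCH_WITH_PACKED_WRAPPERS
/* Type-erased wrapper: validates the arguments, then forwards to the inner function. */
int SKETCH_PACKED_NAME(fused_add)(SketchValue* args, int* type_codes, int num_args,
                                  SketchValue* out_ret_value, int* out_ret_tcode,
                                  void* resource_handle) {
  (void)resource_handle;
  if (num_args != 4) return -1;
  for (int i = 0; i < 3; ++i) {
    if (type_codes[i] != kSketchHandle) return -1;
  }
  if (type_codes[3] != kSketchInt) return -1;
  out_ret_value->v_int64 = 0;
  *out_ret_tcode = kSketchInt;
  return SKETCH_UNPACKED_NAME(fused_add)((const float*)args[0].v_handle,
                                         (const float*)args[1].v_handle,
                                         (float*)args[2].v_handle,
                                         (int32_t)args[3].v_int64);
}
#endif
```

With the flag set to 0, a standalone build contains only `sketch_unpacked_fused_add`, which the downstream compiler can inline like any other C function.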

Reworking the Initialization APIs

There are three core areas of concern in reworking the initialization APIs:

C0. The existing runtime contains some pieces which are undesirable in a standalone inference application:
- PackedFunc lookup tables (bloated, complex; in standalone inference, function call is a solved problem in micro-land)
- Pieces of the runtime intended to support the RPC server (e.g. TVMFuncGetGlobal, TVMAPIGetLastError, RPCTimeEvaluator, etc)
- Some NDArray functions (e.g. NDArray_Load, etc).
C1. How should we supply backing memory for tensors (input, intermediate, output) to executor instances?
C2. How, if at all, should the executor be involved with initialization (e.g. either initializing hardware, or providing software hooks, both at runtime startup and just before inference)?

C0 can be addressed by splitting common into two pieces:

crt_backend_api.c and things required from this (except TVMBackendGetFuncFromEnv, see below). TVMBackend functions may be called from generated code, therefore of all API pieces, this one should absolutely belong with the standalone deployment subset.
the rest, which can go with the RPC superset

C1: In a W1b unified API world, concern C1 is more closely tied to GraphPlanMemory. However, at present, only GraphExecutor consumes the output of GraphPlanMemory. In a micro world, the application must consume that output. The core thing we need to do to bridge the gap between an internally-consumed format which requires dynamic memory and a micro-friendly API is to make the output of GraphPlanMemory a data structure that makes sense for the application to consume. This would give the application control over the intermediate and output tensors, and require future changes to the memory planner to be cognizant of application requirements via unit tests.
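One possible shape for an application-consumable memory plan is a constant table of offsets and sizes. The sketch below is hypothetical: `SketchTensorPlan`, the helper functions, and the table values are made up for illustration and do not reflect the actual output of GraphPlanMemory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application-readable memory plan: one entry per tensor. */
typedef struct {
  size_t offset; /* byte offset into the application's arena */
  size_t nbytes; /* tensor size in bytes */
} SketchTensorPlan;

/* Example table as a code generator might emit it; the values are made up.
 * Entries 0 and 2 share storage because their lifetimes do not overlap. */
static const SketchTensorPlan kSketchPlan[] = {
    {0, 256},   /* intermediate 0 */
    {256, 128}, /* intermediate 1 */
    {0, 64},    /* output, reuses the storage of intermediate 0 */
};
enum { kSketchNumTensors = sizeof(kSketchPlan) / sizeof(kSketchPlan[0]) };

/* Arena size the application must provide: the maximum of offset + nbytes. */
size_t sketch_plan_required_bytes(const SketchTensorPlan* plan, size_t num_tensors) {
  assert(plan != NULL);
  size_t required = 0;
  for (size_t i = 0; i < num_tensors; ++i) {
    size_t end = plan[i].offset + plan[i].nbytes;
    if (end > required) required = end;
  }
  return required;
}

/* Resolve a tensor to a pointer inside the arena.
 * Returns NULL if the index is out of range or the arena is too small. */
void* sketch_plan_tensor_ptr(const SketchTensorPlan* plan, size_t num_tensors, size_t index,
                             uint8_t* arena, size_t arena_bytes) {
  assert(plan != NULL && arena != NULL);
  if (index >= num_tensors) return NULL;
  if (plan[index].offset + plan[index].nbytes > arena_bytes) return NULL;
  return arena + plan[index].offset;
}
```

Because the plan is plain data, the application can size a static arena at build time and a unit test can pin the format down against future planner changes.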

Additionally towards C1, we should implement SetInputZeroCopy from the C++ GraphExecutor, and should probably actually just replace SetInput with that as the standard way to set an input tensor. This gives the application control over the input tensor.
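The difference between a copying and a zero-copy input setter can be illustrated with a toy executor. This is a sketch under assumptions: `SketchExecutor` and the three functions are hypothetical and only mimic the idea behind SetInput and SetInputZeroCopy, not their real signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { kSketchInputLen = 4 };

/* Hypothetical executor holding a single float input of fixed length. */
typedef struct {
  float staging[kSketchInputLen]; /* only needed by the copying variant */
  const float* input;             /* what the operators actually read */
} SketchExecutor;

/* Copying variant: costs a memcpy plus a staging buffer the size of the input. */
void sketch_set_input(SketchExecutor* exec, const float* data) {
  memcpy(exec->staging, data, sizeof(exec->staging));
  exec->input = exec->staging;
}

/* Zero-copy variant: the application keeps ownership of `data` and must keep it
 * alive and unmodified until inference has finished. */
void sketch_set_input_zero_copy(SketchExecutor* exec, const float* data) {
  exec->input = data;
}

/* Toy "inference": the sum of the input elements. */
float sketch_run(const SketchExecutor* exec) {
  assert(exec->input != NULL);
  float acc = 0.0f;
  for (int i = 0; i < kSketchInputLen; ++i) acc += exec->input[i];
  return acc;
}
```

In the zero-copy variant the executor needs no input storage of its own, which is what gives the application control over where the input tensor lives.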

C2. This one needs some input from the community. Here are some possible ways I could envision the executor interacting with the SoC during “initialization,” “pre-inference,” and “post-inference:”

Powering RAM, or bringing it in/out of retention, for parameter/input loading.
Providing some signal to any hardware involved before starting a computation and after it’s finished.
Providing hardware vendors a designated place for code that brings accelerators between e.g. reset → active → sleeping → active states.
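One way such hooks could look is a small table of application-supplied callbacks that the executor invokes at initialization and around each inference. This is only a sketch to seed discussion: `SketchPlatformHooks`, the hook names, and the tracing callbacks are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook table filled in by the application or SoC vendor. */
typedef struct {
  int (*on_init)(void* ctx);        /* e.g. power RAM banks, program accelerators */
  int (*pre_inference)(void* ctx);  /* e.g. wake accelerator, leave retention */
  int (*post_inference)(void* ctx); /* e.g. put accelerator back to sleep */
  void* ctx;                        /* opaque application state */
} SketchPlatformHooks;

/* A missing hook is treated as a successful no-op. */
static int sketch_call_hook(int (*hook)(void*), void* ctx) {
  return hook != NULL ? hook(ctx) : 0;
}

int sketch_runtime_init(const SketchPlatformHooks* hooks) {
  assert(hooks != NULL);
  return sketch_call_hook(hooks->on_init, hooks->ctx);
}

/* Run one inference bracketed by the pre/post hooks. */
int sketch_run_with_hooks(const SketchPlatformHooks* hooks, int (*run)(void)) {
  assert(hooks != NULL && run != NULL);
  int rc = sketch_call_hook(hooks->pre_inference, hooks->ctx);
  if (rc != 0) return rc;
  rc = run();
  int post_rc = sketch_call_hook(hooks->post_inference, hooks->ctx);
  return rc != 0 ? rc : post_rc;
}

/* Example application-side hooks that only record the order of calls. */
typedef struct { char log[8]; int len; } SketchTrace;

static int sketch_record(void* ctx, char tag) {
  SketchTrace* trace = (SketchTrace*)ctx;
  if (trace->len >= (int)sizeof(trace->log)) return -1;
  trace->log[trace->len++] = tag;
  return 0;
}

int example_on_init(void* ctx) { return sketch_record(ctx, 'I'); }
int example_pre_inference(void* ctx) { return sketch_record(ctx, '<'); }
int example_post_inference(void* ctx) { return sketch_record(ctx, '>'); }
int example_run(void) { return 0; }
```

Keeping the hooks optional lets an application that initializes its hardware early in boot pass NULL and skip them entirely.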

Summary

I prefer W1c: implementing a small standalone inference-focused API and wrapping that in Module to allow AOT to be driven over RPC when needed.
As part of this: splitting the existing src/runtime/crt/common into a standalone piece (which includes TVMBackend APIs plus any needed to support this standalone piece) and an RPC piece (which includes the Module infrastructure).
The initialization APIs need to be reworked to allow for application-defined management of the Tensor memory, and some consideration for e.g. init hooks for deeper hardware integration should be provided.
Ultimately, this should result in a compact C-style API for standalone inference as proposed both here and in the STM32 port.

Would love to get everyone’s thoughts on this assessment and the suggested path forward! It’s possible this should split into its own RFC, so we can do that if people feel that would be more appropriate.

tqchen · April 14, 2021, 8:34pm

Thanks @areusch, I agree some form of W1c is great. I still think it would be beneficial to dissect and discuss the following options for implementing function calls:

F0: PackedFunc (bad for uTVM land)
F1: CPackedFunc (directly call into the symbol but still uses the TVMValue and type code encoding).
F2: Normal unpacked function per C API

I agree that F0 should not be mandatory in embedded land so we don’t have to do string lookups. I still think we should standardize on F1 if possible, as it still provides a common standard for type-erased functions (e.g. a developer can use that to hand-wire customized operators without framework noticing the particular signature of the function).

Assuming we do link-time optimization, and the compiler inlines the function and promotes the argument stores and loads to registers, the end effect of F1 could get close to F2.

areusch · April 14, 2021, 8:46pm

From a codegen perspective, it seems like we shouldn’t need to choose between F1 and F2–these can just be different types of tir.call_* nodes in the TIR. A rewrite pass should be able to detect if the target function is codegen’d by a TVM generator which supports unpacked calls, and rewrite the TIR (and set function attributes) to reflect that. Viewed like this, F2 just becomes a further potential optimization of what we have in F1.

The main question in my mind is how we should expose the APIs. The standalone, firmware-facing API could either:

be implemented by the AOT codegen directly, if it supports it
be defined by a wrapper

In the case that we want to broaden the firmware-facing API beyond something that can be placed behind runtime.Module (e.g. something that may return a user-defined datatype e.g. get_info), we will need a wrapper implementation. So, it seems pretty inevitable we may start with a wrapper, and then potentially remove wrapping as we can.

One complication comes if we want to implement an API prefix e.g. ai_<model_name>_create. We may need a wrapper template in this case.
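A wrapper template for a per-model prefix can be sketched with preprocessor token pasting. The following is an assumption-laden illustration: `SketchModel`, the macro, the `_get_name` function, and the model names `resnet` and `kws` are invented; only the `ai_<model_name>_create` naming pattern comes from the discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-model handle. */
typedef struct {
  const char* name;
  int created;
} SketchModel;

/* Hypothetical wrapper template: stamps out ai_<model_name>_* functions. */
#define SKETCH_DEFINE_MODEL_API(model_name)                          \
  int ai_##model_name##_create(SketchModel* model) {                 \
    assert(model != NULL);                                           \
    model->name = #model_name;                                       \
    model->created = 1;                                              \
    return 0;                                                        \
  }                                                                  \
  const char* ai_##model_name##_get_name(const SketchModel* model) { \
    return model->name;                                              \
  }

/* Two example instantiations, as if two models were linked into one firmware image. */
SKETCH_DEFINE_MODEL_API(resnet)
SKETCH_DEFINE_MODEL_API(kws)
```

The same effect could also be achieved by having the code generator emit the prefixed symbols directly; the macro is just the smallest way to show the naming scheme.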

“Assuming we do link-time optimization, and the compiler inlines the function and promotes the argument stores and loads to registers, the end effect of F1 could get close to F2.”

One thing I have learned is not to depend on the compiler to do anything. It’s great if it can optimize this for us, but we may find that a compiler which implements this correctly doesn’t support all possible targets. So, I’d prefer to be as explicit as possible in codegen.

kparzysz · April 14, 2021, 8:46pm

I think the external interface of the AOT module should follow the CPackedFunc format, since this is the interface currently used for all other externally visible functions. There would be an entry point function with a predefined name, and the order and meaning of its parameters could be established in a way analogous to how it currently works with tvm.build.

areusch · April 14, 2021, 9:40pm

So would this suggestion then be compatible with including a small possibly-templated shim layer to translate between CPackedFunc and first-class C datatypes (e.g. int, float, DLTensor)? My feeling is that invoking CPackedFunc directly from firmware is burdensome, but perhaps not a big deal if handled by a shim layer.
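Such a shim could look roughly like the sketch below: firmware sees a typed function, and the packing into the type-erased form is hidden behind it. `SketchValue` is a reduced stand-in for `TVMValue`, and `sketch_model_run_packed` and `sketch_model_run` are hypothetical names, not generated TVM symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for TVMValue and three type codes (assumptions for illustration). */
typedef union { int64_t v_int64; double v_float64; void* v_handle; } SketchValue;
enum { kSketchInt = 0, kSketchFloat = 2, kSketchHandle = 3 };

/* Stand-in for a generated, type-erased entry point: out[i] = scale * in[i]. */
int sketch_model_run_packed(SketchValue* args, int* type_codes, int num_args,
                            SketchValue* out_ret_value, int* out_ret_tcode,
                            void* resource_handle) {
  (void)out_ret_value;
  (void)out_ret_tcode;
  (void)resource_handle;
  if (num_args != 4) return -1;
  if (type_codes[0] != kSketchHandle || type_codes[1] != kSketchHandle) return -1;
  if (type_codes[2] != kSketchInt || type_codes[3] != kSketchFloat) return -1;
  const float* in = (const float*)args[0].v_handle;
  float* out = (float*)args[1].v_handle;
  int64_t n = args[2].v_int64;
  double scale = args[3].v_float64;
  for (int64_t i = 0; i < n; ++i) out[i] = (float)(scale * in[i]);
  return 0;
}

/* Shim: the only function that firmware code has to call. */
int sketch_model_run(const float* in, float* out, int32_t n, float scale) {
  assert(in != NULL && out != NULL);
  SketchValue args[4];
  int type_codes[4] = {kSketchHandle, kSketchHandle, kSketchInt, kSketchFloat};
  SketchValue ret;
  int ret_tcode = 0;
  args[0].v_handle = (void*)in;
  args[1].v_handle = out;
  args[2].v_int64 = n;
  args[3].v_float64 = scale;
  return sketch_model_run_packed(args, type_codes, 4, &ret, &ret_tcode, NULL);
}
```

The packing cost is paid once per shim call, so for an entry point that is invoked once per inference it is small compared to the work done inside the model.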

kparzysz · April 14, 2021, 10:33pm

When you talk about firmware, are you thinking about firmware calling the graph runner? If such calls are infrequent, CPackedFunc should not be that much of a burden. I’d like to stick with CPackedFunc, because we already use it.

Now, having said that, let me first describe how I view the execution model for AOT, because I’m not sure if we have the same ideas in mind.

Long story short:

Tight coupling of the runner function with the operator functions.
Limited set of functions exported from a module.
Use assistance from targets to generate cross-target calls.

When thinking about AOT, I specifically have inference in mind, i.e. execution of a graph with a predefined set of parameters, where the inputs (activations) will vary from run to run. In that scenario, the graph executing function will be a part of the generated module. For the moment I’m assuming no accelerators.

Here we only have one entry point to the model: the runner function. The operator functions are no longer accessible from outside of the module. Because of that, the calling conventions used there ultimately don’t matter[1]. This may be a consideration for the codegen, though, since we want to make it possible to inline operator functions into the runner function (what I mean specifically is that we should make it reasonably easy for the compiler (TVM, LLVM, etc.) to see through the function calls). The runner function, however, would then still follow some established API, and for this I propose CPackedFunc.

With accelerators, we would have a device module, except this one would have several externally visible functions. Functions not visible outside of this module (i.e. callable only from inside of it) would not have any prescribed calling convention[1].

By the way, this all follows the shared library model, where certain functions are “exported”, i.e. callable from outside of it, while the rest are “internal”. The exported functions should follow a known convention, while the internal are unrestricted (at least from the point of view of a proposal like this).
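In C terms, the exported/internal split can be sketched with `static` operator functions and a single externally visible runner. The names and the toy graph (scale by 2, then ReLU) are hypothetical, and the runner uses a plain C signature here only to keep the sketch short; the proposal above would give it the CPackedFunc signature instead.

```c
#include <assert.h>
#include <stddef.h>

/* Internal operators: `static`, so they are not visible outside this translation
 * unit and their calling convention is free to change. */
static void op_scale(const float* in, float* out, size_t n, float factor) {
  for (size_t i = 0; i < n; ++i) out[i] = in[i] * factor;
}

static void op_relu(const float* in, float* out, size_t n) {
  for (size_t i = 0; i < n; ++i) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}

/* The only exported symbol: a hypothetical runner computing relu(2 * x).
 * `scratch` is caller-provided storage for the intermediate tensor. */
int sketch_module_run(const float* input, float* output, float* scratch, size_t n) {
  assert(input != NULL && output != NULL && scratch != NULL);
  op_scale(input, scratch, n, 2.0f);
  op_relu(scratch, output, n);
  return 0;
}
```

Because the operators are internal, the compiler can see through the calls and inline them into the runner without any cross-module convention getting in the way.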

If the runner function were external to the module, then all operator functions would need to be exported from it, which would come with a performance penalty.

The remaining part is cross-target function calls. I’m going to assume that there are no cycles between targets, with respect to function calls (i.e. if A calls B, then B cannot call A, similarly no “A calls B, B calls C, C calls A”, and so on). Here is where things get complicated, because we don’t want to use the GetFunction method. I think we will need to let each target implement the exact call sequence: we do that now using GetFunction at runtime; instead, we’d need each target to apply its own codegen to generate the appropriate call sequence.

[1] We could still have some predefined convention, but it would only be a convention of convenience. This would make it possible to change it in the future without breaking things for users.

kparzysz · April 14, 2021, 10:42pm

I know that there is already a prototype of it, but I think we should really just define the set of functions that an AOT runtime should implement, and then let each target implement its own. The more lower-level things are, the more hardware-specific they get.

giuseros · April 15, 2021, 11:23am

Hi all,

Thanks for the interesting discussion! So, we all agree that there are three points here:

Backend API
Calling convention
Runtime API

As things stand today, memory allocation is part of the backend API. This will change with global memory planning, but for now I would tend to skip the C1 concern about memory and discuss it in a separate RFC.

I agree that the way forward is some type of W1{a,b,c}. I will try to sum up the points in my own words; correct me if I am wrong.

Backend API

As @areusch correctly pointed out, this is the API that the generated code uses as utility functions. From my POV this is the real runtime of our compiler. Our approach would be to reduce this API, at least for AoT, to a minimum set of functions:

Memory allocation (for now)
Parallel execution
What about errors? For now the error API (setLastError, getLastError) is part of the c_runtime_api, but the setter should be part of the backend API and the getter of the runtime_api.
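The setter/getter split could be sketched as below. The function names are hypothetical (they are not the existing TVM error functions), and the fixed-size static buffer is an assumption chosen to avoid dynamic allocation.

```c
#include <assert.h>
#include <string.h>

enum { kSketchErrorBufBytes = 64 };

/* Single static buffer: no dynamic allocation, messages are truncated to fit. */
static char g_sketch_last_error[kSketchErrorBufBytes];

/* Backend side (hypothetical name): callable from generated code. */
void sketch_backend_set_last_error(const char* msg) {
  assert(msg != NULL);
  strncpy(g_sketch_last_error, msg, sizeof(g_sketch_last_error) - 1);
  g_sketch_last_error[sizeof(g_sketch_last_error) - 1] = '\0';
}

/* Runtime side (hypothetical name): called by the application after a failure. */
const char* sketch_runtime_get_last_error(void) {
  return g_sketch_last_error;
}

/* Example of a generated operator reporting an error through the backend API. */
int sketch_op_checked_div(int a, int b, int* out) {
  if (b == 0) {
    sketch_backend_set_last_error("division by zero");
    return -1;
  }
  *out = a / b;
  return 0;
}
```

With this split, generated code depends only on the setter, so the getter can live in whichever runtime layer the application links in.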

I agree with @areusch about having a minimal crt_backend_api.c and an rpc_backend_api.c that adds more functionality.

Would it make sense to have a crt_backend_api.h as well, or should we still reuse the original c_backend_api.h interface? I am asking because that interface defines things like the TVMValue 64-bit union, which clashes with a minimalist embedded environment (more on that in a second). Also, for now TVMBackendAllocWorkspace accepts an int64 parameter, which would be nice to remove (even though we will remove it once we do global memory planning).

Calling convention

This is the bit I think is more controversial. So, to make things clear, when we refer to a CPackedFunc, we are talking about:

typedef int (*TVMBackendPackedCFunc)(TVMValue* args, int* type_codes, int num_args,
                                     TVMValue* out_ret_value, int* out_ret_tcode,
                                     void* resource_handle);

From what I understand @kparzysz you are saying that the internal functions don’t matter (they will be static private functions) but that the runner function should have this signature. Can I ask you why? Actually, we are trying to move toward a C compatible API for both internal operators and the external runner function:

typedef int (*TVMBackendCFunc)(void** inputs, void** outputs, void* resource_handle);

For three main reasons:

TVMValue is a 64-bit union, and most embedded devices will struggle to deal with int64.
TVMValues need to be packed/unpacked every time for every operator call.
If the user has got an array of inputs, and passes it to the runner function, the runner function needs to dynamically create an array of TVMValues on the stack and populate it with the inputs from the user.

@areusch @tqchen I guess that we can add a TVMBackendPackedCFunc wrapper function if the RPC side of things needs it. But is there any reason for not having the low-level function written in plain C without typeids?

Runtime API

Now that can be quite a long conversation if we want to draft it all here. Let’s try to define some guidelines, taking as an example the function to “run” a network:

The main function exposed to the user should be the tvm_runtime_run in the style of the bundle_static.c
The RPC API and graph executor can easily implement tvm_runtime_run, indeed this is already done in bundle_static.c
AOT will use that to act on the internal structure tvm_model_t which has been code generated.

Actually, I think the best way to move forward would be to sketch something and progressively agree on how it looks. Have you got any suggestions on how to do this sort of “sketching”? Maybe a draft PR not meant to be merged but only to spark discussion?

Thanks again, this is all very interesting.

Giuseppe

tqchen · April 15, 2021, 1:01pm

Thanks @giuseros. To just discuss a bit the difference between the following two type-erased interfaces:

X0: Function with typeid

typedef int (*TVMBackendPackedCFunc)(TVMValue* args, int* type_codes, int num_args,
                                     TVMValue* out_ret_value, int* out_ret_tcode,
                                     void* resource_handle);

X1: Function without typeid

typedef int (*TVMBackendCFunc)(void** inputs, void** outputs, void* resource_handle);

Discussions

The main reason that we chose X0 over X1 is that X0 gives a safe interface for both static and dynamic languages. Imagine a case where the caller passes in an integer but the callee expects a float. X1 won’t provide any mechanism to detect such a mismatch at runtime (if debug is enabled), while X0 allows us to provide type checking to do so.
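The integer-versus-float mismatch can be sketched as follows. `SketchValue` and the type codes are reduced stand-ins for `TVMValue` and its codes, and `SKETCH_RELEASE` is a hypothetical macro that strips the check in release builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for TVMValue and two type codes (assumptions for illustration). */
typedef union { int64_t v_int64; double v_float64; void* v_handle; } SketchValue;
enum { kSketchInt = 0, kSketchFloat = 2 };

/* X0-style callee: expects one float argument and can verify that at runtime. */
int sketch_square_x0(SketchValue* args, int* type_codes, int num_args,
                     SketchValue* out_ret_value, int* out_ret_tcode,
                     void* resource_handle) {
  (void)resource_handle;
  (void)type_codes;
  (void)num_args;
#ifndef SKETCH_RELEASE /* hypothetical flag: drop the check in release builds */
  if (num_args != 1 || type_codes[0] != kSketchFloat) return -1;
#endif
  out_ret_value->v_float64 = args[0].v_float64 * args[0].v_float64;
  *out_ret_tcode = kSketchFloat;
  return 0;
}

/* X1-style callee: it has to trust that inputs[0] really points at a float. */
int sketch_square_x1(void** inputs, void** outputs, void* resource_handle) {
  (void)resource_handle;
  assert(inputs != NULL && outputs != NULL);
  const float* x = (const float*)inputs[0];
  float* y = (float*)outputs[0];
  *y = (*x) * (*x);
  return 0;
}
```

In the X0 version a caller that packs an integer gets an error code back; in the X1 version the same mistake would silently reinterpret the bytes.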

X0 is also a complete packed representation, in the sense that a function defined in X0 can be directly exposed as a PackedFunc without any additional wrapping, giving the benefit of, say, debugging on the host first before running on embedded, and invoking through RPC or Python. A function exposed in X1 would require manual rewrapping, which defeats the purpose of type erasure.

Overhead of X0 over X1

Making a function call in the X1 convention also requires stack allocations (for the arrays of inputs and outputs). Ignoring compiler optimizations, consider the number of bytes needed on a 32-bit system for a function call with n arguments and one output. A call in the form of X0 costs 8 * n + 4 * n + 4 + 8 + 4 + 4 = 12 * n + 20 bytes of space, while a call in the form of X1 costs 4 * n + 4 * n + 4 = 8 * n + 4 bytes. Say n = 3 (a typical number): X0 then costs 56 bytes, while X1 costs 28 bytes.
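As a quick check of the arithmetic, the sketch below evaluates the two byte-count expressions term by term, exactly as they are written in the post, for a 32-bit target. The function names and the per-term labels are assumptions made for illustration; in particular, the second 4 * n term of X1 is taken as written.

```c
#include <assert.h>

/* Sizes on the assumed 32-bit target: int and pointers are 4 bytes, TVMValue is 8. */
enum { kValueBytes = 8, kIntBytes = 4, kPtrBytes = 4 };

/* X0: 8*n + 4*n + 4 + 8 + 4 + 4, which simplifies to 12*n + 20. */
unsigned x0_call_bytes(unsigned n) {
  return kValueBytes * n /* args array */
         + kIntBytes * n /* type_codes array */
         + kIntBytes     /* num_args */
         + kValueBytes   /* out_ret_value */
         + kIntBytes     /* out_ret_tcode */
         + kPtrBytes;    /* resource_handle */
}

/* X1: 4*n + 4*n + 4, which simplifies to 8*n + 4. */
unsigned x1_call_bytes(unsigned n) {
  assert(n > 0);
  return kPtrBytes * n   /* inputs array */
         + kPtrBytes * n /* second 4*n term, as written in the post */
         + kPtrBytes;    /* resource_handle */
}
```

For n = 3 the two functions return 56 and 28 bytes respectively, so under these assumptions X0 costs about twice as much as X1 per call.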

The memory overhead of the function call is negligible compared to the follow-up memory operations on NDArrays (which normally contain KBs or more of memory).

Considering Compiler Optimizations

Additionally, this assumes no compiler optimization. Let us think about what will happen when the compiler inlines the call. In such cases the function call becomes a series of stores to and loads from stack memory.

With a typical mem2reg pass, the stack slots can be promoted to registers. If the callee code (operator) is compiled to not read the typeid in release mode, then the assignment to typeid becomes dead code and will be eliminated by the compiler. Similarly, the argument passing could become direct argument passing in this case. Considering these compiler optimizations, both X0 and X1 would allow optimizations that lead to code performing similarly to the final direct call form.

Back to the topic of int64: note that most of our operator calls only use void* as arguments and not int64. The cost of int64 is mainly a memory overhead of passing arguments rather than an ALU concern. Both the caller and callee can feel free to convert to int32 after the passing, and assign int32 fields during passing (via an int32[2] array, assuming little endian). Again, considering the possible compiler optimizations above, this could turn out to be a no-op.

Even in the absence of compiler optimizations, the general overhead incurred by X0 is not much larger than that of X1, and it would be great to do some benchmarking on workloads of interest to see the difference.

kparzysz · April 15, 2021, 1:21pm

The reason is that this is the signature that we’re currently using for all TVM-generated functions. To call a function like that, all you need to know is what parameters the function takes. This is essentially the definition of a TVM-specific ABI, but on the level of the C language. This allows compilers to generate the same steps to call any function.

If you want to develop a different ABI, it will need to be used for all external calls to functions generated by TVM, or else we lose that universality.

tqchen · April 15, 2021, 1:32pm

To further illustrate what I meant by the impact of compiler optimizations, I ran the following quick experiment:

// test.cc
#include <tvm/runtime/c_runtime_api.h>

// implement the function using PackedCFunc calling convention
inline int PackedCFunc(TVMValue* args, int* type_codes, int num_args,
                       TVMValue* out_ret_value, int* out_ret_tcode,
                       void* resource_handle) {
  int v0 = args[0].v_int64;
  void* ptr = args[1].v_handle;
  out_ret_tcode[0] = kTVMArgInt;
  out_ret_value[0].v_int64 = v0 + ((int*)ptr)[0];
  return 0;
}

// return x + ptr[0];
extern "C" int AddViaPackedCFunc(int x, int* ptr) {
  TVMValue args[2];
  int type_codes[2];
  TVMValue out_ret_value;
  int out_ret_tcode;

  args[0].v_int64 = x;
  args[1].v_handle = ptr;
  type_codes[0] = kTVMArgInt;
  type_codes[1] = kTVMOpaqueHandle;
  PackedCFunc(args, type_codes, 2, &out_ret_value, &out_ret_tcode, nullptr);
  return out_ret_value.v_int64;
}
}

Result of Clang

Run command

clang-10 -O2 -S -emit-llvm -I /path/to/tvm/3rdparty/dlpack/include -I /path/to/tvm/include -o test.ll test.cc   
cat test.ll

This gives the following code (metadata removed):

; Function Attrs: nounwind readonly uwtable
define dso_local i32 @AddViaPackedCFunc(i32 %0, i32* %1) local_unnamed_addr #0 {
  %3 = load i32, i32* %1, align 4, !tbaa !2
  %4 = add nsw i32 %3, %0
  ret i32 %4
}

Result of GCC

gcc -O2 -S  -I /path/to/tvm/3rdparty/dlpack/include -I /path/to/tvm/include -o test.s test.cc
cat test.s

	.file	"test.cc"
	.text
	.p2align 4,,15
	.globl	AddViaPackedCFunc
	.type	AddViaPackedCFunc, @function
AddViaPackedCFunc:
.LFB1:
	.cfi_startproc
	movl	(%rsi), %eax
	addl	%edi, %eax
	ret
	.cfi_endproc
.LFE1:
	.size	AddViaPackedCFunc, .-AddViaPackedCFunc
	.ident	"GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
	.section	.note.GNU-stack,"",@progbits

Discussions

As we can see, this is essentially equivalent to the direct C call:

int Add(int x, int *ptr) {
  return x + ptr[0];
}

To understand what is happening under the hood, the following optimizations lead to this result:

Inlining, which inlines the call
Mem2reg, which promotes the stack stores/loads to register operations
Dead-code elimination, which eliminates the unused type id
Reasoning about int32 passing via int64: cast<int32>(cast<int64>(x)) = x when x is i32
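
The steps above can be traced by hand. The following is only an illustrative sketch, not compiler output: PackedValue is a hypothetical stand-in for TVMValue so that the snippet builds without the TVM headers, and the type-code constants are placeholders.

```c
#include <stdint.h>

/* Hypothetical stand-in for TVMValue, so the sketch builds without TVM headers. */
typedef union {
  int64_t v_int64;
  void* v_handle;
} PackedValue;

/* Placeholder type codes; the real ones are defined in c_runtime_api.h. */
enum { kArgInt = 0, kOpaqueHandle = 3 };

/* Step 1, inlining: the callee body is pasted into the caller. */
int AddAfterInlining(int x, int* ptr) {
  PackedValue args[2];
  int type_codes[2];
  PackedValue ret_value;
  int ret_tcode;

  args[0].v_int64 = x;
  args[1].v_handle = ptr;
  type_codes[0] = kArgInt;        /* never read by the callee body: dead store */
  type_codes[1] = kOpaqueHandle;  /* never read by the callee body: dead store */

  int v0 = (int)args[0].v_int64;
  ret_tcode = kArgInt;            /* never read by the caller: dead store */
  ret_value.v_int64 = v0 + ((int*)args[1].v_handle)[0];

  (void)type_codes;  /* silence unused-variable warnings in this sketch */
  (void)ret_tcode;
  return (int)ret_value.v_int64;
}

/* Steps 2 to 4: stack slots become registers, the dead stores vanish, and the
   int -> int64 -> int round trip folds away. */
int AddAfterCleanup(int x, int* ptr) {
  return x + ptr[0];
}
```

Both functions return the same value for the same inputs; the second is what the optimizer is expected to reach from the first.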

Compiling Code with TypeId Checking

The compiler can do even smarter things when the code already includes the type code check. We can run the same experiment on the following code and find that the result is the same as the direct C call without any type id checking. This is because the compiler can inline, constant-fold, and then dead-code eliminate the type id checks.

#include <cstdio>                                                                                                                                                                             
#include <tvm/runtime/c_runtime_api.h>                                                                                                                                                        
                                                                                                                                                                                              
inline int PackedCFunc(TVMValue* args, int* type_codes, int num_args,                                                                                                                         
                       TVMValue* out_ret_value, int* out_ret_tcode,                                                                                                                           
                       void* resource_handle) {                                                                                                                                               
  int v0 = args[0].v_int64;                                                                                                                                                                   
  void* ptr = args[1].v_handle;                                                                                                                                                               
  // error check that can be dead-code eliminated                                                                                                                                             
  if (type_codes[0] != kTVMArgInt) {                                                                                                                                                          
    return -1;                                                                                                                                                                                
  }                                                                                                                                                                                           
  if (type_codes[1] != kTVMOpaqueHandle) {                                                                                                                                                    
    return -1;                                                                                                                                                                                
  }                                                                                                                                                                                           
                                                                                                                                                                                              
  out_ret_tcode[0] = kTVMArgInt;                                                                                                                                                              
  out_ret_value[0].v_int64 = v0 + ((int*)ptr)[0];                                                                                                                                             
  return 0;                                                                                                                                                                                   
}                                                                                                                                                                                             
                                                                                                                                                                                              
// return x + ptr[0];                                                                                                                                                                         
extern "C" int AddViaPackedCFunc(int x, int* ptr) {                                                                                                                                           
  TVMValue args[2];                                                                                                                                                                           
  int type_codes[2];                                                                                                                                                                          
  TVMValue out_ret_value;                                                                                                                                                                     
  int out_ret_tcode;                                                                                                                                                                          
                                                                                                                                                                                              
  args[0].v_int64 = x;                                                                                                                                                                        
  args[1].v_handle = ptr;                                                                                                                                                                     
  type_codes[0] = kTVMArgInt;                                                                                                                                                                 
  type_codes[1] = kTVMOpaqueHandle;                                                                                                                                                           
                                                                                                                                                                                              
  // note: check can be dead-code eliminated                                                                                                                                                  
  if (PackedCFunc(args, type_codes, 2, &out_ret_value, &out_ret_tcode, nullptr) != 0) {                                                                                                       
    printf("error\n");                                                                                                                                                                        
  }                                                                                                                                                                                           
  if (out_ret_tcode != kTVMArgInt) {                                                                                                                                                          
    printf("error\n");                                                                                                                                                                        
  }                                                                                                                                                                                           
  return out_ret_value.v_int64;                                                                                                                                                               
}

areusch · April 15, 2021, 8:14pm

@giuseros @tqchen @kparzysz

Lots to catch up on here, thanks for the great discussions!

@giuseros:

I agree with @areusch about having a minimal crt_backend_api.c and an rpc_backend_api.c that adds more functionality.

Just to clarify: I think that the TVMBackend functions are strictly meant to be called from generated code, and I think that all generated code (be it RPC or otherwise) should use the same backend API (however, it’s possible some functions may be left unimplemented depending on the compilation settings). I think it’s possible we might have multiple implementations, but I’d expect that we would need just one for both Graph and AOT executor on micro.

I do think that crt_runtime_api.c is mostly specific to RPC-driven execution, and will likely be moved almost wholesale out of the backend library.

From my POV this is the real runtime of our compiler.

This is actually kinda true–crt_backend_api.h should define the “runtime.” Right now due to the memory allocation though, it spills over into platform.h and other libraries of CRT. Perhaps we can find a way to consolidate these so the organization makes more sense.

Our approach would be to reduce, at least for AoT, this API to a minimum set of functions

Memory allocation (for now)

Parallel execution

What about errors? For now the error API (setLastError, getLastError) is part of the c_runtime_api, but the setter should be part of the backend API and the getter of the runtime_api.

So I agree with this, and I think that crt_backend_api.h should define anything needed here. The implementation of that, for the C runtime, should eventually be possible to do on terms favorable to embedded development (e.g. memory management should be something we can handle without dynamic memory).

Errors: the current convention is for every PackedFunc to return an int32_t that is non-zero if an error occurs. On microTVM, I have slightly abused this to allow us to return tvm_crt_error_t from TVMBackend functions (e.g. TVMBackendAllocWorkspace). In my experience, the more detail an error can provide, the better off you are when debugging it. So I think we should explicitly adopt, somewhere, the convention that 0 is success and non-zero is a runtime-specific error code; this would then enshrine our ability to continue doing this.
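
To make the proposed convention concrete, here is a minimal sketch. Everything in it is hypothetical: the example_error_t values are not the values of tvm_crt_error_t, and ExampleAllocWorkspace is a toy fixed-pool allocator, not TVMBackendAllocWorkspace.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error codes for illustration; not the values of tvm_crt_error_t. */
typedef enum {
  kExampleOk = 0,
  kExampleErrOutOfMemory = 1,
  kExampleErrBadArgument = 2
} example_error_t;

/* Hypothetical fixed pool standing in for a workspace allocator. */
static uint8_t g_pool[64];
static size_t g_pool_used = 0;

/* Returns 0 on success, or a specific non-zero code describing the failure. */
int32_t ExampleAllocWorkspace(size_t nbytes, void** out_ptr) {
  if (out_ptr == NULL || nbytes == 0) {
    return kExampleErrBadArgument;
  }
  if (nbytes > sizeof(g_pool) - g_pool_used) {
    return kExampleErrOutOfMemory;
  }
  *out_ptr = &g_pool[g_pool_used];
  g_pool_used += nbytes;
  return kExampleOk;
}

/* Callers only test for non-zero and pass the code upward unchanged, so the
   detail survives to whatever logging the firmware provides. */
int32_t ExampleOperator(size_t scratch_bytes) {
  void* scratch = NULL;
  int32_t err = ExampleAllocWorkspace(scratch_bytes, &scratch);
  if (err != 0) {
    return err;
  }
  ((uint8_t*)scratch)[0] = 0;
  return kExampleOk;
}
```

The point of the sketch is that a caller checking only for "non-zero" stays compatible with existing PackedFunc callers, while a logging layer can still tell an allocation failure from a bad argument.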

Errors in embedded systems can take weeks to reproduce and may be caught only through logging systems built to store records of them in production. My strong opinion is that we cannot presume that the debugger will be attached at the time an error occurs, and must export enough information to allow developers to act accordingly. In particular, a stacktrace is not guaranteed to be available.

TVMValue is an int64 union, and most embedded devices will struggle to deal with int64.

I am not so sure about this, @giuseros. I agree that int64_t values are difficult to deal with on a 32-bit embedded system, but for the most part we are not passing int64_t to PackedFunc. We are passing void* v_handle. Would that not be dealt with by writing 0 to the upper word and just issuing a store-word instruction to the lower word?

I am happy to be proven wrong here–I stand by my earlier assertion of not depending on the compiler :). Just my feeling is that if we are doing lots of int64_t math, that is one thing, but stuffing the occasional parameter into int64_t at function call boundaries doesn’t seem particularly slow considering the remainder of the function bodies.
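
A small sketch of the handle-passing case, assuming a union shaped like TVMValue. ExampleValue is a hypothetical stand-in, and which instructions are actually emitted depends on the target and compiler.

```c
#include <stdint.h>

/* Hypothetical stand-in for TVMValue: a 64-bit union that can also hold a pointer. */
typedef union {
  int64_t v_int64;
  double v_float64;
  void* v_handle;
} ExampleValue;

/* Pass a pointer through the union. Clearing v_int64 first zeroes any bytes the
   pointer does not cover on a 32-bit target; the pointer store itself is then a
   single pointer-sized write. */
void ExamplePackHandle(ExampleValue* slot, void* handle) {
  slot->v_int64 = 0;
  slot->v_handle = handle;
}

/* Reading the handle back touches only the pointer-sized member. */
void* ExampleUnpackHandle(const ExampleValue* slot) {
  return slot->v_handle;
}
```

No 64-bit arithmetic is involved in either direction; the union only determines the size and alignment of the slot.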

One area I think we could obviously improve is TVMBackendAllocWorkspace (e.g. replace the size with size_t), but also it’s possible we will just move away from that function for micro/AOT, so it may just be a non-issue for us? Given it would be a cross-runtime change, I sort of prefer to see how this shakes out first.

Function call signatures

@kparzysz

When you talk about firmware, are you thinking about firmware calling the graph runner? If such calls are infrequent, CPackedFunc should not be that much of a burden. I’d like to stick with CPackedFunc , because we already use it.

Yes, though in this case it could be either graph executor (née runner) or AOT executor. The current plan of record for TVM is that all executors are to implement the Module-based Model Runtime Interface. We can consider changes to that interface, but that’s a separate RFC. I don’t believe there’s any reason why we can’t define a single interface which both AOT and Graph executors can implement. Implementing to a single interface gives significant benefits in the RPC and non-micro use cases of AOT.

I’ll address CPackedFunc vs another below:

@areusch @tqchen I guess that we can add a TVMBackendPackedCFunc wrapper function if the RPC side of things needs it. But is there any reason for not having the low-level function written in plain C without typeids?

@kparzysz Here I don’t think there is a huge runtime burden on the CPU to use TVMBackendPackedCFunc as the firmware-facing API. The burden is more on the developer–for instance, we have currently packed_func.h checked-in as utility functions to help with PackedFunc calls. But all of this is a lot to swallow when you could just be calling standard C functions.

It’s also difficult to document. One thing that really annoys me about the interface exported from libtvm_runtime.so and libtvm.so is that we actually just don’t document it. Why? Because there is no doxygen for PackedFunc. Yet, that is supposed to be the core TVM interface. In this situation, we actually wrote Python wrappers which also add convention (example) on top of the PackedFunc interface.

So while I understand it is a common calling convention in TVM, I don’t view what we have today for PackedFunc as sufficient to be a developer-facing interface. Adding a C shim layer is the same thing we do in Python today. I think we need significant improvements to our documentation generator tooling if we want to stick with a pure PackedFunc developer-facing API.
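
For illustration, a C shim layer of the kind mentioned above might look like the following sketch. All names (ExampleValue, example_scale_packed, example_scale) and the type-code constants are hypothetical; this is not the interface in packed_func.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins so the sketch builds without TVM headers. */
typedef union { int64_t v_int64; void* v_handle; } ExampleValue;
enum { kExampleInt = 0, kExampleHandle = 3 };

/* A packed-style entry point: type-erased, so the signature documents nothing. */
int32_t example_scale_packed(ExampleValue* args, int* type_codes, int num_args,
                             ExampleValue* out_ret_value, int* out_ret_tcode) {
  if (num_args != 3 || type_codes[0] != kExampleHandle ||
      type_codes[1] != kExampleInt || type_codes[2] != kExampleInt) {
    return -1;
  }
  int32_t* data = (int32_t*)args[0].v_handle;
  int64_t len = args[1].v_int64;
  int64_t factor = args[2].v_int64;
  for (int64_t i = 0; i < len; ++i) {
    data[i] = (int32_t)(data[i] * factor);
  }
  out_ret_value->v_int64 = len;
  *out_ret_tcode = kExampleInt;
  return 0;
}

/* The shim: an ordinary, documentable C signature that hides the packing.
   Returns 0 on success. */
int32_t example_scale(int32_t* data, int32_t len, int32_t factor) {
  ExampleValue args[3];
  int type_codes[3] = {kExampleHandle, kExampleInt, kExampleInt};
  ExampleValue ret;
  int ret_tcode;
  args[0].v_handle = data;
  args[1].v_int64 = len;
  args[2].v_int64 = factor;
  return example_scale_packed(args, type_codes, 3, &ret, &ret_tcode);
}
```

Firmware code would call example_scale(buf, n, k) and never see the argument arrays, which is the same role the Python wrappers play today.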

Actually, I think the best way to move forward would be to sketch something and progressively agree on how it looks. Have you got any suggestions on how to do this sort of “sketching”? Maybe a draft PR not meant to be merged but only to spark discussion?

Let’s spin the firmware-facing API into a separate RFC/PoC PR. @giuseros, do you want to propose something?

TVMValues need to be packed/unpacked every time for every operator call

On the question of how should TVM internally call operator functions: I don’t have an opinion either way. I think that TQ has done some experimentation to show that with recent gcc and clang, the compiler does a lot to optimize out PackedFunc calls. However, I want to point out that on most embedded projects, developers have no say in the compiler they use, either because switching compilers would leave them without support from the SoC vendor for the hardware abstraction libraries, or because project timelines do not allow for the work needed to switch compilers. So I stick with my position that we should not expect anything from the compiler.

However, I think any such implementation would need to be done in the TIR lowering/codegen phase, rather than treating the calling convention as a codegen property. Keeping this implementation in TIR allows us to remove the PackedFunc overhead when it is meaningful to do so; while maintaining the ability to link with arbitrary PackedFunc should that be necessary, and maintaining the ability to wrap an arbitrary operator implementation in PackedFunc.

On this topic, @tqchen says:

X1 won’t provide any mechanism to detect such mismatch during runtime(if debug is enabled) while X0 allows us to provide type checking to do so.

In the C runtime case, I would posit that we are almost always building a static binary, so we can rely on the compiler’s type-checker to solve this problem and the linker to resolve references to functions.

PackedFunc lookup

Finally, a somewhat related topic is PackedFunc lookup. As we discussed before, we’ll likely replace string lookup with a function mangling scheme. This is yet another departure from the current implementation of tir.call_packed_lowered. In this modification, we avoid calling TVMBackendGetFuncFromEnv and instead just refer to the PackedFunc by mangled name. It seems that implementing this, we could stick with tir.call_packed_lowered, but create an attribute to indicate the choice between TVMBackendGetFuncFromEnv and the mangled function name.
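
The difference between the two lookup styles can be sketched in plain C. This is an assumption-laden illustration: the registry, the operator, and its mangled-looking name are all made up, and no TVM API is used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int32_t (*example_func_t)(int32_t);

/* Hypothetical operator; the mangled-looking name is made up for the sketch. */
int32_t example_mod_fused_add_one(int32_t x) { return x + 1; }

/* Path A: a registry searched by name at run time (the cost of string lookup). */
typedef struct {
  const char* name;
  example_func_t func;
} example_registry_entry_t;

static const example_registry_entry_t kExampleRegistry[] = {
  {"fused_add_one", example_mod_fused_add_one},
};

example_func_t ExampleLookup(const char* name) {
  size_t n = sizeof(kExampleRegistry) / sizeof(kExampleRegistry[0]);
  for (size_t i = 0; i < n; ++i) {
    if (strcmp(kExampleRegistry[i].name, name) == 0) {
      return kExampleRegistry[i].func;
    }
  }
  return NULL;
}

int32_t ExampleCallByString(int32_t x) {
  example_func_t f = ExampleLookup("fused_add_one");
  return f != NULL ? f(x) : -1;
}

/* Path B: refer to the symbol directly; the linker resolves it and no table,
   strcmp, or name strings are needed in the binary. */
int32_t ExampleCallBySymbol(int32_t x) {
  return example_mod_fused_add_one(x);
}
```

Path B also lets the linker report a missing operator at build time, whereas Path A can only fail at run time.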

Next steps

I’ve replied to the PR thread with my opinion on how best to push forward on the existing AOT PR. That would be: merging the TIR top-level function plus anything strictly needed to test that, and leaving out pieces such as memory management and calling convention changes. I am not proposing we drop that work; instead I propose we keep discussing either here or as separate RFC, and implement those in follow-on PRs.

If we are in agreement with those next steps for the PR, I see these follow-on discussions:

Memory management: how best to implement embedded-friendly memory planning in place of GraphPlanMemory
Calling convention: Shall we propose a compilation mode for use with AOT which selectively drops PackedFunc for internal operator implementations?
Firmware-facing API: Which interface shall we expose to the firmware developer, and how? How should this impact the Module-based Model Runtime Interface?
Initialization flow: How shall we initialize the SoC (and any hardware needed)? Which hooks shall we provide from the executor to integrate with SoC?

I think each of these should be its own RFC in a follow-on thread. Currently, my time’s occupied on the Project API implementation–would anyone here like to spearhead any of the RFCs? If so, could you reply here indicating so, and after posting the follow-on RFC, link it from this thread.

Thanks! Andrew

tqchen · April 15, 2021, 9:29pm

Thanks @areusch. Just to clarify, X1 refers to another variant of the type-erased API (where inputs and outputs are passed by address), not the raw plain C API that has the right type signature.

Internal to the API it is certainly OK to go and try out the raw C API generation. We still need some form of type-erased interface for RPC and FFI purposes, as well as a way to provide customized operators (which do not have a fixed C signature). So my main comment is to use X0 when some form of type erasure is needed. Additionally, the experiments show that the compiler can be quite good at optimizing things, so it may not be necessary to go all the way to the raw C API (at least not immediately) until benchmarks suggest such a need.

areusch · April 15, 2021, 9:50pm

Ah sorry, my previous reply did not treat your suggestion as a type-erased ABI.

The point I was trying to make was that we pretty much only pass DLTensor* for operator implementations, and the remaining PackedFunc types can be expressed as C arguments in a fairly straightforward fashion.

Given we have no need for an FFI in micro land, it should be entirely possible to rely on the underlying C compiler to do type signature checking, at least for implemented operators. I’m not saying we necessarily should do this, nor am I suggesting we go away from a type-erased ABI more generally. But, I’m wondering if there is much value in that in micro-land, and I think that it may be possible to maintain code to generate a non-type-erased ABI so long as we keep the constraint that all functions generated with non-type-erased signatures can also be generated as PackedFunc.
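
One way to picture that constraint is a typed operator with a packed wrapper emitted next to it. The sketch below is hypothetical: ExampleTensor is a heavily reduced stand-in for DLTensor, ExampleValue for TVMValue, and neither function reflects actual TVM codegen output.

```c
#include <stdint.h>

/* Hypothetical, heavily reduced stand-ins for DLTensor and TVMValue. */
typedef struct { void* data; int64_t num_elems; } ExampleTensor;
typedef union { int64_t v_int64; void* v_handle; } ExampleValue;
enum { kExampleHandle = 3 };

/* Non-type-erased operator: the C compiler checks every call site. */
int32_t example_relu(const ExampleTensor* in, ExampleTensor* out) {
  if (in->num_elems != out->num_elems) {
    return -1;
  }
  const float* src = (const float*)in->data;
  float* dst = (float*)out->data;
  for (int64_t i = 0; i < in->num_elems; ++i) {
    dst[i] = src[i] > 0.0f ? src[i] : 0.0f;
  }
  return 0;
}

/* Packed wrapper that could be generated alongside it for RPC or custom-operator
   use; it only unpacks arguments and forwards to the typed function. */
int32_t example_relu_packed(ExampleValue* args, int* type_codes, int num_args) {
  if (num_args != 2 || type_codes[0] != kExampleHandle ||
      type_codes[1] != kExampleHandle) {
    return -1;
  }
  return example_relu((const ExampleTensor*)args[0].v_handle,
                      (ExampleTensor*)args[1].v_handle);
}
```

A deployment build could link only example_relu, while an RPC or debugging build would also include the wrapper.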

giuseros · April 16, 2021, 12:23pm

Hi all, about the PR, I agree. I will try to remove the controversial bits, so that we can have a separate discussion about the different points. Thanks @areusch, I really think this is a good idea.

While here, let’s try to agree on a general principle. @tqchen your experiment is very insightful, but I would like to consider things also from another perspective. The input packing/unpacking C code generated for a 32 bit convolution operator is the following:

TVM_DLL int32_t fused_nn_contrib_conv2d_NCHWc_1(void* args, void* arg_type_ids, int32_t num_args, void* out_ret_value, void* out_ret_tcode, void* resource_handle) {
  void* arg0 = (((TVMValue*)args)[0].v_handle);
  int32_t arg0_code = ((int32_t*)arg_type_ids)[(0)];
  void* arg1 = (((TVMValue*)args)[1].v_handle);
  int32_t arg1_code = ((int32_t*)arg_type_ids)[(1)];
  void* arg2 = (((TVMValue*)args)[2].v_handle);
  int32_t arg2_code = ((int32_t*)arg_type_ids)[(2)];
  void* placeholder = (((DLTensor*)arg0)[0].data);
  void* arg0_shape = (((DLTensor*)arg0)[0].shape);
  void* arg0_strides = (((DLTensor*)arg0)[0].strides);
  int32_t dev_id = (((DLTensor*)arg0)[0].device.device_id);
  void* placeholder1 = (((DLTensor*)arg1)[0].data);
  void* arg1_shape = (((DLTensor*)arg1)[0].shape);
  void* arg1_strides = (((DLTensor*)arg1)[0].strides);
  void* conv2d_NCHWc = (((DLTensor*)arg2)[0].data);
  void* arg2_shape = (((DLTensor*)arg2)[0].shape);
  void* arg2_strides = (((DLTensor*)arg2)[0].strides);
  if (!(arg0_strides == NULL)) {
  }
  if (!(arg1_strides == NULL)) {
  }
  if (!(arg2_strides == NULL)) {
  }
  //.....

I guess we agree that this code does not look “nice” (and I saw cases where the “header” of the function was even bigger). I understand that a good C compiler will get rid of many things here, but my general point is:

isn’t it better to have control on the native code generated so that we don’t have to rely on the low level compiler?

There is no “yes or no” answer to this question, but I would say that if this is technically simple to do, and we can provide the user with a choice, then I think we should do it.

Indeed, we have tried this out, and it has proven quite simple to get rid of all the data packing/unpacking. In terms of user experience, we can provide the user with a compiler flag use_packed_api that can turn on/off the packing header.

What do you think?

tqchen · April 16, 2021, 1:35pm

Thanks @giuseros for the great discussions.

I ran a quick experiment to make the code match the generated code (e.g. use void* instead of TVMValue* in arguments), and the result remains the same. So the low-level compiler will indeed get rid of most, if not all, of the overhead, modulo readability concerns (which are not a goal of the C codegen atm).

On the question of whether we should enable generating raw C API operators, I think the answer is possibly yes. As a matter of fact, we already have the ability to generate the unpacked API at the pass level by toggling the num_unpacked_args parameter of MakePackedAPI, and the generated code will follow the C API convention. So if the AOT implementation wants to call those generated functions directly via the C ABI, we should certainly support that.

On the other hand, we should also recognize the importance of a type-erased API in cases that need one:

C0: When we first run on an x86 host to test out the flow and want to poke around the code using Python.
C1: When we want to allow third-party developers to plug in their own operators without tweaking the AOT compiler (which requires a stable type-erased interface for those operator functions).
C2: When we want to poke around what is going on in each of the unit functions running through RPC.
C3: Accelerator runtimes that have their own calling/launching convention will almost certainly require some form of wrapping. However, we cannot introduce a raw C API for every variant of accelerator launching; that would burden the compiler with updating the runtime interface for each new accelerator feature.

My main point is that we should adopt CPackedFunc in cases where we need a type-erased API, instead of inventing a new one. Additionally, the experiments show that even lazily adopting CPackedFunc everywhere may not be a bad starting point after all. In the meantime, we should also remember that the goal of developer experience can mean a continuous shift of targets:

make sure things run on local setting first
try things out through rpc
final deployment stage

So my main concern is consistency. Ideally, anything that is supported in our official C embedded API should be runnable on an x86 platform, directly invokable from Python, and able to be poked remotely through RPC from my Jupyter notebook. Having an additional (optional) PackedFunc layer is important to enable such a development experience.

When it comes to the deployment stage (after all tests are over), I agree that enabling C API calls could be useful (even though not strictly necessary in light of the low-level compiler optimizations). My main worry is that when we start to think about the raw C API, we go too far and create a divergence that rules out the development experience above, which I think is as important as final deployment efficiency.

To re-summarize my thinking so far:

If the operator is only being used internally, we can use num_unpacked_args to translate directly into C API calls.
For public-facing APIs, e.g. set/get/run and RPC-facing benchmarking, we will need to optionally support a CPackedFunc variant so we continue to enable things like Python poking and RPC autotuning, and then get rid of it in the final deployment stage.
For cases where a type-erased API is needed (e.g. accelerators, custom ops), try not to re-invent another type-erased API, and directly adopt CPackedFunc. Compiler optimizations will likely get rid of most of the overhead.

tqchen · April 16, 2021, 1:48pm

To add to the point on PackedFunc string lookup: I agree we should get rid of it. The idea is to distinguish PackedFunc (which is opaque and requires TVMFuncCall) from CPackedFunc (which has a stable C API signature and symbol, so we can jump directly to the symbol).

We can introduce tir.call_cpacked_lowered and tir.call_cpacked to clearly distinguish the two and give them different lowering paths. We should definitely go with tir.call_cpacked_lowered in AOT so the above inline optimization can happen.

areusch · April 16, 2021, 3:48pm

@tqchen @giuseros Thanks TQ for that summary–I agree with everything you said here. Adopting a way to selectively remove CPackedFunc could be useful.

Another thing I want to bring up here is that when you’re running your experiments, it would be great to verify any of your findings at -O0. Why? Because this is the “debug” use case, corresponding to when something is going wrong with the generated operator code. It should always be possible to run our generated operators at -O0 in order to trace problems with the implementation. It’s particularly upsetting as a developer when you’re trying to debug a problem with an operator, but the debug (e.g. unoptimized) build behaves differently than the release (e.g. -O2 or -Os) build. While this is a general problem, what we could be doing here is drastically changing the stack requirements between -O0 and -O2. In many cases, this requires bumping the global memory allotted for stack (if you are lucky and working with a platform that can catch stack overflow; otherwise, you just get weird failures that occur at some point seemingly unrelated to when the stack overwrote global memory).
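
As a rough way to reason about the stack concern, the sketch below estimates the packing state a caller holds on its stack for one packed call when nothing is optimized away. It is a back-of-the-envelope assumption, not a measurement: ExampleValue is a hypothetical stand-in for TVMValue, and padding, spills, and callee frames are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for TVMValue. */
typedef union { int64_t v_int64; double v_float64; void* v_handle; } ExampleValue;

/* Bytes of packing state a caller keeps on its stack for one packed call at -O0:
   the argument array, the type-code array, and the return value with its type
   code. At -O2 with inlining, all of this may disappear entirely. */
size_t ExamplePackedCallStackBytes(size_t num_args) {
  return num_args * sizeof(ExampleValue)  /* args[] */
       + num_args * sizeof(int)           /* type_codes[] */
       + sizeof(ExampleValue)             /* out_ret_value */
       + sizeof(int);                     /* out_ret_tcode */
}
```

Comparing this estimate against the stack budget of the target gives a first hint of whether a -O0 build needs a larger stack than the release build.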

I think @giuseros was mainly trying here to address the question of: “is it valuable to TVM to generate readable code?” My inclination is to say “yes where practical” here, though I don’t know that we do this at main now. I think the cost of this is the additional complexity in CodeGenCHost and associated unit tests. My feeling is that this just hasn’t been addressed to date since compiler optimization does often make this a no-op; I’d still be in favor of making improvements here in the future.

We can introduce tir.call_cpacked_lowered and tir.call_cpacked to clearly distinguish the two and give them different lowering paths. We should definitely go with tir.call_cpacked_lowered in AOT so the above inline optimization can happen.

Yeah I agree with distinguishing between these two output paths in TIR rather than using a codegen flag.