Summary
This RFC proposes an alternative path through the compiler which can remove several core structures, such as TVMValue and DLTensor, from the output of the AOT compiler. Doing this removes the overheads introduced by DLTensor and the dependency on DLPack in the output code, and enables TVM to run without a runtime in embedded environments.
- Optional removal of DLTensor
- Optional removal of TVMValue
- Optional unpacking of function calls
Motivation
There are two main motivations here: user experience, and the structures which go unused in the eventual output.
User Experience
In many existing embedded applications, integrating third party code is not a straightforward process, as the system is often designed with several hard constraints. By reducing the number of files we need to transfer and ensuring the most transparent set of interfaces, we can minimise the overhead of integrating uTVM into an existing application. Reducing the overall dependencies, such as removing the need for DLPack, reduces the amount of foreign code required; reducing the overhead further may bring the integration down to a bare minimum, where features of the C runtime are not strictly required.
The debugging experience is also much better: by providing the raw unpacked functions throughout the embedded code, it’s easier to step through and understand where arguments come from and how they are used. The indirection which makes packed functions useful in a more dynamic environment is a hindrance when running the generated code more directly.
Unused Structures
When we pack and unpack values, we make use of the data portion of a DLTensor but nothing else, which leaves most of the structure unused. In embedded systems, space comes at an absolute premium, and this unused structure consumes precious bytes. Beyond that, DLPack itself is a third party integration which may change in size or shape. The same is true of TVMValue, which is aligned to a larger size than a pure pointer even though we only use the pointer aspect of it. Many of the DLTensor fields, such as the shape, are 64-bit values; this is optimal for modern 64-bit processors, but embedded processors are limited to 32-bit values with limited registers available for optimising calls.
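For reference, this is roughly the DLTensor layout from dlpack.h at the time of writing (paraphrased here, with my own comments); of all these fields, the generated operator code typically only dereferences data:

typedef struct {
  void* data;           // the only field the generated operators actually read
  DLContext ctx;        // device_type + device_id
  int ndim;
  DLDataType dtype;     // code, bits, lanes
  int64_t* shape;       // 64-bit values, even on 32-bit targets
  int64_t* strides;     // 64-bit values, usually NULL for compact tensors
  uint64_t byte_offset;
} DLTensor;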
The packing/unpacking itself can also require additional instructions, rather than being optimised for the registers set aside for function calls - for example, on a Cortex-M0 the arguments can be passed directly in registers r0-r3, rather than loading offsets of the first parameter into the remaining registers (void* arg1 = args[(0)]).
In my experiments, this scales with the number of operators and intermediary stacks required - for microspeech, stepping through these optimisations shows a reduction not only in code size but also in stack size (which is less obvious when looking at footprint sizes). Below is a table detailing the incremental effect of each optimisation and the cumulative impact of several optimisations applied together; all figures are in bytes. These savings translate not just into code size but also into cycle times and power usage; further, the stack savings would allow such a model to run under Zephyr on an M0, which by default is allocated only small stack sizes (see: zephyr/stm32f0_disco_defconfig at master · zephyrproject-rtos/zephyr · GitHub).
Model | Optimisations | Text | Data | BSS | Total | Individual Code Size Savings | Cumulative Code Size Savings | Max Stack Size | Individual Stack Savings | Cumulative Stack Savings |
---|---|---|---|---|---|---|---|---|---|---|
Microspeech | AOT+No DLTensor+No TVMValue+Unpacked | 40556 | 672 | 36 | 41264 | 200 | 560 | 48 | 96 | 616 |
Microspeech | AOT+No DLTensor+No TVMValue | 40756 | 672 | 36 | 41464 | 96 | 360 | 144 | 72 | 520 |
Microspeech | AOT+No DLTensor | 40852 | 672 | 36 | 41560 | 264 | 264 | 216 | 448 | 448 |
Microspeech | AOT | 41108 | 672 | 44 | 41824 | 0 | 0 | 664 | 0 | 0 |
I also tried using a single translation unit with all operators marked as static (simulating LTO); this wasn’t as effective, and it neither removes the DLPack dependency nor improves the debugging experience (figures again in bytes):
Model | Optimisations | Text | Data | BSS | Total | Individual Code Size Savings | Cumulative Code Size Savings | Max Stack Size | Individual Stack Savings | Cumulative Stack Savings |
---|---|---|---|---|---|---|---|---|---|---|
Microspeech | AOT+Single Translation Unit | 40532 | 672 | 44 | 41248 | 576 | 576 | 104 | 560 | 560 |
Guide-level explanation
When generating code from TVM, an optional flag (passed through to the .build function) indicates an embedded target, such as:
tvmc --target="c" --unpack-functions --executor=aot
tvmc --target="llvm" --unpack-functions --executor=aot
This will produce standalone code which is optimised for running directly on an embedded device. You can still compile it directly as usual, and functions that provide a PrimFunc as an entrypoint can be packed appropriately; the “packed function” API changes from using TVMValue/DLTensor as a proxy to using a plain pointer, moving from:
TVM_DLL int32_t fused_reshape(void* args, void* arg_type_ids, int32_t num_args, void* out_ret_value, void* out_ret_tcode, void* resource_handle) {
  void* arg0 = (((TVMValue*)args)[0].v_handle);
  int32_t arg0_code = ((int32_t*)arg_type_ids)[(0)];
  void* arg1 = (((TVMValue*)args)[1].v_handle);
  int32_t arg1_code = ((int32_t*)arg_type_ids)[(1)];
  void* placeholder = (((DLTensor*)arg0)[0].data);
  void* arg0_shape = (((DLTensor*)arg0)[0].shape);
  void* arg0_strides = (((DLTensor*)arg0)[0].strides);
  int32_t dev_id = (((DLTensor*)arg0)[0].ctx.device_id);
  void* T_reshape = (((DLTensor*)arg1)[0].data);
  void* arg1_shape = (((DLTensor*)arg1)[0].shape);
  void* arg1_strides = (((DLTensor*)arg1)[0].strides);
  if (!(arg0_strides == NULL)) {
  }
  if (!(arg1_strides == NULL)) {
  }
  ((float*)T_reshape)[(0)] = ((float*)placeholder)[(0)];
  return 0;
}
to this slimmer unpacked function API, which still has variables assigned to match the internal packed call but is incompatible with the dynamic loading approach:
TVM_DLL int32_t fused_reshape(void* arg0, void* arg1) {
  void* placeholder = arg0;
  int32_t dev_id = 0;
  void* T_reshape = arg1;
  ((float*)T_reshape)[(0)] = ((float*)placeholder)[(0)];
  return 0;
}
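For illustration, a caller can now pass buffers to the operator directly (buffer contents and sizes here are hypothetical, matching the single-element reshape above):

float input[1] = {1.0f};
float output[1];
fused_reshape(input, output);  /* copies input[0] into output[0] */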
Reference-level explanation
TVMValue
Other than changing the AOT output itself, there are two main files in TVM that have to be changed. lower_tvm_builtin.cc needs to be able to allocate a stack directly:
inline Stmt StackAlloca(tir::Var& stack_var, DataType stack_dtype, int num, tir::Stmt stmt) {
-  Array<PrimExpr> args = {StringImm(type), ConstInt32(num)};
-  return Call(DataType::Handle(), builtin::tvm_stack_alloca(), args);
+  stmt = tir::Allocate(
+      stack_var,
+      stack_dtype,
+      {PrimExpr(num)},
+      tir::const_true(),
+      stmt
+  );
+  stmt = tir::AttrStmt(stack_var, tir::attr::storage_scope, tir::StringImm("global"), stmt);
...
-  Var stack_shape = Var("stack_shape", DataType::Handle());
-  Var stack_array = Var("stack_array", DataType::Handle());
-  Var stack_value = Var("stack_value", DataType::Handle());
-  Var stack_tcode = Var("stack_tcode", DataType::Handle());
+  stack_shape_ = Var("stack_shape", PointerType(PrimType(DataType::Handle())));
+  stack_array_ = Var("stack_array", PointerType(PrimType(DataType::Handle())));
+  stack_value_ = Var("stack_value", PointerType(PrimType(DataType::Handle())));
+  stack_tcode_ = Var("stack_tcode", PointerType(PrimType(DataType::Handle())));
...
Stmt Build(Stmt stmt) {
-  stack_shape_ = Var("stack_shape", DataType::Handle());
-  stack_array_ = Var("stack_array", DataType::Handle());
-  stack_value_ = Var("stack_value", DataType::Handle());
-  stack_tcode_ = Var("stack_tcode", DataType::Handle());
+  stack_shape_ = Var("stack_shape", PointerType(PrimType(DataType::Handle())));
+  stack_array_ = Var("stack_array", PointerType(PrimType(DataType::Handle())));
+  stack_value_ = Var("stack_value", PointerType(PrimType(DataType::Handle())));
+  stack_tcode_ = Var("stack_tcode", PointerType(PrimType(DataType::Handle())));
  stmt = this->VisitStmt(stmt);
  // create a shape var if any shape is made (including scalar shapes)
  if (max_shape_stack_ != -1) {
-    stmt = LetStmt(stack_shape_, StackAlloca("shape", max_shape_stack_), stmt);
+    stmt = StackAlloca(stack_shape_, DataType::Handle(), max_shape_stack_, stmt);
  }
  if (max_array_stack_ != 0) {
-    stmt = LetStmt(stack_array_, StackAlloca("array", max_array_stack_), stmt);
+    stmt = StackAlloca(stack_array_, DataType::Handle(), max_array_stack_, stmt);
  }
  if (max_arg_stack_ != 0) {
-    stmt = LetStmt(stack_value_, StackAlloca("arg_value", max_arg_stack_), stmt);
-    stmt = LetStmt(stack_tcode_, StackAlloca("arg_tcode", max_arg_stack_), stmt);
+    stmt = StackAlloca(stack_value_, DataType::Handle(), max_arg_stack_, stmt);
+    stmt = StackAlloca(stack_tcode_, DataType::Handle(), max_arg_stack_, stmt);
  }
  return stmt;
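The intended effect on the generated C (a sketch, assuming the backend lowers the resulting tir::Allocate nodes into plain local arrays as it does for other allocations) is that the argument stacks become fixed-size local buffers instead of going through the tvm_stack_alloca builtin:

/* sketch of the lowered output for a call needing two packed arguments;
   previously these stacks were produced via the tvm_stack_alloca builtin */
void* stack_value[2];
int32_t stack_tcode[2];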
Once this is configured, the code generation in make_packed_api.cc needs changes to directly load from the stack variables:
-  Array<PrimExpr> call_args{v_packed_args, IntImm(DataType::Int(32), i),
-                            IntImm(DataType::Int(32), builtin::kTVMValueContent)};
   // load 64 bit version
   DataType api_type = APIType(t);
-  PrimExpr res = Call(api_type, builtin::tvm_struct_get(), call_args);
   // cast to the target version.
-  if (api_type != t) {
-    res = Cast(t, res);
-  }
+  auto res = tir::Load(api_type, v_packed_args, i, tir::const_true());
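In terms of the generated C (illustrative, reusing the packed example from earlier), the argument load then reduces from a read through the TVMValue union to a plain pointer load:

/* before: read the handle out of a TVMValue */
void* arg0 = (((TVMValue*)args)[0].v_handle);
/* after: treat args as a flat array of handles */
void* arg0 = ((void**)args)[0];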
DLTensor
A minimal way to implement this change is to switch the output bindings from using DLTensor to using a pointer; by changing only the output bindings, the internals of TVM can continue to use DLTensor for other passes such as constant folding. This requires changes in the AOT code generator (aot_codegen.cc) to remove DLTensor generation, in the packed function generator (make_packed_api.cc) to choose the correct binding, and lastly in the argument binder (arg_binder.cc) to surface this as an alternative.
An example of the smaller argument binder:
void ArgBinder::BindPointer(const Buffer& buffer, const PrimExpr& device_type,
                            const PrimExpr& device_id, const Var& handle,
                            const std::string& arg_name) {
  const Stmt nop = Evaluate(0);
  if (Bind_(buffer->data, handle, arg_name + ".data", true)) {
    Var vptr(buffer->data);
    def_handle_dtype_.Set(vptr, tir::TypeAnnotation(buffer->dtype));
    // mark alignment of external bufs
    init_nest_.emplace_back(AttrStmt(vptr, tir::attr::storage_alignment,
                                     IntImm(DataType::Int(32), buffer->data_alignment), nop));
  }
  Bind_(device_type, Integer(1), arg_name + ".device_type", true);
  Bind_(device_id, Integer(0), arg_name + ".device_id", true);
}
This removes all unnecessary binding of the DLTensor fields and binds the handle directly, instead of indexing into the DLTensor as BindDLTensor does:
if (Bind_(buffer->data, TVMArrayGet(DataType::Handle(), handle, builtin::kArrData),
          arg_name + ".data", true)) {
One issue is that device_type and device_id are checked later and must be bound to pass the invariant checks.
Unpacked AOT Entry Function
This allows us to call directly in with inputs and outputs, without packing them inside of DLTensor/TVMValue, using a signature similar to:
typedef int32_t(tvm_function_t)(void** inputs, void** outputs, void* resource_handle);
The advantage here is that the entry function can itself unpack the passed pointers directly and propagate the resource handle where required, so the application writer doesn’t need to. This differs from the operators, where the code generator knows the expected layout of the arguments.
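As a sketch of the calling side (the entry function name and buffer sizes here are hypothetical), an application drives the model with plain pointer arrays:

/* hypothetical entry function name emitted by the AOT codegen */
extern int32_t tvm_run_func(void** inputs, void** outputs, void* resource_handle);

float input_data[490];   /* model-specific size, illustrative only */
float output_data[4];    /* model-specific size, illustrative only */
void* inputs[] = {input_data};
void* outputs[] = {output_data};
int32_t status = tvm_run_func(inputs, outputs, NULL);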
To do this, the cleanest way seems to be providing a way for the AOT entry function to be skipped during the initial TIR passes in make_packed_api.cc:
// AOT entrypoint pipeline
auto aot_pass_list = {FilterCallingConv(CallingConv::kAOTEntryPoint)};
auto opt_aot = transform::Sequential(aot_pass_list);
auto mod_aot = opt_aot(mod_mixed);

mixed_pass_list.push_back(FilterNotCallingConv(CallingConv::kAOTEntryPoint));
if (pass_ctx->GetConfig<Bool>("tir.detect_global_barrier", Bool(false)).value()) {
  mixed_pass_list.push_back(tir::transform::ThreadSync("global"));
}
mixed_pass_list.push_back(tir::transform::ThreadSync("shared"));
mixed_pass_list.push_back(tir::transform::ThreadSync("warp"));
mixed_pass_list.push_back(tir::transform::InferFragment());
mixed_pass_list.push_back(tir::transform::LowerThreadAllreduce());
mixed_pass_list.push_back(tir::transform::MakePackedAPI(0));
mixed_pass_list.push_back(tir::transform::SplitHostDevice());
auto opt_mixed = transform::Sequential(mixed_pass_list);
mod_mixed = opt_mixed(std::move(mod_mixed));

// Reintroduce AOT function for host passes
mod_mixed->Update(mod_aot);
Unpacked Function Calls
It still makes sense to pass function calls through MakePackedAPI in order to allow the code generator to match up inputs and outputs effectively, but instead of producing the fully packed API we ask it to spread the arguments. A cleaner variant of:
@@ -120,6 +120,9 @@ PrimFunc MakePackedAPI(PrimFunc&& func, int num_unpacked_args) {
   auto* func_ptr = func.CopyOnWrite();
   const Stmt nop = Evaluate(0);
   int num_args = static_cast<int>(func_ptr->params.size());
+  if (executor == "aot") {
+    num_unpacked_args = num_args;
+  }
   ICHECK_LE(num_unpacked_args, num_args);
   int num_packed_args = num_args - num_unpacked_args;
With this, the AOT code generator can be updated to emit an appropriate TIR op instead. In this example I used call_extern directly, but for better alignment on meaning we could introduce a call_unpacked, or consider making the optimisation part of call_cpacked:
@@ -255,7 +255,7 @@ class AOTCodegen : public ExprVisitor {
   // Use tvm_call_packed to execute the function
   func_call_stmts.push_back(tir::Evaluate(
-      tvm::tir::Call(DataType::Int(32), tvm::tir::builtin::tvm_call_packed(), args)));
+      tvm::tir::Call(DataType::Int(32), tvm::tir::builtin::call_extern(), args)));
   tir::Stmt body = tir::SeqStmt(func_call_stmts);
   stmts_.push_back(body);
 }
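The net effect in the generated entry function is a direct call with spread arguments, rather than a round trip through the packed-call machinery; as a sketch (the sid_* buffer names are illustrative):

/* before: arguments marshalled into TVMValue/type-code stacks for tvm_call_packed */
/* after: a direct extern call with spread arguments */
fused_reshape(sid_1, sid_2);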
Prior Art
- This builds upon the AOT work: Implementing AOT in TVM
- Unlike TensorFlow Lite Micro, which uses a C++ API, the aim here is a much reduced API surface that introduces minimal overhead, enabling us to work within a smaller, more constrained environment.
- ST presented an alternative path by extending upon the existing constructs - [RFC] Standalone Code Generation and C Runtime for STM32 bare-metal devices
Drawbacks
- DLTensor contains all of the metadata about a tensor in memory; in languages which can make use of this, that information is likely to be lost. This can be mitigated by wrapping the minimal API, but the internal DLTensor checks inside the operators will be lost.
- TVMValue is a fundamental part of the C runtime, so this breaks compatibility: the output is designed to be standalone rather than dynamically linkable
- The packed API now has two variants: one fully packed, and one used as a translation layer between the operators and the calling code with spread arguments
Rationale and alternatives
Taking this approach has immediate benefits in reducing the overheads of a compiled TVM model, and it can be built upon if the abstraction is required in future. Alternative approaches considered:
- An embedded-specific DLTensor and TVMValue: resized variants designed to run on 32-bit and smaller embedded systems.
- Continue using DLTensor and TVMValue to align with the C runtime, continuing to incur the overhead and remaining unable to shrink to the smallest targets
- Do this all as the default AOT behaviour for now rather than providing a compiler flag
- Maintain the current packed function signature and just change the unwrapping from DLTensor to pointers - this is problematic with respect to the level at which the user is informed of an error: with a changed signature you’d get a link error, rather than a segfault, if you tried to use the output for dynamic linking
- Leverage link time optimisation to minimise final code size; this hasn’t been used significantly in the embedded space due to it potentially optimising code into slower areas of a device (see: The Best and Worst GCC Compiler Flags For Embedded | Interrupt, and Link Time Optimization | Mbed OS 6 Documentation)
Unresolved questions
- Can/should we remove the device_type/device_id which are checked in the invariants?
- What should the flag be called? --unpack-functions, --tiny, --no-runtime, --micro, etc.?
Future possibilities
- Introducing a tested and supported way to produce a minimal output gives us a number of possibilities for other deployment environments in the future, where we may want to toggle only certain pieces of this dynamism.
- By reducing the baseline footprint and enabling running within constrained stack sizes, TVM can continue to be optimised to run small models on resource-constrained devices - opening up the usage of TVM to many useful and innovative use cases