[RFC] [uTVM] AOT optimisations for Embedded Targets

Summary

This RFC proposes an alternative path in the compiler which can remove several core structures from the output of the AOT compiler, such as TVMValue and DLTensor. By doing this we remove any overheads introduced by DLTensor as well as any dependency on DLPack in the output code as well as enabling TVM to run without a runtime in embedded environments.

  • Optional removal of DLTensor
  • Optional removal of TVMValue
  • Optional unpacking of function calls

Motivation

There are two main motivations here, user experience and the structures which are unused in the eventual output.

User Experience

In many existing embedded applications, integrating third party code is not a straight forward process as the system is often designed with several hard constraints. By reducing the amount of files we need to transfer and ensuring the most transparent set of interfaces we can minimise the overhead of integrating uTVM into an existing application. This means that by reducing the overall dependencies, such as removing the need for DLPack, we can reduce the amount of foreign code required. By further reducing the amount of overhead we may be able to reduce the integration to a bare minimum where features of the C runtime are not strictly required.

The debugging experience is also much better, by providing the raw unpacked functions throughout the embedded code it’s easier to step through and understand where arguments come from and are being used. The indirection which makes packed functions useful in a more dynamic environment is hampering when running the generated code more directly.

Unused Structures

When we packed and unpack values, we make use of the data portion of a DLTensor but nothing else, this leads to a lot of the structure being unused. In embedded systems space can become an absolute premium and this unused structure consumes precious bytes. This could also be extended further in that DLPack itself is a third party integration which may change in size or shape. This is similarly true of TVMValue which is aligned to a larger size than a pure pointer yet we only use the pointer aspect of it. Many of the DLTensor fields are 64-bit values such as the shape , this is optimal for modern 64-bit processors but embedded processors are limited to 32-bit values with limited registers to use for optimising calls.

The packing/unpacking itself can also require additional instructions rather than being optimised for use with the registers set aside for function calls - for example on Cortex M0 the arguments can be passed in registers r0-r3 directly rather than loading offsets of the first parameter into the remaining registers (void* arg1 = args[(0)]).

In my experiments, this scales with the number of operators and intermediary stacks required - for microspeech by stepping through these optimisations you can see a reduction not only in code size but also in stack size (which is less obvious when looking at footprint sizes). Below is a table detailing the incremental effects of each optimisation and the cumulative impact of several optimisations applied. These represent not just code sizes but also cycle times and power usage, further to this the stack savings would allow such a model to run under Zephyr on an M0 which by default is allocated only small stack sizes (see: zephyr/stm32f0_disco_defconfig at master · zephyrproject-rtos/zephyr · GitHub).

Model Optimisations Text Data BSS Total Individual Code Size Savings Cumulative Code Size Savings Max Stack Size Individual Stack Savings Cumulative Stack Savings
Microspeech AOT+No DLTensor+No TVMValue+Unpacked 40556 672 36 41264 200 560 48 96 616
Microspeech AOT+No DLTensor+No TVMValue 40756 672 36 41464 96 360 144 72 520
Microspeech AOT+No DLTensor 40852 672 36 41560 264 264 216 448 448
Microspeech AOT 41108 672 44 41824 0 0 664 0 0

Also, I tried using a single translation unit with all operators marked as static (simulating LTO) which wasn’t as optimal and doesn’t remove the DLPack dependency nor improve the debugging experience:

Model Optimisations Text Data BSS Total Individual Code Size Savings Cumulative Code Size Savings Max Stack Size Individual Stack Savings Cumulative Stack Savings
Microspeech AOT+Single Translation Unit 40532 672 44 41248 576 576 104 560 560

Guide-level explanation

When generating code from TVM using an optional flag (passed through to the .build function), which represents an embedded target, such as:

  • tvmc --target="c" --unpack-functions --executor=aot
  • tvmc --target="llvm" --unpack-functions --executor=aot

This will produce standalone code which is optimised for running directly on an embedded device, you can still compile these directly as usual and functions that provide a PrimFunc as an entrypoint can be packed appropriately; the “packed function” API changes from using TVMValue/DLTensor as a proxy to a pointer, moving from:

TVM_DLL int32_t fused_reshape(void* args, void* arg_type_ids, int32_t num_args, void* out_ret_value, void* out_ret_tcode, void* resource_handle) {
  void* arg0 = (((TVMValue*)args)[0].v_handle);
  int32_t arg0_code = ((int32_t*)arg_type_ids)[(0)];
  void* arg1 = (((TVMValue*)args)[1].v_handle);
  int32_t arg1_code = ((int32_t*)arg_type_ids)[(1)];
  void* placeholder = (((DLTensor*)arg0)[0].data);
  void* arg0_shape = (((DLTensor*)arg0)[0].shape);
  void* arg0_strides = (((DLTensor*)arg0)[0].strides);
  int32_t dev_id = (((DLTensor*)arg0)[0].ctx.device_id);
  void* T_reshape = (((DLTensor*)arg1)[0].data);
  void* arg1_shape = (((DLTensor*)arg1)[0].shape);
  void* arg1_strides = (((DLTensor*)arg1)[0].strides);
  if (!(arg0_strides == NULL)) {
  }
  if (!(arg1_strides == NULL)) {
  }
  ((float*)T_reshape)[(0)] = ((float*)placeholder)[(0)];
  return 0;
}

To this a slimmer unpacked function API which still has variables assigned to match the internal packed call but is incompatible with the dynamic loading approach:

TVM_DLL int32_t fused_reshape(void* arg0, void* arg1) {
  void* placeholder = arg0;
  int32_t dev_id = 0;
  void* T_reshape = arg1;
  ((float*)T_reshape)[(0)] = ((float*)placeholder)[(0)];
  return 0;
}

Reference-level explanation

TVMValue

Other than changing the AOT output itself, there’s two main files in TVM that have to be changed. lower_tvm_builtin.cc needs to be able to allocate a stack directly:

inline Stmt StackAlloca(tir::Var& stack_var, DataType stack_dtype, int num, tir::Stmt stmt) {
- Array<PrimExpr> args = {StringImm(type), ConstInt32(num)};
- return Call(DataType::Handle(), builtin::tvm_stack_alloca(), args);
 
+   stmt = tir::Allocate(
+    stack_var,
+    stack_dtype,
+    {PrimExpr(num)},
+    tir::const_true(),
+    stmt
+  );
+  stmt = tir::AttrStmt(stack_var, tir::attr::storage_scope, tir::StringImm("global"), stmt);
 
....
 
-    Var stack_shape = Var("stack_shape", DataType::Handle());
-    Var stack_array = Var("stack_array", DataType::Handle());
-    Var stack_value = Var("stack_value", DataType::Handle());
-    Var stack_tcode = Var("stack_tcode", DataType::Handle());
+    stack_shape_ = Var("stack_shape", PointerType(PrimType(DataType::Handle())));
+    stack_array_ = Var("stack_array", PointerType(PrimType(DataType::Handle())));
+    stack_value_ = Var("stack_value", PointerType(PrimType(DataType::Handle())));
+    stack_tcode_ = Var("stack_tcode", PointerType(PrimType(DataType::Handle())));
 
...
 
   Stmt Build(Stmt stmt) {
-    stack_shape_ = Var("stack_shape", DataType::Handle());
-    stack_array_ = Var("stack_array", DataType::Handle());
-    stack_value_ = Var("stack_value", DataType::Handle());
-    stack_tcode_ = Var("stack_tcode", DataType::Handle());
+    stack_shape_ = Var("stack_shape", PointerType(PrimType(DataType::Handle())));
+    stack_array_ = Var("stack_array", PointerType(PrimType(DataType::Handle())));
+    stack_value_ = Var("stack_value", PointerType(PrimType(DataType::Handle())));
+    stack_tcode_ = Var("stack_tcode", PointerType(PrimType(DataType::Handle())));
     stmt = this->VisitStmt(stmt);
     // create a shape var if any shape is made (including scalar shapes)
     if (max_shape_stack_ != -1) {
-      stmt = LetStmt(stack_shape_, StackAlloca("shape", max_shape_stack_), stmt);
+      stmt = StackAlloca(stack_shape_, DataType::Handle(), max_shape_stack_, stmt);
     }
     if (max_array_stack_ != 0) {
-      stmt = LetStmt(stack_array_, StackAlloca("array", max_array_stack_), stmt);
+      stmt = StackAlloca(stack_array_, DataType::Handle(), max_array_stack_, stmt);
     }
     if (max_arg_stack_ != 0) {
-      stmt = LetStmt(stack_value_, StackAlloca("arg_value", max_arg_stack_), stmt);
-      stmt = LetStmt(stack_tcode_, StackAlloca("arg_tcode", max_arg_stack_), stmt);
+      stmt = StackAlloca(stack_value_, DataType::Handle(), max_arg_stack_, stmt);
+      stmt = StackAlloca(stack_tcode_, DataType::Handle(), max_arg_stack_, stmt);
     }
     return stmt;

Once this is configured, the code generation in make_packed_api.cc needs changes to directly load from the stack variables:

-    Array<PrimExpr> call_args{v_packed_args, IntImm(DataType::Int(32), i),
-                              IntImm(DataType::Int(32), builtin::kTVMValueContent)};
     // load 64 bit version
     DataType api_type = APIType(t);
-    PrimExpr res = Call(api_type, builtin::tvm_struct_get(), call_args);
     // cast to the target version.
-    if (api_type != t) {
-      res = Cast(t, res);
-    }
+    auto res = tir::Load(api_type, v_packed_args, i, tir::const_true());

DLTensor

A minimal way to implement this change is to change the output bindings from using DLTensor to using a pointer; by only changing the output bindings the internals of TVM can continue to use DLTensor for other passes such as the constant folding. This requires changes in the AOT code generator (aot_codegen.cc) to remove DLTensor generation, the packed function generator (make_packed_api.cc) to choose the correct binding and lastly changes in the argument binder (arg_binder.cc) to surface this as an alternative.

An example of the smaller argument binder:

void ArgBinder::BindPointer(const Buffer& buffer, const PrimExpr& device_type,
                             const PrimExpr& device_id, const Var& handle,
                             const std::string& arg_name) {
  const Stmt nop = Evaluate(0);
 
  if (Bind_(buffer->data, handle, arg_name + ".data", true)) {
    Var vptr(buffer->data);
    def_handle_dtype_.Set(vptr, tir::TypeAnnotation(buffer->dtype));
    // mark alignment of external bufs
    init_nest_.emplace_back(AttrStmt(vptr, tir::attr::storage_alignment,
                                     IntImm(DataType::Int(32), buffer->data_alignment), nop));
  }
 
  Bind_(device_type, Integer(1), arg_name + ".device_type", true);
  Bind_(device_id, Integer(0), arg_name + ".device_id", true);
}

This removes all unnecessary binding of the DLTensor data and binds the handle directly instead of using an array as BindDLTensor does:

if (Bind_(buffer->data, TVMArrayGet(DataType::Handle(), handle, builtin::kArrData),
          arg_name + ".data", true)) {

One issue is that device_type and device_id are checked later and must be bound to pass the invariant checks.

Unpacked AOT Entry Function

This allows us to call directly in with inputs and outputs without packing them inside of DLTensor/TVMValue, using a signature similar to:

typedef int32_t(tvm_function_t)(void** inputs, void** outputs, void* resource_handle);

The advantage here is that it can itself unpack the passed pointers directly and propogate the resource handle where required so the application writer doesn’t need to. This differs from operators where the code generator knows the expected layout of the arguments.

To do this, the cleanest way seems to be providing a way of the AOT entry function to be skipped during the initial tir passes in make_packed_api.cc :

// AOT entrypoint pipeline
auto aot_pass_list = {FilterCallingConv(CallingConv::kAOTEntryPoint)};
   
auto opt_aot = transform::Sequential(aot_pass_list);
auto mod_aot = opt_aot(mod_mixed);
   
mixed_pass_list.push_back(FilterNotCallingConv(CallingConv::kAOTEntryPoint));
if (pass_ctx->GetConfig<Bool>("tir.detect_global_barrier", Bool(false)).value()) {
  mixed_pass_list.push_back(tir::transform::ThreadSync("global"));
}
mixed_pass_list.push_back(tir::transform::ThreadSync("shared"));
mixed_pass_list.push_back(tir::transform::ThreadSync("warp"));
mixed_pass_list.push_back(tir::transform::InferFragment());
mixed_pass_list.push_back(tir::transform::LowerThreadAllreduce());
mixed_pass_list.push_back(tir::transform::MakePackedAPI(0));
mixed_pass_list.push_back(tir::transform::SplitHostDevice());
auto opt_mixed = transform::Sequential(mixed_pass_list);
mod_mixed = opt_mixed(std::move(mod_mixed));
// Reintroduce AOT function for host passes
mod_mixed->Update(mod_aot);

Unpacked Function Calls

It still makes sense to pass function calls through MakePackedAPI in order to allow the code generator to match up inputs and outputs effectively, but instead of providing the fully packed API we instead ask it to spread the arguments. A cleaner variant of:

@@ -120,6 +120,9 @@ PrimFunc MakePackedAPI(PrimFunc&& func, int num_unpacked_args) {
   auto* func_ptr = func.CopyOnWrite();
   const Stmt nop = Evaluate(0);
   int num_args = static_cast<int>(func_ptr->params.size());
+  if (executor == "aot") {
+    num_unpacked_args = num_args;
+  }
   ICHECK_LE(num_unpacked_args, num_args);
  
   int num_packed_args = num_args - num_unpacked_args;

With this, the AOT code generator can be updated to instead emit an appropriate tir op, in this example I used call_extern directly but for better alignment on meaning we could introduce a call_unpacked or consider the optimisation part of call_cpacked:

@@ -255,7 +255,7 @@ class AOTCodegen : public ExprVisitor {
  
     // Use tvm_call_packed to execute the function
     func_call_stmts.push_back(tir::Evaluate(
-        tvm::tir::Call(DataType::Int(32), tvm::tir::builtin::tvm_call_packed(), args)));
+        tvm::tir::Call(DataType::Int(32), tvm::tir::builtin::call_extern(), args)));
     tir::Stmt body = tir::SeqStmt(func_call_stmts);
     stmts_.push_back(body);
   }

Prior Art

Drawbacks

  • DLTensor contains all of the metadata about a tensor in memory, this means that in languages which can use this the information is likely to be lost. This can be mitigated by wrapping the minimal API but the internal DLTensor checks inside of the operators will be lost.
  • TVMValue is a fundamental part of the C runtime and thus this breaks compatibility as this is designed to be a standalone rather than dynamically linkable
  • The packed API now has two variants, one fully packed and one used as a translation layer between the operators and the calling code with spread arguments

Rationale and alternatives

Taking this approach has immediate benefits to reducing the overheads of a compiled TVM model and can be built upon if the abstraction is required in future. Alternative approaches considered:

  • An embedded-specific DLTensor and TVMValue, this would be a resized variant designed to run on 32-bit or less embedded systems.
  • Continue using DLTensor and TVMValue to align with the C runtime, continuing to incur the overhead and unable to shrink to the smallest targets
  • Do this all as the default AOT behaviour for now rather than providing a compiler flag
  • Maintain current packed function signature and instead just change the unwrapping from DLTensor to pointers - this is problematic as to which level the user is informed of an error, with a changed signature you’d get a link error rather than a segfault if you tried to use this for dynamic linking
  • Leverage link time optimisation to minimise final code size, this hasn’t been used significantly in the embedded space due to it potentially optimising code into slower areas of a device (see: The Best and Worst GCC Compiler Flags For Embedded | Interrupt / Link Time Optimization - API references and tutorials | Mbed OS 6 Documentation)

Unresolved questions

  • Can/should we remove the device_type/device_id which are checked in the invariants?
  • What should the flag be called? --unpack-functions, –tiny, --no-runtime, –micro etc?

Future possibilities

  • Introducing a tested and supported way to produce a minimal output gives us a number of possibilities for other deployment environments in the future where we may want to toggle only certain pieces of this dynamisism.
  • By reducing the baseline footprint and enabling running in constrained stack size, TVM can continue to be optimised to enable us to run small models on resource constrained devices - opening up the usage of TVM to a lot of useful and innovative use cases

cc: @manupa-arm @tqchen @areusch @giuseros @stoa

6 Likes

Thanks @Mousius for bringing this up. I think it would still worth to think a bit more to formalize the wants. In particular, there are a few parts of items that are worth considering:

Formalize the Transformation of Buffer to the Pointer(Change of Fn Signature)

The particular transformation we are looking for is actually to transform a function with buffer map by directly passing in its data pointer. Note that most of TVM’s lowering transformations preserves

Consider the constructing code

import tvm                                                                                                                                                                                                  
import tvm.script                                                                                                                                                                                           
from tvm import te                                                                                                                                                                                          
                                                                                                                                                                                                            
def unpacked_example():                                                                                                                                                                                     
    A = te.placeholder((4 , 5), name="A")                                                                                                                                                                   
    B = te.compute(A.shape, lambda *i: A(*i) + 1, name="B")                                                                                                                                                 
    s = te.create_schedule(B.op)                                                                                                                                                                                                                                                                                                                                                                  
    mod = tvm.lower(s, [A, B])                                                                                                                                                                              
    print(tvm.script.asscript(mod))                                                                                                                                                                         
                                                                                                                                                                                 
unpacked_example()                                                                                                                                                                                          

This will give us the following script

@tvm.script.tir
class Module:
    def main(A: ty.handle, B: ty.handle) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        A_1 = tir.match_buffer(A, [4, 5], elem_offset=0, align=128, offset_factor=1)
        B_1 = tir.match_buffer(B, [4, 5], elem_offset=0, align=128, offset_factor=1)
        # body
        for i0, i1 in tir.grid(4, 5):
            B_1.data[((i0*5) + i1)] = (tir.load("float32", A_1.data, ((i0*5) + i1)) + tir.float32(1))

From the data structure’s PoV, the above code only refers to the pointer B_1.data and A_1.data. We can create a pass (say replace signature) that tries to changes the signature of the function from list of buffers(requires DLTensor*) to just the data pointers themselves. This transformation can hold as long as all the referenced variables are covered, and the desired code is lile

@tvm.script.tir
class Module:
    def main(Adata: ty.handle, Bdata: ty.handle) -> None:
        # body
        for i0, i1 in tir.grid(4, 5):
            tir.store(Bdata, tir.load("float32", Adata, ((i0*5) + i1)) + tir.float32(1)), ((i0*5) + i1))

Note then the function can be directly passed to the code generator, which generates the function with signature

int main_func(void* Adata, void* Bdata);

The main point is that we do not have to try to twist the MakePackedAPI to generate another kind of type erased API here. If what we want is the normal C calling convention that passes in the field separately, we should add this pass after lowering to change the expected function signature, then the followup calls would follow naturally (via the normal C function calling convention).

The MakePackedAPI as it is is supposed to preserve the function signature(of taking buffers) regardless of unpacked choices. So transforming the signature should go to another pass.

The Choice of Type-Erased API

My main concern about the current proposal is the introduction of another type erased interface, namely

typedef int32_t(tvm_function_t)(void** inputs, void** outputs, void* resource_handle);`

Given most of the internals can already be readibly handled by the raw C typed version. The only usage of the type-erased function is when it comes to interfaces. In that case, I still beleive that PackedC function is the right choice, as it brings the benefit of standarization and consistency with the overall TVM ecosystem.

Impact of Compiler Optimizations

Finally, it is still useful to think about compiler optimizations and how can they impact the choices in the table. Modern compilers can do a lot of things, making it possibly to get as optimized code as long as we can inline the function correctly.

Let us consider an example code below

#include <cstdio>
#include <tvm/runtime/c_runtime_api.h>

inline int PackedCFunc(void* args, int* type_codes, int num_args,
                       void* out_ret_value, int* out_ret_tcode,
                       void* resource_handle) {
  DLTensor* dlx = (DLTensor*)(((TVMValue*)args)[0].v_handle);
  DLTensor* dly = (DLTensor*)(((TVMValue*)args)[1].v_handle);
  // error check that can be dead-code eliminated
  if (type_codes[0] != kTVMDLTensorHandle) {
    return -1;
  }
  if (type_codes[1] != kTVMDLTensorHandle) {
    return -1;
  }
  if (dlx->shape[0] != 3) {
    return -1;
  }
  if (dlx->shape[1] != 3) {
    return -1;
  }

  if (dly->shape[0] != 3) {
    return -1;
  }
  if (dly->shape[1] != 3) {
    return -1;
  }
  if (dly->dtype.code != kDLFloat) {
    return -1;
  }

  ((float*)dlx->data)[0] =   ((float*)dly->data)[0] + 1;
  return 0;
}

// return y[i] = x[i]  +1
extern "C" int AddViaPackedCFunc(float *x, float* y) {
  TVMValue args[2];
  int type_codes[2];
  TVMValue out_ret_value;
  int out_ret_tcode;
  int64_t shape[2] = {3, 3};

  DLTensor dlx, dly;
  dlx.data = x;
  dlx.ndim = 2;
  dlx.shape = shape;
  dlx.dtype.code = kDLFloat;
  dlx.dtype.bits = 32;
  dlx.dtype.lanes = 1;
  dlx.device.device_type = kDLCPU;
  dlx.device.device_id = 0;
  dlx.strides = nullptr;
  dlx.byte_offset = 0;
  dly = dlx;
  dly.data = y;

  args[0].v_handle = &dlx;
  args[1].v_handle = &dly;

  type_codes[0] = kTVMDLTensorHandle;
  type_codes[1] = kTVMDLTensorHandle;

  // note: check can be dead-code eliminated
  if (PackedCFunc(args, type_codes, 2, &out_ret_value, &out_ret_tcode, nullptr) != 0) {
    printf("error\n");
  }	
  return 0;
}

Run clang

clang-10 -O2 -emit-llvm -S  -I ../../tvm/3rdparty/dlpack/include -I ../../tvm/include -o test.ll test.cc

The result is

; Function Attrs: nounwind uwtable
define dso_local i32 @AddViaPackedCFunc(float* %0, float* %1) local_unnamed_addr #0 {
  %3 = load float, float* %1, align 4, !tbaa !2
  %4 = fadd float %3, 1.000000e+00
  store float %4, float* %0, align 4, !tbaa !2
  ret i32 0
}

Run gcc

gcc -O2 -S  -I ../../tvm/3rdparty/dlpack/include -I ../../tvm/include -o test.s test.cc

Gives the following asm code

	.file	"test.cc"
	.text
	.p2align 4,,15
	.globl	AddViaPackedCFunc
	.type	AddViaPackedCFunc, @function
AddViaPackedCFunc:
.LFB31:
	.cfi_startproc
	movss	.LC0(%rip), %xmm0
	xorl	%eax, %eax
	addss	(%rsi), %xmm0
	movss	%xmm0, (%rdi)
	ret
	.cfi_endproc
.LFE31:
	.size	AddViaPackedCFunc, .-AddViaPackedCFunc
	.section	.rodata.cst4,"aM",@progbits,4
	.align 4
.LC0:
	.long	1065353216
	.ident	"GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
	.section	.note.GNU-stack,"",@progbits

As we can see that even with the same PackedFunc API, as long as we can do proper inlining, allocating DLTensor and other items on stack, the resulting function call can be reduced to the same function as the minimum non-packed version.

Discussions

Considering the importance of a minimum internal, I agree that we could explore an un-packed interface(essentially generating something that is related to C). We should do that in a proper way, by introducing a function signature transformation utility that transforms the function signature from the original DLTensor* to the destructed fields.

However, we should also note that generating the DLTensor on stack and setting up constant correctly might also bring similar effect in a modern compiler.

When it comes to type-erased interface, assuming we only need them at the interface level(not the internals). I think it is useful to keep the CPackedFunc convention, so that we still retain the benefit of additional wraping to expose to the externals and standardization. Again in this case carefully allocating the DLTensor on stack then pass it in plus strong inlining/constant folding could remove the overhead of DLTensor even at the interface level.

hi @Mousius,

thanks for your proposal and your detailed evaluation of its impact on runtime resources! i agree these improvements will simplify the generated code and improve performance. Some thoughts below:

When we packed and unpack values, we make use of the data portion of a DLTensor but nothing else, this leads to a lot of the structure being unused. In embedded systems space can become an absolute premium and this unused structure consumes precious bytes.

One consideration is that removing metadata such as shape will constrain our ability to implement models with more complex runtime requirements, such as dynamic shapes. I think dynamic shapes need significant consideration when implementing in constrained spaces such as µTVM, but it would be great to ensure we don’t add a barrier to PoC with these optimizations. By that here, I just mean that I’d like to ensure we retain a path to keep DLTensor support independent of the other parts of this proposal e.g. API changes, if possible. This may mean we need to e.g. analyze which fields of the DLTensor are used and pass each field as an argument.

These represent not just code sizes but also cycle times and power usage, further to this the stack savings would allow such a model to run under Zephyr on an M0 which by default is allocated only small stack sizes

There are a lot of impacts on various performance numbers here, but my opinion is that the stack impact is outsize compared with the others. I agree we should investigate this on that basis alone.

One issue is that device_type and device_id are checked later and must be bound to pass the invariant checks.

It seems like here, we should hoist those checks up into the AOT function in whichever pass modifies the function signatures. I don’t think we should go away from device_type/device_id at this point–but I would like to explore a way to more accurately represent device_type to the compiler such that multiple different e.g. BYOC or CPU devices can be described in terms familiar to the user.

  • What should the flag be called? --unpack-functions, –tiny, --no-runtime, –micro etc?

This is really only an option with the AOT executor. Maybe --aot-use-typed-signatures or something? I’d also like to address the comments from the AOT PR about pushing executor and runtime options out of Target. If we do this, it may be possible to group this under e.g. aot_executor_opts, and then perhaps another name could make more sense.

@tqchen:

As we can see that even with the same PackedFunc API, as long as we can do proper inlining, allocating DLTensor and other items on stack, the resulting function call can be reduced to the same function as the minimum non-packed version.

It is often a problem in embedded systems that enabling the optimizer can do things like change timing of functions which can result in drastically different behavior of the SoC. In some cases while under development, SoCs may not function with optimization on or vice versa. Developers may not simply be able to choose which optimization level suits them.

Another common case is debugging TVM-generated code, which is often interesting with optimizations off. While you’re certainly right that modern compilers may be able remove unused portions of DLTensor at -O2, relying on this behavior means that a developer’s first task in debugging generated code is to understand why they’re seeing a stack protection fault, and increase the stack size when compiling with -O0. This is not a desirable property for embedded deployment, and it’s a big ask particularly when they are trying to debug generated code at -O0 which is likely in the same file.

The other thing to point out here: this proposal affects the top-level interwork code that links operators together. It is in our interest to make this code as simple and straightforward to understand as possible. This point is so important that we should probably also support a mode where the AOT top-level function is exported in C and the rest in LLVM.

For this reason, I am not in favor of relying on downstream compilers to clean up the stack for us. We should emit code which is straightforward and uncluttered, so that we don’t introduce hidden dependencies on particular compilers which may come to bite us later with attempts to broaden support for µTVM on other devices.

Other discussion topics

This means that by reducing the overall dependencies, such as removing the need for DLPack, we can reduce the amount of foreign code required.

I’m not sure I agree DLPack is a huge overhead in terms of build dependencies here–it’s included in the standalone_crt bundle. Could you clarify?

Thanks for the replies @tqchen and @areusch :smile_cat:, there’s a lot of great discussion to come back to after the long weekend!

Impact of Compiler Optimizations

I think @areusch sums up the arguments for not pushing this onto the compiler, it’s likely existing embedded projects will be using a compiler that works for a given device and environment rather than having access to a more modern compiler. Even if the TVM aspects of the project were compiled in this way, the interface the user is using would still be compiled without it - leaving the overheads in place at the application level.

Impact of DLTensor Removal

Whilst I agree in principal, in practice I wouldn’t expect to see many dynamic models being ran in embedded environments where dynamic allocation is usually discouraged, and where the shapes of data are usually known in advance. That in mind, should there be sufficient space for them to run in this way, they can always turn the option off and incur the penalty for the dynamic behaviour?

I’m considering it more in terms of dependencies you have to move across to include in your project, particularly in terms of raw files; standalone_crt contains a lot of files that I’m unlikely to need in a deployment, I wouldn’t expect to integrate all of them into an embedded system, particularly if I have strict requirements for code being included. My expectation is for the standard output of TVM, you will only need one or two additional headers and only when you want to opt into other features you’ll take pieces of the runtime across. In this case, dlpack means you have at least one additional dependency beyond the minimal case.

The Choice of Type-Erased API

I understand your concerns about introducing a new API here, I’ll try to illustrate my concerns from an application developer point of view; using the current packed function API but instead passing a parameter such as --no-dltensors would result in the following user code:

extern int MyEntrypointPackedCFunc(void* args, int* type_codes, int num_args,
                       void* out_ret_value, int* out_ret_tcode,
                       void* resource_handle);

void main() {
    // Some code
    float input[1] = {1.0};
    float output[1];
    void* inputs_and_outputs = {
        input,
        output
    };
    MyEntrypointPackedCFunc(inputs_and_outputs, 0, 0, NULL, 0, NULL);
}

There’s two things I see useful to discuss here:

  1. From a users point of view, they now have no clear distinction between inputs and outputs, this has weird ergonomics and requires more understanding of TVM (even if wrapped by a TVMExecute(void* args, void* resource_handle) type function to remove the unused arguments). The packed function API does have access to out_ret_value but it is used for single values rather than the packed buffers which are used in operators, so would it feel weird to have the entry function work differently with the same API? Currently we use a wrapper such as void* inputs_and_outputs[size_of_input+size_of_outputs] to improve this but that uses unnecessary stack space that is likely to produce worse output due to the sizes being seen as dynamic by a compiler. So there’s a few options:
    1. Generate out an additional wrapper function to call the packed function
    2. Accept this ergonomically for the compatibility with other aspects of TVM and pass this onto users
    3. Change the packed function API sufficiently to introduce the better ergonomics, potentially just re-using out_ret_value to accept multiple outputs
  2. If a user decides to remove the --no-dltensors parameter then we now have an opportunity for a user to attempt to link this into their application and it most likely just seg faults. By changing the API we at least get a link error, this seems friendlier, so then we have some options:
    1. Accept the risk that misuse of the compiler can cause segfaults
    2. Change the packed function API sufficiently to trigger a friendlier error

Formalize the Transformation of Buffer to the Pointer(Change of Fn Signature)

Totally agree with using a different pass for this than MakePackedAPI, my intention with bundling them together was to re-use it for internal consistency but it’s much neater to split it. Regards the placement of this, I’d be interested in an alternate path rather than unpacking after MakePackedAPI, something similar to:

tvmc --no-dltensors ....
  if (pass_ctx->GetConfig<Bool>("no_dltensors", Bool(false)).value()) {
    mixed_pass_list.push_back(MakeUnpackedAPI());
  } else {
    mixed_pass_list.push_back(MakePackedAPI());
  }

If much of the metadata is available already it makes more sense to skip the MakePackedAPI() call altogether and fix up anything missing in a separate pass.

Parameter Naming

It’s worth noting I think it’s worth getting this right at the user level so we don’t need to change it much when arguments get refactored. For this RFC, it’s fine to have it nested in the target, thinking about how we break it out of the target later is useful though as it can be factored into that rework. Looking at other compilers, it doesn’t feel right to prefix the behaviour with --aot-X but rather document it as a feature that works in certain environments - similar to architecture specific compiler flags?

By “typed signatures” do you mean those with the type information in the DLTensor? I’m not sure the default behaviour of AOT should be to not support DLTensor, potentially --no-typed-signatures to opt out? That would propagate to something like executor_opts in relay.build and beyond.

Is it worth moving the DLTensor addition to a separate pass altogether and introducing two arguments?

  1. --no-typed-signatures - Doesn’t run the pass to put convert pointers into Buffers/DLTensors
  2. --no-packed-arguments - Runs MakeUnpackedAPI rather than MakePackedAPI

This would look something like:

  if (!pass_ctx->GetConfig<Bool>("no_typed_signatures", Bool(false)).value()) {
    mixed_pass_list.push_back(AddBuffersToArguments());
  }
  if (pass_ctx->GetConfig<Bool>("no_packed_arguments", Bool(false)).value()) {
    mixed_pass_list.push_back(MakeUnpackedAPI());
  } else {
    mixed_pass_list.push_back(MakePackedAPI());
  }

It means maintaining more permutations longer term, but it does make more logical sense to decouple the type of the parameters vs how they’re passed.

Thanks @Mousius for the discussion. I would like to further clarify the interface discussion and the general behavior. There are a few items being discussion

Choice of argument encoding:

  • A0: DLTensor ptr argument passing
  • A1: Pass the destructed field passing(pass the data ptr and possible shape value in DLTensor)

Choice of calling convention

  • C0: PackedCFunc calling convention
  • C1: Standard c typed calling convention, int myfunc(void* input_0, void* input_1, void* output_0)
  • C2: A new type-erased convention: e.g.typedef int32_t(tvm_function_t)(void** inputs, void** outputs, void* resource_handle);

Current Approach

The TVM’s standard way, not considering the embedded system uses the following combination:

  • Internal operator code: A0 + C0
  • Entry function to graph execution: A0 + C0

The main reason for A0 + C0 is that this is the convention that is designed compatible with TVM’s FFI ecosystem. Enables direct poking from python, running over RPC etc.

Dicussion for Embedded Settings

Considering the embedded system application scenario. I think we agree that:

  • For internal operators, we could use A1 + C1 to reduce the dependency on compiler optimizations.

The main question of interest is what do we do for interface entry functions. One of the current proposal is to use A1 + C2.

My main concern here is that C2 brings one additional calling convention that also comes with additional restrictions(e.g. if we want to pass in shape to do runtime checking, it is not doable, if in a case where we want to pass in string for some reason to the interface, it is no longer as natural).

To address the concern of confusing users. I think we should stick with minimum amount of combinations, namely either A0 + C0 or A1 + C1. The confusion situation you mentioned is due to the combination of A1+ C0 (which I agree is indeed confusing).

Of course it is still fine to introduce a standard interface like C2 if we want to have an embedded specific executor interface that wraps the underlying code. However, I think it can be confusing to make C2 a target of code generation.

So for the interface itself, if really necessary, we could go with the A0 + C0. We should also consider build (optional) compatible layers that provides A1+C1 capability, so we can poke the embedded runtime through an RPC using python, providing better development experiences

Dependency on DLTensor

I fully understand the concern to minimize dependency. In the meantime, I also want to point out the benefit bought by standardization. DLTensor is a standard that is being widely adopted by frameworks. Reusing a standard data structure at the API level would likely help to grow the ecosystem further without having to worry about API compatibility. e.g. if there is a need to introduce a DLTensor for accelerator device, the same data structure applies.

Given dlpack.h is a standalone C header (200 lines), I feel in this case the overhead of dependency is really minimum when considering the potential benefit of standardization.

In terms of the cost, while I agree that incurring cost at a per operator level is not desirable(thus we could go with A1 + C1 for interal op invocation), i believe the cost incurred at the interface level is OK given we are only doing it once. Additionally, as embedded system compiler evolve I believe such overhead will likely go away due to the reasons mentioned above.

Agreed, let’s set this aside as the preferable option for internal operations :smile_cat:.

Dicussion for Embedded Settings

For interface code A1 is still preferable as the amount of stack space required to create two DLTensor’s is roughly 100 bytes (I had to remove the dynamic allocation to find this value, and I’m cheating by re-using some statically allocate shape and context which are just filling the gaps in the DLTensor). For the Cortex M0 I referenced the default stack size in Zephyr is 640 bytes, so using 100 bytes is significant with no gain.

Regards compilers improving, I’d view this as a slow process which is reliant on new designs being created - augmenting existing applications with ML will require integrating older compilers for the foreseeable future.

I believe you mean we could go for A1 + C1 which would mean the entrypoint function would look something like this:

int32_t run_func(void* input_0, void* input_1, void* output_0, void* resource_handle);

This presents a slightly different issue, which is that C0 or C2 can both do something similar to:

int32_t run_func(void* args, int* type_codes, int num_args, void*ret, void* resource_handle) {
    void* input_0 = args[0];
    void* input_1 = args[1];
    void* output_0 = ret[0];
    void* sid_1 = TVMBackendAllocWorkspace(404);

    operators_with_no_knowledge_of_packing(input_0, input_1, sid_1, output_0);
}

Which means you can have a generic interface to run_func rather than the user having to call it directly:

int32_t TVMExecute(void* run_func, void* inputs, void* outputs, void* resource_handle) {
    return run_func(inputs, NULL, -1, outputs, resource_handle);
}

This allows TVM to put some form of runtime wrapper around it - illustrating the usefulness of the packed function API here over the normal calling style. The other benefit of using A1+C2 that occurs to me is the ability to fairly quickly provide an interface to it using normal packed functions:

int32_t normal_packed_func(void* args, int* type_codes, int num_args, void*ret, void* resource_handle) {
    // unpack some pointers
    smaller_packed_api(args, ret, resource_handle);
}

Of course, A1 + C0 would give a similar result without an extra API but still provide the user with a potential segfault if the options aren’t passed consistently. So my preference is still for A1+? where ? is either C0 or C2 - do you have any thoughts as to what a C2 would look like in this case @tqchen?

Dependency on DLTensor

I don’t disagree with DLTensor being a useful standard to be able to leverage, and we should provide an option to re-enable the DLTensor layer for users with sufficiently powerful devices which will likely function similar to a richer OS. For constrained devices I’d expect for the driver for a device to come from an embedded RTOS which doesn’t use DLTensor’s and the DLTensor values regards the actual device identifiers and selection would be built into the AOT compiler output.

The size of the header isn’t as substantial as the requirement for a user to port it into their application. Where I’m coming from here is that if TVM can produce a small enough number of files that are easy for users to understand, it’ll be easier to integrate into environments where each file must be scrutinised before usage. In such environments I wouldn’t be surprised if users modified the code to remove DLTensor’s themselves.

Thanks @Mousius Given that there is only one entry function, I believe have some form of wrapping is fine.

The main thing that I am hesitant about is to provide C2 as a general first class citizen of code generation for any code due to the restrictions it imposes and one additional ABI that the users need to learn. So it would be good if we provide first class codegen for C1 and C0 for an arbitary function in tvm.

This does not mean that we cannot provide an interface function(e.g. run) that comes with a C2 style API, and provide a quick util to generate that function(say from a C1 interface), and that function only.

Thanks for working through this @tqchen :smiley_cat:.

The most straightforward way of providing the entrypoint is just to ask AOT to generate it in the format required. Thinking about how we expose this to the user, here is a proposed behaviour:

  • Default behaviour of AOT should be A0+C0 to be compatible with the rest of the ecosystem
  • With an optional parameter --no-typed-operators TVM would instead produce A1+C1 internal operators but leave the A0+C0 entrypoint
  • With an optional parameter --micro-entrypoint TVM (in AOT mode) would switch to producing a A1+C2 entrypoint at the top level. This has no effect on Graph execution as it doesn’t have such an entrypoint.

To ensure that the micro entrypoint isn’t exposed as a standard part of the C Runtime, we can put it in a tvm_micro_backend.h within src/runtime/crt as an optional include on top of the existing C Runtime. The use of the tvm_ prefix is to facilitate it being copied into a user application alongside other files at root level.

Thoughts @tqchen/@areusch?

Hello.

This is a very important topic for us at ST Microelectronics as well.

For someone not familiar with the implementation, it’s hard to jump into the details of the proposal. From the high-level, I am having 2 observations:

  • This proposal aims at enabling the AoT to generate a leaner, more optimal code for embedded targets. From this prospective, I would agree that changes to the internal TVM behavior should be minimized. As proposed by @tqchen, a lowering pass seems like the best approach.
  • The tensor information, such as shape, dltype, etc. may be needed by the application, even outside of the dynamic shape considerations. For example, knowing the inputs/outputs shape and type, or params shape and size is already used by our applications. From this prospective, the model should provide this infomation to the application. I agree that the DLTensor carries an overhead (64-bit types, perhaps device id). It seems reasonable that the DLTensor and TVMValue are replaced with something lighter during the final lowering pass.

Hope these observations do not fall off the discussion.

Hi @stoa, thanks for your observations and apologies for not replying sooner.

I’m glad you agree with the initial direction taken and I appreciate that there may be a need to provide similar data to the user to that which is seen in DLTensor, at this stage we’ll get that by providing the option to return to the packed function signature. Further work should be done to ensure we have all the data available for an embedded system; to help motivate that conversation I’ve published a companion RFC - [RFC] [uTVM] Embedded C Runtime Interface, hopefully this helps alleviate the concerns around metadata access and clarifies how this work can be accessed from an embedded system :smile_cat: