[RFC] [uTVM] AOT optimisations for Embedded Targets

tqchen · April 29, 2021, 9:01pm

Thanks @Mousius for bringing this up. I think it would still worth to think a bit more to formalize the wants. In particular, there are a few parts of items that are worth considering:

Formalize the Transformation of Buffer to the Pointer(Change of Fn Signature)

The particular transformation we are looking for is actually to transform a function with buffer map by directly passing in its data pointer. Note that most of TVM’s lowering transformations preserves

Consider the constructing code

import tvm                                                                                                                                                                                                  
import tvm.script                                                                                                                                                                                           
from tvm import te                                                                                                                                                                                          
                                                                                                                                                                                                            
def unpacked_example():                                                                                                                                                                                     
    A = te.placeholder((4 , 5), name="A")                                                                                                                                                                   
    B = te.compute(A.shape, lambda *i: A(*i) + 1, name="B")                                                                                                                                                 
    s = te.create_schedule(B.op)                                                                                                                                                                                                                                                                                                                                                                  
    mod = tvm.lower(s, [A, B])                                                                                                                                                                              
    print(tvm.script.asscript(mod))                                                                                                                                                                         
                                                                                                                                                                                 
unpacked_example()

This will give us the following script

@tvm.script.tir
class Module:
    def main(A: ty.handle, B: ty.handle) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        A_1 = tir.match_buffer(A, [4, 5], elem_offset=0, align=128, offset_factor=1)
        B_1 = tir.match_buffer(B, [4, 5], elem_offset=0, align=128, offset_factor=1)
        # body
        for i0, i1 in tir.grid(4, 5):
            B_1.data[((i0*5) + i1)] = (tir.load("float32", A_1.data, ((i0*5) + i1)) + tir.float32(1))

From the data structure’s PoV, the above code only refers to the pointer B_1.data and A_1.data. We can create a pass (say replace signature) that tries to changes the signature of the function from list of buffers(requires DLTensor*) to just the data pointers themselves. This transformation can hold as long as all the referenced variables are covered, and the desired code is lile

@tvm.script.tir
class Module:
    def main(Adata: ty.handle, Bdata: ty.handle) -> None:
        # body
        for i0, i1 in tir.grid(4, 5):
            tir.store(Bdata, tir.load("float32", Adata, ((i0*5) + i1)) + tir.float32(1)), ((i0*5) + i1))

Note then the function can be directly passed to the code generator, which generates the function with signature

int main_func(void* Adata, void* Bdata);

The main point is that we do not have to try to twist the MakePackedAPI to generate another kind of type erased API here. If what we want is the normal C calling convention that passes in the field separately, we should add this pass after lowering to change the expected function signature, then the followup calls would follow naturally (via the normal C function calling convention).

The MakePackedAPI as it is is supposed to preserve the function signature(of taking buffers) regardless of unpacked choices. So transforming the signature should go to another pass.

The Choice of Type-Erased API

My main concern about the current proposal is the introduction of another type erased interface, namely

typedef int32_t(tvm_function_t)(void** inputs, void** outputs, void* resource_handle);`

Given most of the internals can already be readibly handled by the raw C typed version. The only usage of the type-erased function is when it comes to interfaces. In that case, I still beleive that PackedC function is the right choice, as it brings the benefit of standarization and consistency with the overall TVM ecosystem.

Impact of Compiler Optimizations

Finally, it is still useful to think about compiler optimizations and how can they impact the choices in the table. Modern compilers can do a lot of things, making it possibly to get as optimized code as long as we can inline the function correctly.

Let us consider an example code below

#include <cstdio>
#include <tvm/runtime/c_runtime_api.h>

inline int PackedCFunc(void* args, int* type_codes, int num_args,
                       void* out_ret_value, int* out_ret_tcode,
                       void* resource_handle) {
  DLTensor* dlx = (DLTensor*)(((TVMValue*)args)[0].v_handle);
  DLTensor* dly = (DLTensor*)(((TVMValue*)args)[1].v_handle);
  // error check that can be dead-code eliminated
  if (type_codes[0] != kTVMDLTensorHandle) {
    return -1;
  }
  if (type_codes[1] != kTVMDLTensorHandle) {
    return -1;
  }
  if (dlx->shape[0] != 3) {
    return -1;
  }
  if (dlx->shape[1] != 3) {
    return -1;
  }

  if (dly->shape[0] != 3) {
    return -1;
  }
  if (dly->shape[1] != 3) {
    return -1;
  }
  if (dly->dtype.code != kDLFloat) {
    return -1;
  }

  ((float*)dlx->data)[0] =   ((float*)dly->data)[0] + 1;
  return 0;
}

// return y[i] = x[i]  +1
extern "C" int AddViaPackedCFunc(float *x, float* y) {
  TVMValue args[2];
  int type_codes[2];
  TVMValue out_ret_value;
  int out_ret_tcode;
  int64_t shape[2] = {3, 3};

  DLTensor dlx, dly;
  dlx.data = x;
  dlx.ndim = 2;
  dlx.shape = shape;
  dlx.dtype.code = kDLFloat;
  dlx.dtype.bits = 32;
  dlx.dtype.lanes = 1;
  dlx.device.device_type = kDLCPU;
  dlx.device.device_id = 0;
  dlx.strides = nullptr;
  dlx.byte_offset = 0;
  dly = dlx;
  dly.data = y;

  args[0].v_handle = &dlx;
  args[1].v_handle = &dly;

  type_codes[0] = kTVMDLTensorHandle;
  type_codes[1] = kTVMDLTensorHandle;

  // note: check can be dead-code eliminated
  if (PackedCFunc(args, type_codes, 2, &out_ret_value, &out_ret_tcode, nullptr) != 0) {
    printf("error\n");
  }	
  return 0;
}

Run clang

clang-10 -O2 -emit-llvm -S  -I ../../tvm/3rdparty/dlpack/include -I ../../tvm/include -o test.ll test.cc

The result is

; Function Attrs: nounwind uwtable
define dso_local i32 @AddViaPackedCFunc(float* %0, float* %1) local_unnamed_addr #0 {
  %3 = load float, float* %1, align 4, !tbaa !2
  %4 = fadd float %3, 1.000000e+00
  store float %4, float* %0, align 4, !tbaa !2
  ret i32 0
}

Run gcc

gcc -O2 -S  -I ../../tvm/3rdparty/dlpack/include -I ../../tvm/include -o test.s test.cc

Gives the following asm code

	.file	"test.cc"
	.text
	.p2align 4,,15
	.globl	AddViaPackedCFunc
	.type	AddViaPackedCFunc, @function
AddViaPackedCFunc:
.LFB31:
	.cfi_startproc
	movss	.LC0(%rip), %xmm0
	xorl	%eax, %eax
	addss	(%rsi), %xmm0
	movss	%xmm0, (%rdi)
	ret
	.cfi_endproc
.LFE31:
	.size	AddViaPackedCFunc, .-AddViaPackedCFunc
	.section	.rodata.cst4,"aM",@progbits,4
	.align 4
.LC0:
	.long	1065353216
	.ident	"GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
	.section	.note.GNU-stack,"",@progbits

As we can see that even with the same PackedFunc API, as long as we can do proper inlining, allocating DLTensor and other items on stack, the resulting function call can be reduced to the same function as the minimum non-packed version.

Discussions

Considering the importance of a minimum internal, I agree that we could explore an un-packed interface(essentially generating something that is related to C). We should do that in a proper way, by introducing a function signature transformation utility that transforms the function signature from the original DLTensor* to the destructed fields.

However, we should also note that generating the DLTensor on stack and setting up constant correctly might also bring similar effect in a modern compiler.

When it comes to type-erased interface, assuming we only need them at the interface level(not the internals). I think it is useful to keep the CPackedFunc convention, so that we still retain the benefit of additional wraping to expose to the externals and standardization. Again in this case carefully allocating the DLTensor on stack then pass it in plus strong inlining/constant folding could remove the overhead of DLTensor even at the interface level.