Thanks @Mousius for bringing this up. I think it would still worth to think a bit more to formalize the wants. In particular, there are a few parts of items that are worth considering:
Formalize the Transformation of Buffer to the Pointer(Change of Fn Signature)
The particular transformation we are looking for is actually to transform a function with buffer map by directly passing in its data pointer. Note that most of TVM’s lowering transformations preserves
Consider the constructing code
import tvm
import tvm.script
from tvm import te
def unpacked_example():
A = te.placeholder((4 , 5), name="A")
B = te.compute(A.shape, lambda *i: A(*i) + 1, name="B")
s = te.create_schedule(B.op)
mod = tvm.lower(s, [A, B])
print(tvm.script.asscript(mod))
unpacked_example()
This will give us the following script
@tvm.script.tir
class Module:
def main(A: ty.handle, B: ty.handle) -> None:
# function attr dict
tir.func_attr({"global_symbol": "main", "tir.noalias": True})
A_1 = tir.match_buffer(A, [4, 5], elem_offset=0, align=128, offset_factor=1)
B_1 = tir.match_buffer(B, [4, 5], elem_offset=0, align=128, offset_factor=1)
# body
for i0, i1 in tir.grid(4, 5):
B_1.data[((i0*5) + i1)] = (tir.load("float32", A_1.data, ((i0*5) + i1)) + tir.float32(1))
From the data structure’s PoV, the above code only refers to the pointer B_1.data
and A_1.data
. We can create a pass (say replace signature) that tries to changes the signature of the function from list
of buffers(requires DLTensor*) to just the data pointers themselves. This transformation
can hold as long as all the referenced variables are covered, and the desired code is lile
@tvm.script.tir
class Module:
def main(Adata: ty.handle, Bdata: ty.handle) -> None:
# body
for i0, i1 in tir.grid(4, 5):
tir.store(Bdata, tir.load("float32", Adata, ((i0*5) + i1)) + tir.float32(1)), ((i0*5) + i1))
Note then the function can be directly passed to the code generator, which generates the function with signature
int main_func(void* Adata, void* Bdata);
The main point is that we do not have to try to twist the MakePackedAPI to generate another kind of type erased API here. If what we want is the normal C calling convention that passes in the field separately, we should add this pass after lowering to change the expected function signature, then the followup calls would follow naturally (via the normal C function calling convention).
The MakePackedAPI as it is is supposed to preserve the function signature(of taking buffers) regardless of unpacked choices. So transforming the signature should go to another pass.
The Choice of Type-Erased API
My main concern about the current proposal is the introduction of another type erased interface, namely
typedef int32_t(tvm_function_t)(void** inputs, void** outputs, void* resource_handle);`
Given most of the internals can already be readibly handled by the raw C typed version. The only usage of the type-erased function is when it comes to interfaces. In that case, I still beleive that PackedC function is the right choice, as it brings the benefit of standarization and consistency with the overall TVM ecosystem.
Impact of Compiler Optimizations
Finally, it is still useful to think about compiler optimizations and how can they impact the choices in the table. Modern compilers can do a lot of things, making it possibly to get as optimized code as long as we can inline the function correctly.
Let us consider an example code below
#include <cstdio>
#include <tvm/runtime/c_runtime_api.h>
inline int PackedCFunc(void* args, int* type_codes, int num_args,
void* out_ret_value, int* out_ret_tcode,
void* resource_handle) {
DLTensor* dlx = (DLTensor*)(((TVMValue*)args)[0].v_handle);
DLTensor* dly = (DLTensor*)(((TVMValue*)args)[1].v_handle);
// error check that can be dead-code eliminated
if (type_codes[0] != kTVMDLTensorHandle) {
return -1;
}
if (type_codes[1] != kTVMDLTensorHandle) {
return -1;
}
if (dlx->shape[0] != 3) {
return -1;
}
if (dlx->shape[1] != 3) {
return -1;
}
if (dly->shape[0] != 3) {
return -1;
}
if (dly->shape[1] != 3) {
return -1;
}
if (dly->dtype.code != kDLFloat) {
return -1;
}
((float*)dlx->data)[0] = ((float*)dly->data)[0] + 1;
return 0;
}
// return y[i] = x[i] +1
extern "C" int AddViaPackedCFunc(float *x, float* y) {
TVMValue args[2];
int type_codes[2];
TVMValue out_ret_value;
int out_ret_tcode;
int64_t shape[2] = {3, 3};
DLTensor dlx, dly;
dlx.data = x;
dlx.ndim = 2;
dlx.shape = shape;
dlx.dtype.code = kDLFloat;
dlx.dtype.bits = 32;
dlx.dtype.lanes = 1;
dlx.device.device_type = kDLCPU;
dlx.device.device_id = 0;
dlx.strides = nullptr;
dlx.byte_offset = 0;
dly = dlx;
dly.data = y;
args[0].v_handle = &dlx;
args[1].v_handle = &dly;
type_codes[0] = kTVMDLTensorHandle;
type_codes[1] = kTVMDLTensorHandle;
// note: check can be dead-code eliminated
if (PackedCFunc(args, type_codes, 2, &out_ret_value, &out_ret_tcode, nullptr) != 0) {
printf("error\n");
}
return 0;
}
Run clang
clang-10 -O2 -emit-llvm -S -I ../../tvm/3rdparty/dlpack/include -I ../../tvm/include -o test.ll test.cc
The result is
; Function Attrs: nounwind uwtable
define dso_local i32 @AddViaPackedCFunc(float* %0, float* %1) local_unnamed_addr #0 {
%3 = load float, float* %1, align 4, !tbaa !2
%4 = fadd float %3, 1.000000e+00
store float %4, float* %0, align 4, !tbaa !2
ret i32 0
}
Run gcc
gcc -O2 -S -I ../../tvm/3rdparty/dlpack/include -I ../../tvm/include -o test.s test.cc
Gives the following asm code
.file "test.cc"
.text
.p2align 4,,15
.globl AddViaPackedCFunc
.type AddViaPackedCFunc, @function
AddViaPackedCFunc:
.LFB31:
.cfi_startproc
movss .LC0(%rip), %xmm0
xorl %eax, %eax
addss (%rsi), %xmm0
movss %xmm0, (%rdi)
ret
.cfi_endproc
.LFE31:
.size AddViaPackedCFunc, .-AddViaPackedCFunc
.section .rodata.cst4,"aM",@progbits,4
.align 4
.LC0:
.long 1065353216
.ident "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
.section .note.GNU-stack,"",@progbits
As we can see that even with the same PackedFunc API, as long as we can do proper inlining, allocating DLTensor and other items on stack, the resulting function call can be reduced to the same function as the minimum non-packed version.
Discussions
Considering the importance of a minimum internal, I agree that we could explore an un-packed interface(essentially generating something that is related to C). We should do that in a proper way, by introducing a function signature transformation utility that transforms the function signature from the original DLTensor* to the destructed fields.
However, we should also note that generating the DLTensor on stack and setting up constant correctly might also bring similar effect in a modern compiler.
When it comes to type-erased interface, assuming we only need them at the interface level(not the internals). I think it is useful to keep the CPackedFunc convention, so that we still retain the benefit of additional wraping to expose to the externals and standardization. Again in this case carefully allocating the DLTensor on stack then pass it in plus strong inlining/constant folding could remove the overhead of DLTensor even at the interface level.