Implementing AOT in TVM

tqchen · April 15, 2021, 1:01pm

Thanks @giuseros. To discuss the difference between the following two type-erased interfaces:

X0: Function with typeid

typedef int (*TVMBackendPackedCFunc)(TVMValue* args, int* type_codes, int num_args,
                                     TVMValue* out_ret_value, int* out_ret_tcode,
                                     void* resource_handle);

X1: Function without typeid

typedef int (*TVMBackendCFunc)(void** inputs, void** outputs, void* resource_handle);

Discussions

The main reason we chose X0 over X1 is that X0 gives a safe interface for both static and dynamic languages. Imagine a case where the caller passes in an integer but the callee expects a float. X1 provides no mechanism to detect such a mismatch at runtime, while X0 lets the callee check the type codes (for example, when debug is enabled).
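To make the type-checking point concrete, here is a minimal sketch of an X0-style callee that rejects a mismatched argument. DemoValue, the DEMO_TCODE_* constants, DEMO_CHECK_TYPES and demo_scale are hypothetical stand-ins invented for this illustration, not TVM's actual definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for the 8-byte value slot; not TVM's real union. */
typedef union {
  int64_t v_int64;
  double v_float64;
  void* v_handle;
} DemoValue;

/* Hypothetical type codes; the numeric values are illustrative only. */
enum { DEMO_TCODE_INT = 0, DEMO_TCODE_FLOAT = 2 };

/* Assumed switch for a debug build; set to 0 to drop the check. */
#ifndef DEMO_CHECK_TYPES
#define DEMO_CHECK_TYPES 1
#endif

/* X0-style callee that expects exactly one float argument and doubles it. */
int demo_scale(DemoValue* args, int* type_codes, int num_args,
               DemoValue* out_ret_value, int* out_ret_tcode,
               void* resource_handle) {
  (void)resource_handle;
#if DEMO_CHECK_TYPES
  /* The type codes let the callee detect a caller that passed an int. */
  if (num_args != 1 || type_codes[0] != DEMO_TCODE_FLOAT) {
    return -1;
  }
#endif
  out_ret_value->v_float64 = args[0].v_float64 * 2.0;
  *out_ret_tcode = DEMO_TCODE_FLOAT;
  return 0;
}
```

With the check enabled, a call that tags its argument as DEMO_TCODE_INT returns -1 instead of silently reinterpreting the bits. An X1-style callee receives only void** and has nothing comparable to inspect.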

X0 is also a complete packed representation, in the sense that a function defined in X0 can be directly exposed as a PackedFunc without any additional wrapping. This gives the benefit of, say, debugging on the host first before running on the embedded device, and invoking through RPC or Python. A function exposed in X1 would require manual rewrapping, which defeats the purpose of type erasure.

Overhead of X0 over X1

Making a function call in the X1 convention also requires stack allocations (for the arrays of inputs and outputs). Without considering any compiler optimization, let us count the bytes on a 32-bit system for a function call with n arguments and one output. A call in the form of X0 would cost 8 * n + 4 * n + 4 + 8 + 4 + 4 = 12 * n + 20 bytes of space, while a call in the form of X1 would cost 4 * n + 4 * n + 4 = 8 * n + 4 bytes of space. Say n = 3 (a typical number); then X0 would cost 56 bytes, while X1 would cost 28 bytes.
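The byte counts can be reproduced with a small helper. This is only a sketch: it assumes the 32-bit sizes used in the paragraph above (8-byte value slot, 4-byte type code, 4-byte int and pointer), and x0_call_bytes and x1_call_bytes are hypothetical names introduced here.

```c
#include <stddef.h>

/* Sizes assumed for a 32-bit target, matching the terms in the post. */
enum { VALUE_BYTES = 8, TCODE_BYTES = 4, INT_BYTES = 4, PTR_BYTES = 4 };

/* X0: n value slots + n type codes + num_args
   + return value + return type code + resource handle. */
size_t x0_call_bytes(size_t n) {
  return VALUE_BYTES * n + TCODE_BYTES * n + INT_BYTES
       + VALUE_BYTES + TCODE_BYTES + PTR_BYTES;
}

/* X1: the terms exactly as listed in the post, 4 * n + 4 * n + 4. */
size_t x1_call_bytes(size_t n) {
  return PTR_BYTES * n + PTR_BYTES * n + PTR_BYTES;
}
```

Both functions can be evaluated at any n to compare the two conventions. Changing the constants gives the corresponding figures for a 64-bit target.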

The memory overhead of the function call is negligible compared to the follow-up memory operations on NDArrays (which normally contain KBs or more of memory).

Considering Compiler Optimizations

Additionally, the estimate above considers no compiler optimization. Let us think about what happens when the compiler inlines the call. In such cases the function call becomes loads and stores into stack memory.

With a typical mem2reg pass, the stack space can be promoted to registers. If the callee code (operator) is compiled to not read the typeid in release mode, then the assignment to the typeid becomes dead code and will be eliminated by the compiler. Similarly, the argument passing could become direct argument passing in this case. Considering these compiler optimizations, both X0 and X1 allow optimizations that lead to code performing similarly to the final direct call form.
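The following sketch shows the shape of code this argument applies to. All names (DemoValue, DEMO_TCODE_HANDLE, demo_add_packed, demo_call_packed, demo_call_direct) are hypothetical. Whether a given compiler actually reduces the packed call to the direct form depends on the toolchain and flags, so treat it as an illustration rather than a guarantee.

```c
#include <stdint.h>

/* Hypothetical stand-in for the 8-byte value slot. */
typedef union {
  int64_t v_int64;
  double v_float64;
  void* v_handle;
} DemoValue;

enum { DEMO_TCODE_HANDLE = 3 }; /* illustrative value only */

/* X0-style operator that, as in a release build, never reads type_codes. */
static inline int demo_add_packed(DemoValue* args, int* type_codes, int num_args,
                                  DemoValue* out_ret_value, int* out_ret_tcode,
                                  void* resource_handle) {
  const float* a = (const float*)args[0].v_handle;
  const float* b = (const float*)args[1].v_handle;
  float* c = (float*)args[2].v_handle;
  (void)type_codes; (void)num_args;
  (void)out_ret_value; (void)out_ret_tcode; (void)resource_handle;
  c[0] = a[0] + b[0];
  return 0;
}

/* Caller: packs the arguments into stack arrays, then calls the operator.
   Once the call is inlined, the type_codes stores have no reader. */
int demo_call_packed(float* a, float* b, float* c) {
  DemoValue args[3];
  int type_codes[3];
  DemoValue ret;
  int ret_tcode = 0;
  args[0].v_handle = a; type_codes[0] = DEMO_TCODE_HANDLE;
  args[1].v_handle = b; type_codes[1] = DEMO_TCODE_HANDLE;
  args[2].v_handle = c; type_codes[2] = DEMO_TCODE_HANDLE;
  return demo_add_packed(args, type_codes, 3, &ret, &ret_tcode, (void*)0);
}

/* The direct call form that the optimized packed call should resemble. */
int demo_call_direct(const float* a, const float* b, float* c) {
  c[0] = a[0] + b[0];
  return 0;
}
```

Comparing the generated assembly of demo_call_packed and demo_call_direct at a high optimization level is one way to check how much of the packing survives on a particular compiler.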

Back to the topic of int64: note that most of our operator calls only use void* arguments and not int64. The cost of int64 is mainly the memory overhead of passing the argument rather than an ALU concern. Both the caller and the callee are free to convert to int32 after the passing, and to assign int32 fields during the passing (via an int32[2] array, assuming little endian). Again, considering the compiler optimizations above, this could turn out to be a no-op.
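As a rough illustration of the int32[2] idea, the sketch below writes and reads an 8-byte slot through 32-bit halves. DemoSlot and the demo_* helpers are hypothetical names, and the layout assumption (low word at index 0) only holds on little-endian targets.

```c
#include <stdint.h>

/* Hypothetical 8-byte argument slot with an extra int32[2] view. */
typedef union {
  int64_t v_int64;
  int32_t v_int32[2];
  void* v_handle;
} DemoSlot;

/* Runtime check of the byte order this sketch relies on. */
int demo_is_little_endian(void) {
  const uint16_t probe = 1;
  return *(const unsigned char*)&probe == 1;
}

/* Caller side: fill the slot using only 32-bit stores.
   Index 0 is the low word on little-endian targets (assumed). */
void demo_pack_int32(DemoSlot* slot, int32_t x) {
  slot->v_int32[0] = x;
  slot->v_int32[1] = (x < 0) ? -1 : 0; /* sign extension of the high word */
}

/* Callee side: read back only the low 32 bits of the slot. */
int32_t demo_unpack_int32(const DemoSlot* slot) {
  return slot->v_int32[0];
}
```

On a little-endian target a slot filled by demo_pack_int32 reads back through v_int64 as the same value, so either side can stay in 32-bit arithmetic without changing the calling convention.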

Even in the absence of compiler optimizations, the general overhead incurred by X0 is not much larger than that of X1, and it would be great to run some measurements on workloads of interest to see the difference.