@tqchen @junrushao @comaniac @jroesch @mousius @manupa-arm @csullivan @mbs-octoml @kparzysz
Summary
This RFC describes a Module-based Model Runtime interface for the Ahead-of-Time Executor, thereby enabling its use from the TVM C++ Runtime.
Motivation
The microTVM project has made significant progress towards an Ahead-of-Time Executor for compiled Relay models. At the time of writing, it’s now possible to codegen a TIR function which executes Relay models that have known shapes, don’t have graph-level control flow, and execute only on the CPU device. Right now, the C runtime is the only such runtime environment which can interact with this generated code. However, significant interest exists in enabling the C++ runtime to use the Ahead-of-Time executor.
Guide-level explanation
Users select the AOT executor at compile time through the traditional GraphExecutor compilation flow (e.g. tvm.relay.build
) by including --executor=aot
in the Target [1]. The return value of tvm.relay.build
in this case is an AotExecutorFactory
Module object. Users instantiate the AOT executor via AotExecutorFactory
as they do with GraphExecutor
:
ir_mod = tvm.parser.fromtext("""\
#[version = "0.0.5"]
def @main(%a : Tensor[(1, 2), uint8], %b : Tensor[(1, 2), uint8]) {
%0 = %a + %b;
%0
}"""
)
with PassConfig(opt_level=3):
factory : AotExecutorFactory = tvm.relay.build(
ir_mod, "llvm -executor=aot", module_name="my_mod")
aot_executor : AotExecutor = factory["my_mod"](tvm.cpu(0))
AotExecutor
supports the traditional Module-Based Model Runtime Interface and can be used as a user normally would GraphExecutor
:
aot_executor.set_input("a", tvm.nd.array(np.ndarray([1, 2], dtype="uint8")))
aot_executor.set_input("b", tvm.nd.array(np.ndarray([3, 5], dtype="uint8")))
aot_exec.run()
output = aot_exec.get_output(0)
assert output.asnumpy() == np.ndarray([5, 7], dtype="uint8")
[1] NOTE: The target string is not the final place this customization should be made. However, it’s been the place where we’ve been putting runtime-related stuff. A separate RFC will split the Target string into Target options (which affect tuning) and runtime options.
Reference-level explanation
Already committed to TVM is the AotExecutorCodegen. This module produces a TIR top-level function which invokes the Relay operators (implemented in TIR) in a correct order. An example is given below:
PrimFunc([input1, input2, output]) attrs={"global_symbol": "tvmgen_my_mod_run_model", "runner_function": (bool)1} {
// attr [(nullptr)] device_id = 0
// attr [(nullptr)] device_type = 1
tir.tvm_call_packed("tvmgen_my_mod_fused_add", input1, input2, output)
}
The AotExecutor then needs to accomplish the following to meet Module-based Model Runtime Interface:
- Allocate input and output tensors as defined in the
run_model
function using the correct Device API - Provide a mapping from relay parameter name to positional argument
- Invoke the generated TIR function and provide profiling.
Compiler ↔ Runtime Metadata
In order to implement (1) and (2) above, additional metadata about the run_model
function needs to be communicated from Compiler to Runtime:
- The mapping between Relay parameter name and TIR argument position
- The number of inputs and outputs
- The type of each parameter
- Information sufficient to choose a Device API to allocate memory for that data.
At present, Metadata is passed from Compiler to Runtime in several different ways:
-
Constant DLTensor can be bundled with code and supplied to
runtime::Module
viaruntime::MetadataModule
-
Many non-DSO-exportable backends (
cuda
,hexagon
,metal
,opencl
,sdaccel
,rocm
,vulkan
) have adopted the convention of including aruntime::FunctionInfo
(NOTE: distinct fromtvm::relay::transform::FunctionInfo
) in their serialization:/*! \brief function information needed by device */ struct FunctionInfo { std::string name; std::vector<DLDataType> arg_types; std::vector<std::string> launch_param_tags; }
-
AotExecutorCodegen and GraphExecutorCodegen have adopted the practice of producing the graph-level
runtime::MetadataNode
:/*! * \brief Structure that can be optionally used by the executor codegen */ class MetadataNode : public Object { public: /*! \brief input information for the main function */ Array<String> inputs; /*! \brief number of outputs of the main function */ int num_outputs = 1; /*! \brief the executor to be used to run the model */ String executor = kTvmExecutorGraph; String mod_name = ""; }
-
The recent AOTExecutor implementation has created
tvm::relay::transform::FunctionInfo
which communicates statistics about memory usage and I/O operation for each TIR operator and aggregate statistics for the top-level AOT function:struct FunctionInfoNode : public Object { Map<Target, Integer> workspace_sizes; Map<Target, Integer> io_sizes; Map<Target, Integer> constant_sizes; Map<Target, tir::PrimFunc> tir_primfuncs; Map<Target, Function> relay_primfuncs; }
Some duplication of information is already present. Likely this is due in part to the existing middle-end compiler design, in which a separate IRModule
is produced for each backend. Another factor may be: since runtime::Module
are responsible for their own serialization, and passing Node
across PackedFunc
requires a cast, the lack of a centralized facility for runtime::Modules
to obtain module-level Metadata has led backend authors to roll their own. This pattern means that it’s very difficult to assess the full scope of metadata handed to the runtime, particularly across all backends.
Work is currently ongoing to unify the pre-codegen IRModule
into a single instance. After this work is completed, it will be much easier to produce a centralized module-level Metadata. This RFC argues for the expansion of runtime::MetadataNode
in the following ways:
-
Rename
runtime::MetadataModule
toruntime::ConstLoaderModule
to disambiguate the two and make its purpose in life clearer. -
Expand
input_args
in the existingruntime::Metadata
to parity withruntime::FunctionInfo
, plus include_sizes
fromtvm::relay::transform::FunctionInfoNode
and the requiredshape
anddtype
information from the beginning of this section. -
Introduce
ModelMetadataModule
to contain this information for use with the C++ runtime.class ModelMetadataModule { virtual GetFunction(const std::string& name, ObjectPtr<Object>& sptr_to_self) { if (name == "get_model_metadata") { return PackedFunc([](TVMArgs args, TVMRetValue* rv) { *rv = ModelMetadata(metadata_); }); } else { return PackedFunc(); } } const struct ModelMetadata* metadata_; };
-
Introduce an optional implementation for the C runtime.
-
Export runtime::Metadata to Model Library Format.
The new proposed definition of runtime::Metadata
is as follows. NOTE that this is a C definition because it will be made available both the C and C++ runtimes. A C++ wrapper will be written.
struct ParameterInfo {
const char* relay_name_hint;
const char* tir_name_hint;
int64_t* shape;
int64_t ndim;
DLDataType dtype;
TargetDevice target_device; // NOTE: future addition; not covered in this RFC.
};
struct FunctionInfo {
const char* function_name;
struct ParameterInfo* params;
int num_inputs;
int num_outputs;
int64_t workspace_size_bytes;
int64_t io_size_bytes;
int64_t constant_size_bytes;
};
typedef struct Metadata {
int version;
struct FunctionInfo* functions;
const char* module_name;
};
Internal workings of AotExecutor (--runtime=c++ --interface-api=packed
)
Given the above, we can now sketch out the way AotExecutor should behave (for C++ runtime).
Module initialization will:
- Load the
ModelMetadata
usingget_model_metadata
PackedFunc. - Allocate space for the parameters to
tvmgen_<model_name>_run_model
. - Lookup and load any linked parameters using the
--link-params
mechanism.
-
set_input
,get_input
,get_output
all work as they do inGraphExecutor
. -
run
assemblesTVMArgs
containing inputs + outputs and invokestvmgen_<model_name>_run_model
. -
time_evaluator
is implemented in the same way as it is inGraphExecutor
. Timingrun_model
is done using the CPU timer.
Internal workings of AotExecutor (--runtime=c --interface-api=packed
)
The C runtime version works in a very similar way with C accessor functions for the ModelMetadata
.
No AotExecutor implementation planned (--runtime=c --interface-api=c
)
When -interface-api=c
is present in the Target string, the run_model
function no longer accepts the PackedFunc interface and instead accepts arg_values
directly as positional args:
TVM_DLL int32_t tvmgen_default_run_model(void* arg0, void* arg1, void* arg2) {
void* input = arg0;
void* input1 = arg1;
void* output = arg2;
(void)tvmgen_default_fused_multiply(input, input1, output);
return 0;
}
Additional work is underway to wrap this in a firmware-friendly interface. A core design goal of this interface is to offload all memory management tasks to the calling code to facilitate integration with bare-metal embedded devices.
Therefore, it would go against the goals of the C interface to introduce a generic runtime wrapper compatible with PackedFunc calling convention. It may be possible to do so in the future, but it would be great to motivate such an implementation with rationale more related to the embedded runtime setting.
Operator Calling Convention
TVM uses 3 internal calling conventions:
-
call_packed
- the traditional calling convention used in the C++ runtime -
call_cpacked
- similar tocall_packed
, but TVM presumes a symbol is linked into the binary containing that function name (e.g.TVMBackendGetFuncFromEnv
is not used to lookup the PackedFunc) -
unpacked
- used with microTVM to avoid overhead of PackedFunc calls in statically-linked binaries. See AOT optimisations for Embedded Targets RFC.
The AOT run_func
can use a different calling convention externally (e.g. --interface-api
) than that used internally with Implemented Operators (--unpacked-args
). However, there are some circumstances under which not all choices can be used:
- When targeting the C++ runtime:
call_packed
must be used when non-DSO-exportable modules exist; otherwisecall_cpacked
may be used.unpacked
may not be used with AOT Executor as the interface has not settled. - When targeting the C runtime: any calling convention may be selected for either the interface API or the operator calling convention. However, when using
--interface-api=c
(e.g.unpacked
run_func
calling convention), you must also use theunpacked
calling convention with Implemented Operators.
Drawbacks
Why should we not do this?
- This requires quite a bit of rework of the Metadata-passing mechanism, with potential for breakage.
- It also introduces yet another Executor to the runtime to maintain.
- It may introduce additional constraints on the
<C-runtime, C-interface>
implementation, which may make it more difficult to make progress on microTVM.
Rationale and alternatives
- Why is this design the best in the space of possible designs?
- What other designs have been considered and what is the rationale for not choosing them?
- What is the impact of not doing this?
This RFC doesn’t address the question of “why add an AOT executor?” The RFC which added it in the first place is a better location to look for rationale to motivate that. In general, not following through with this RFC would relegate the AOT executor to a C-runtime-only component. There is significant interest in AOT from C++ runtime users, and maintaining compatibility with both increases the chances that AOT executor will support all TVM runtime features.
The controversial pieces of this RFC addressed are as follows:
Should we maintain a unified approach to code-generating the AOT executor?
An alternative approach could introduce an additional e.g. aot_cpp_executor_codegen.cc
and create a third pathway (in the Graph/AOT build flow). Doing this allows us to implement runtime-specific compiler primitives, which may simplify both pipelines. However, soon those pipelines will grow more complicated as features are added to leverage AOT, such as Unified Static Memory Planning. The burden of double-maintenance of those features outweighs the advantage of a simplified implementation. It also makes it easier for newcomers to understand the compiler.
Should we attempt to unify the Metadata?
Metadata could be left in the scattered form it is now. It may be that the implementation of this RFC prioritizes expansion of ModelMetadata
over propagating it to the various non-DSO-exportable runtime::Module
. Ultimately though, maintaining separate function-level metadata adds confusion and code bloat. It also makes it harder to reason about the compiler as a whole. For these reasons, this RFC advocates for centralizing the Metadata.
Prior art
There is no known prior art of a C+±runtime-compatible AOT implementation.
Unresolved questions
- Who will we break if we unify Model metadata?
- Will this play nicely with the VM compilation flow when it is unified?
- How will TargetDevice come in to play here?
Future possibilities
Not covered in this RFC, but particularly useful with the C++ runtime, is heterogenous execution. In the present PoC (forthcoming), AotExecutor will CHECK-fail if a non-cpu device is given. A future implementation will annotate the parameters with one of:
- A
device_type
— in which case mapping fromdevice_type
totvm::Device
will be done in the same way as theGraphExecutor
- A
target_device
— in which case a new mapping will be defined
Aside from that, the larger unresolved bit which makes it difficult to add heterogenous execution is:
How should AOT codegen invoke the Device API?
Before this question can be answered, some progress needs to be made on the C device API and we need to define TIR bindings.