Posting this pre-RFC for comments. For context, it was written a few months back, so apologies if things are slightly out of date.
## Background

TVM's high-level build function `tvm.relay.build` currently returns a `runtime::Module` instance meant to be "ready-to-run." This means that `tvm.relay.build` (and its backends) is responsible for code generation, code serialization, and code loading. As these processes become more and more complex, `runtime::Module` must be overloaded to contain:

- Generated kernels to implement TIR tasks
- Metadata useful when loading and running the models (e.g. the type of accelerator to target)
- When a `Module` contains source code:
  - Source code for any downstream compilers
  - Metadata to configure any downstream compilers, e.g. `gcc`, `cuda`, etc.

In some cases, things are simple enough that we haven't observed many ill effects from this overload. In extreme cases, such as when targeting exotic runtime scenarios (e.g. microTVM), this overloaded structure makes it difficult to expand the compiler while staying within the "ready-to-run" output expected of `tvm.relay.build`.
## What is `Artifact`?

`Artifact` is a new TVM `Object` subclass that replaces `runtime::Module` as the return value from TVM codegen, including BYOC. To bridge the gap between `Artifact` and `Module`, this RFC proposes a new `load_artifact` process which makes the code loading process explicit in the TVM C++ runtime. That is to say, `tvm.relay.build` will stop returning `runtime::Module` and start returning `ArtifactSet` (i.e. a collection of `Artifact`) instead. `Artifact` is proposed to be defined as follows:
```cpp
class Artifact : public ::tvm::runtime::Object {
 public:
  // Identifies the codegen that produced this artifact.
  std::string codegen_id;

  // Identifies the loader to use when loading this module.
  std::string loader;

  // A file name unique within codegen_id.
  std::string file_name;

  // Binary content of this Artifact.
  std::string content;
};
```
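To make the data model concrete, here is a purely hypothetical Python sketch of how an `ArtifactSet` might be inspected under this proposal. None of these names (`ArtifactSet`, the Python field accessors, the changed return type of `relay.build`) exist today; `relay_mod` and `params` are placeholders.

```python
import tvm
from tvm import relay

# Hypothetical: under this proposal, relay.build would return an ArtifactSet.
artifacts = relay.build(relay_mod, target="llvm", params=params)

for artifact in artifacts:
    # Each Artifact carries enough metadata to be handled individually.
    print(artifact.codegen_id,   # e.g. "llvm", "c", or a BYOC codegen name
          artifact.loader,       # e.g. "native", "cuda", "metadata"
          artifact.file_name)    # unique within codegen_id, e.g. "lib0.o"
    # content is an opaque binary blob that could be written straight to disk.
    with open(artifact.file_name, "wb") as f:
        f.write(artifact.content)
```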
## Why do we need `Artifact`?

There are several different ways to motivate `Artifact`. Given the scale of this refactor, you might not immediately choose to do this based on any one of them. However, when taken together, my opinion is that these issues stem from an anti-pattern developing in the TVM compiler data structures.
### Defining "Code Loading"

One motivator of `Artifact` is microTVM and the C runtime, where TVM is not necessarily producing binary code. Even when the `llvm` backend is used (producing a `.o`), the microTVM workflow is that the user must "load" the generated code by compiling the `.o` into a firmware binary image. The question is: what is that "load" process?

What we want to describe is the equivalent of `tvm.runtime.load_module(mod.export_library())`, just with the `load_module` call being done at firmware compile time (targeting a very tiny µC with no RAM). It's particularly hard to explain this for a few reasons:

- TVM actually has two code-loading processes today in the normal Python-based C++ TVM runtime (illustrated in the snippet after this list). When you `load_module(mod.export_library())` today, you get something very different from what you got from `tvm.relay.build`, because `load_module` uses a different code path to construct Modules than is necessarily used in codegen. Neither of these processes makes sense in a world where `load_module` is handled at firmware compile time.
- `export_library` is designed to produce a `.so` for the C++ runtime. The `.so` contains a bunch of pieces (a few per codegen), and none of these pieces are named outside of the `Module` `type_key`. At present, we produce at least 2 `c` and 1-2 `llvm` modules already. It's difficult to explain "move this Module to here" or "run a downstream translator on this Module" when they don't have names.
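For reference, here is what those two paths look like with today's Python API. This is standard TVM usage, nothing new; `relay_mod`, `params`, and `"model.so"` are placeholders.

```python
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Path 1: use the Module tree exactly as codegen produced it, in memory.
lib = relay.build(relay_mod, target="llvm", params=params)
dev = tvm.cpu()
m1 = graph_executor.GraphModule(lib["default"](dev))

# Path 2: export to a shared library and reload it. load_module() rebuilds the
# Module tree through a completely different code path (dlopen + metadata blob).
lib.export_library("model.so")
loaded = tvm.runtime.load_module("model.so")
m2 = graph_executor.GraphModule(loaded["default"](dev))
```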
### Composite Targets

In some cases, code loading is straightforward (e.g. for the `llvm` backend, link directly into the binary). There are plenty of other cases, particularly with BYOC, where this is not true:

- A system with two CPUs, a low-power simple CPU and a high-power DSP. `llvm` must be configured twice and the output sorted into the different code memories for each CPU.
- A system with many reconfigurable accelerators, e.g. FPGAs or programmable DSPs. Each accelerator instance would correspond to a `DLDevice`, but among those accelerators, configuration differences could complicate the code loading process.

None of these cases are a primary use case of TVM today, but the lack of metadata on TVM's codegen outputs is a key obstacle to targeting systems such as these. And none of these examples are rare or particularly strange designs.
### Debugging

`Module`s that are saved using `SaveToBinary` (i.e. `type_key` ≠ `llvm` or `c`) each implement their own serialization format. When a codegen produces multiple artifacts (e.g. ROCm, CUDA, Vitis-AI), the pattern has been to return a single `Module` containing all the artifacts, concatenated together. This is very difficult for a user to debug from outside TVM.

TVM could provide a standard facility to write the generated code to disk in a human-readable way, but this is hard because there is no metadata attached to each individual piece of the binary. This identifying metadata is the same metadata that users need in order to consume each piece separately, i.e. when doing code loading themselves. The snippet below shows how far you can get today.
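As a concrete illustration of today's limits (standard TVM API, assuming `lib` is the factory returned by `relay.build` and that `get_lib()` is available on it; `relay_mod` and `params` are placeholders): source-based modules expose `get_source()`, but `SaveToBinary`-based modules only expose an unnamed binary blob.

```python
import tvm
from tvm import relay

lib = relay.build(relay_mod, target="llvm", params=params)  # factory module
runtime_mod = lib.get_lib()  # the underlying runtime.Module tree

def dump(mod, indent=0):
    # Walk the Module tree; only source-based modules can be shown readably.
    print(" " * indent + mod.type_key)
    if mod.type_key in ("c", "llvm"):
        print(mod.get_source()[:200])  # human-readable C source or LLVM IR
    # SaveToBinary-based modules offer no standard, named, per-piece dump.
    for imported in mod.imported_modules:
        dump(imported, indent + 2)

dump(runtime_mod)
```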
### `Module` re-use from codegen

Codegen must return generated code in a `Module`. The rules around which `Module` to use are confusing. When viewed from a runtime perspective, it seems perfectly clear that a `Module` implementation could be re-used whenever generated functions are executed the same way. But when viewed from the codegen perspective, it's unclear why, for example, a `CSourceModule` should not be usable by any codegen producing C, particularly when it runs on a `DLDevice` other than the `target_host`.

The root issue is `runtime.Module`'s dual role as code container and runtime interface. Its metadata is limited to a single field, `type_key`, which is essentially used during `load_module` to decide how to produce a `Module` instance from the on-disk representation. It would be possible to add more metadata fields, but because of the dual roles of `Module`, this may overcomplicate implementations, which essentially sit at the cross-product of (output_format, runtime_method).
## Code Loading Today: Two Paths to a GraphExecutor

This section further explores the two different codepaths TVM uses to load kernel code.

### Steps to `tvm.relay.build()`

```
+----------+      +------------+  export_library()
| TIR Task | ---> |   Module   |--------------------+
+----------+      +------------+                    |
                        ^                           v
                        |                    +-------------+
                        +--------------------|   lib.so    |
                            load_module()    +-------------+
```
Right now, `tvm.relay.build` does the following (roughly; don't hold me to this 100%):

- Relay Scheduling. Each Relay operator is implemented in TE with a template schedule.
- Optimization. TE is optimized, converted to TIR, and optimized again (e.g. operator fusion). A set of TIR tasks is produced (a task is one group of fused operators).
- Graph Memory Planning. TIR task inputs and outputs are assigned to buffers.
- Code Generation. TIR tasks are passed to a code-generator PackedFunc named `target.build.<kind>` (see the snippet after this list):
  - IR Transformation. The code-generator walks TIR and emits source code (`c`, `cuda`, etc. backends) or another IR (`llvm` backend).
  - Compilation. In most cases, a model is compiled to bytecode. In some cases (e.g. `c`, `cuda`), compilation is skipped and done either at load time or when `Module#GetFunction` is called.
  - `Module` construction. A `runtime::Module` is created to hold the compiled artifact.
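The dispatch by target kind in the Code Generation step is visible from Python today. A small, standard-API check (assuming an LLVM-enabled TVM build, since `target.build.llvm` is only registered then):

```python
import tvm

# Codegen entry points are registered PackedFuncs named target.build.<kind>.
# Each takes lowered TIR (an IRModule of PrimFuncs) plus a Target and returns
# a runtime.Module holding the generated code.
llvm_codegen = tvm.get_global_func("target.build.llvm")
c_codegen = tvm.get_global_func("target.build.c")
print(llvm_codegen, c_codegen)
```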
### Output of `tvm.relay.build()`

The output of this process is a tree of `Module`s. There are two possible topologies for this tree (`*` indicates the actual `Module` returned from `tvm.relay.build`):

- Topology 1: with only DSO-Exportable Modules (`type_key` in `("c", "llvm")`):

  ```
  +----------------------------------+
  | BYOC output 2 (llvm or c module) |
  +----------------------------------+
              ^
              | (imports)
              |
  +--------------------+  (imports)  +--------------------------------+
  | * llvm or c output |------------>| BYOC output (llvm or c module) |
  +--------------------+             +--------------------------------+
  ```

- Topology 2: with some non-DSO-Exportable Modules:

  ```
  +----------------------------------+
  | BYOC output 2 (e.g. llvm module) |
  +----------------------------------+
             ^
             | (imports)
             |
  +------------------+  (imports)  +--------------------------------+
  | llvm or c output |             | BYOC output (e.g. cuda module) |
  +------------------+             +--------------------------------+
           |                                       |
           +-------------------+-------------------+
                               |
           +---------------------------------------+
           | * Metadata Module (type_key=metadata) |
           +---------------------------------------+
  ```

  (Suppose in this toy example that CUDA produces some code that runs on the CUDA device and some that runs on the `target_host` CPU.)
### How the output is consumed

Now you can do two things:

1. Run inference straight away, by instantiating `GraphExecutor`. `GraphExecutor` uses precisely this `Module` structure.

2. Export the library and reload it into a later TVM instance (a round-trip snippet follows this list). In that case, you actually do the following:

   - Build a shared library:
     - Reorganize the tree into DSO-Exportable and non-DSO-Exportable modules:

       ```
                   DSO-Exportable modules
       + - - - - - - - - - - - - - - - - - - - - -+
       | +----------------------------------+     |
       | | BYOC output 2 (e.g. llvm module) |     |
       | +----------------------------------+     |
       |            ^                             |
       |            | (imports)                   |
       |            |                             |
       | +------------------+ |    (imports)   +--------------------------------+
       | | llvm or c output | |                | BYOC output (e.g. cuda module) |
       | +------------------+ |                +--------------------------------+
       + - - - - - -|- - - - -+                                 |
                    |                                           |
                    +---------------------+---------------------+
                                          |
                      +---------------------------------------+
                      | * Metadata Module (type_key=metadata) |
                      +---------------------------------------+
       ```

     - Write each DSO-Exportable module to disk as e.g. `libN.o`.
     - Call `PackImportsToLLVM`, which serializes the remainder of the tree by calling `Module::SaveToBinary` on each non-DSO-Exportable module, and then writes the resulting blob to `devc.o`.
     - Link `libN.o` and `devc.o` into a shared library `.so`.

   - Load the shared library:
     - `dlopen` the shared library to attach it to the TVM process. Place it inside a `LibraryModule`.
     - Look for a special symbol, `__tvm_dev_mblob`, which was inside `devc.o`. If it exists, use `ProcessModuleBlob` to reconstruct the non-DSO-Exportable tree. Each Module is reconstructed using the PackedFunc `runtime.module.loadbinary_<type_key>`.
     - You are left with a new Module tree:

       ```
       +------------------+  (imports)  +--------------------------------+
       |  LibraryModule   |             | BYOC output (e.g. cuda module) |
       +------------------+             +--------------------------------+
                |                                        |
                +--------------------+-------------------+
                                     |
                +---------------------------------------+
                | * Metadata Module (type_key=metadata) |
                +---------------------------------------+
       ```
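To see the export-and-reload path end to end with today's API (standard TVM usage; `relay_mod`, `params`, and `"model.so"` are placeholders, and the exact `type_key` values depend on your build configuration):

```python
import tvm
from tvm import relay
from tvm.contrib import graph_executor

lib = relay.build(relay_mod, target="llvm", params=params)
lib.export_library("model.so")                 # the link step happens here

loaded = tvm.runtime.load_module("model.so")   # dlopen + __tvm_dev_mblob processing
print(loaded.type_key)                         # typically "library" for the dlopen'd .so
print([m.type_key for m in loaded.imported_modules])

m = graph_executor.GraphModule(loaded["default"](tvm.cpu()))
```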
### Why is this bad?

- When the `.so` is built, the DSO-Exportable modules are linked together. Any weak symbols, extern symbols, etc. may get resolved into a different DSO-Exportable module. It's really hard to test that this can never happen in a bad way.
- Any `loadbinary_` function has to behave exactly inversely to the `SaveToBinary` function which produced its input. It's really hard to test this in all cases.
- Because of these, inference may run differently between when a Module is first generated and when it's deployed later on. This is also part of the process that we need to convey to microTVM developers. It's incredibly complex and, in the µTVM case, the side effects of the first point are impossible to avoid.
## Proposed Changes

### Broadly

- TVM codegens (i.e. the builtin codegens plus any BYOC `relay.ext.` codegens) will produce `Artifact`, not `Module`.
- `tvm.relay.build` will return `ArtifactSet` in place of `Module`.
- Define functions to store and load `Artifact`.
- Rework `export_library` to use `Artifact` and to build around the load format discussed in the next bullet point.
- Define an explicit load process that converts `ArtifactSet` to `Module`. All code loading will be done in this way.
- When you instantiate a GraphExecutor from GraphExecutorFactory, run the explicit code loading process to link a DSO and produce the `Module` tree.
  - Exception: when intending to use the LLVM JIT, you can specify a new target, `llvmjit`. When building against this target, you cannot `export_library`, and you can only construct a GraphExecutor in memory. This may be useful when TVM is used as e.g. an in-memory PyTorch backend.

A rough sketch of what this flow could look like from Python follows.
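None of the APIs below exist yet; `ArtifactSet`, `tvm.runtime.load_artifacts`, `tvm.runtime.load_artifact`, and the method names are purely hypothetical and only illustrate the separation between codegen output and code loading proposed above. `relay_mod` and `params` are placeholders.

```python
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Hypothetical: relay.build returns an ArtifactSet instead of a Module.
artifacts = relay.build(relay_mod, target="llvm", params=params)

# Store/restore without any loading side effects.
artifacts.export("build/artifacts")                        # hypothetical
restored = tvm.runtime.load_artifacts("build/artifacts")   # hypothetical

# Explicit code loading: link native artifacts into a DSO, run the loaders,
# and only then obtain the runtime Module tree.
module = tvm.runtime.load_artifact(restored)               # hypothetical
m = graph_executor.GraphModule(module["default"](tvm.cpu()))
```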
### The `target_host` Link Step

Some `Artifact` (`llvm` and `c`) contain code that should be executed directly by the same CPU used to operate the `GraphExecutor`. The main functional change this RFC proposes is:

All exportable `llvm` and `c` Artifacts executed directly by the `target_host` CPU need to be linked into a `.so` before being loaded.

This step ensures that:

- The compiler artifact can be written to disk and reloaded without changing it; specifically, any linker side effects have occurred before the user tests the compiler artifacts.
- Our unit tests actually test what we could deploy, instead of an in-memory representation.
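For reference, the link step itself already exists in TVM's Python support code; `export_library` ultimately drives the host toolchain through helpers such as `tvm.contrib.cc`. A minimal sketch of that step (the object file paths are placeholders):

```python
from tvm.contrib import cc

# Link previously written host-CPU object files into a shared library.
# This is the kind of step the RFC proposes to make an explicit, mandatory
# part of loading llvm/c artifacts that run on the target_host.
cc.create_shared("native.so", ["lib0.o", "lib1.o", "devc.o"])
```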
### Handling LLVM JIT

Requiring that all `llvm` Modules be linked before being executed could unnecessarily penalize the case where TVM serves as a backend to other frameworks. In this case, TVM is expected to compile and immediately run a function. Any export is done for debug purposes only, not because TVM's export format is being used to restore the artifact for execution later on.

To handle this case, a new target `llvmjit` will be introduced. `llvmjit` produces a special `Artifact` which also retains an in-memory representation that, during code loading, can be directly transferred to a `Module`. This `Artifact` can still be saved to disk, but is loaded through `loadbinary_llvmjit`, which reconstructs the in-memory representation from LLVM bitcode.

- `Artifact#export` and `Artifact#load` will translate directly between a binary blob and an Artifact, with no processing done.
- `ArtifactSet#export_library` will behave similarly to before, but the "link" step will be made explicit in the docs.
- `tvm.runtime.load_module` will behave similarly to before, but `ProcessModuleBlob` will perform the Code Loading process identified below.
- `GraphExecutorFactory#export_model_library_format` (not to be implemented in this RFC) would store `ArtifactSet` in a directory tree directly.
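A hypothetical usage sketch; the `llvmjit` target and `tvm.runtime.load_artifact` do not exist and are only illustrative, and `relay_mod`/`params` are placeholders:

```python
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Hypothetical: build for in-memory JIT execution only. export_library would be
# disallowed for this target, so the result can only be run in-process.
artifacts = relay.build(relay_mod, target="llvmjit", params=params)
module = tvm.runtime.load_artifact(artifacts)   # hypothetical loader
m = graph_executor.GraphModule(module["default"](tvm.cpu()))
```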
### The Code Loading Process

For the C++ runtime, the code loading process is explicitly defined to be (a pseudocode sketch follows this list):

- Partition the `Artifact`s contained in the `ArtifactSet` into groups according to the `loader` attribute.
- Create a top-level `MetadataModule` from the `Artifact` with `loader="metadata"`.
- Create a "host" `LibraryModule` implementing the `target_host` functions, which, for this initial RFC, are identified as `Artifact` with `loader="native"`:
  - If artifacts with `loader="native"` are present, they are linked into a DSO, `native.so`, and loaded with `LibraryModule`.
  - If no artifacts with `loader="native"` are present, this process assumes `load_module` was called on a DSO. Create a `LibraryModule` that wraps the DSO and use it in place of the `native.so` module from the previous bullet.
- Iterate over the other `Artifact` groups in order sorted by loader name, running the loader function defined as `runtime.module.loadbinary_<loader>`. This function produces a `Module` tree. Import that `Module` tree into the host `LibraryModule`.
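A rough, purely illustrative Python-level pseudocode of the above. The real implementation would live in the C++ runtime; everything except `tvm.contrib.cc.create_shared`, `tvm.get_global_func`, and `Module.import_module` is hypothetical, and the `loadbinary_*` signature shown is an assumption.

```python
import tvm
from tvm.contrib import cc

def load_artifact_set(artifacts):  # hypothetical entry point
    # 1. Partition by loader.
    groups = {}
    for a in artifacts:
        groups.setdefault(a.loader, []).append(a)

    # 2. Top-level MetadataModule from the loader="metadata" artifact.
    metadata_mod = create_metadata_module(groups.pop("metadata"))  # hypothetical

    # 3. Host LibraryModule: link loader="native" artifacts into native.so.
    native = groups.pop("native", [])
    if native:
        objects = [write_to_disk(a) for a in native]     # hypothetical helper
        cc.create_shared("native.so", objects)
        host_mod = load_dso("native.so")                 # hypothetical helper
    else:
        host_mod = wrap_current_dso()                    # hypothetical helper

    # 4. Run each remaining loader and import the resulting Module tree.
    for loader in sorted(groups):
        loadbinary = tvm.get_global_func("runtime.module.loadbinary_" + loader)
        host_mod.import_module(loadbinary(groups[loader]))  # signature assumed

    metadata_mod.import_module(host_mod)
    return metadata_mod
```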
### Changes to `export_library`

`export_library` will be changed as follows:

- The `target_host` link step is already performed, so no changes here other than to docs.
- `PackImportsTo*` will accept a binary blob produced by concatenating `Artifact#export` outputs together.
### Inspecting Artifacts, Exporting `Artifact`, and `load_module` Changes

Speaking of the export process, here is what will change:

- The DSO-Exportable portion will be largely the same: any `Artifact` with `loader == "dso"` will be written to its `file_name` and become a part of the link process.
- The Metadata part will change a bit, though really not much is actually changing (yet):
  - `Artifact`s from the same `codegen` with the same `loader` will be grouped together in `__tvm_dev_mblob`. No tree will exist. Each field of `Artifact` is written to the file.
  - At load time, each group of `Artifact` in `__tvm_dev_mblob` will be reconstructed into `Artifact` instances. The group of `Artifact` will be passed to `loadbinary_<loader>`, which will return a `Module`.
  - The returned `Module` will be imported into the `MetadataModule`.
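A rough illustration of the grouping described above. This RFC does not specify an on-disk encoding; the length-prefixed layout here is made up and only shows the (codegen_id, loader) grouping and the per-field writes.

```python
import struct
from collections import defaultdict

def pack_dev_mblob(artifacts):  # hypothetical; real packing would happen in C++
    groups = defaultdict(list)
    for a in artifacts:
        if a.loader != "dso":            # dso artifacts go to the linker instead
            groups[(a.codegen_id, a.loader)].append(a)

    blob = b""
    for (codegen_id, loader), group in sorted(groups.items()):
        for a in group:
            # Every Artifact field is written out; no Module tree is recorded.
            for field in (codegen_id, loader, a.file_name, a.content):
                data = field if isinstance(field, bytes) else field.encode()
                blob += struct.pack("<Q", len(data)) + data
    return blob
```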
## Future Directions

This is the first of two changes which enable BYOC with the C runtime to fit into Model Library Format. In part 2, we will:

- Introduce `target_key`, a shorthand that identifies a sub-target. Multiple `target_key`s may have the same sub-target string but represent distinct targets. For instance, there could be two identical FPGAs which can be programmed differently to accelerate various workloads.
- Add `target_key` to `Artifact`.
- Further group `Artifact` by `target_key` at export and load time. In Model Library Format, prefix filenames with `target_key` so that e.g. FPGA bitfiles can be distinguished.
- Modify the runtime API to specify `TVMDevice` as `target_key: List[TVMDevice]`. You can always have more device instances that implement the `target_key` design.