Posting up this pre-RFC for comments. For context, it was written a few months back, so apologies if things are slightly out of date.
Background
TVM’s high-level build function tvm.relay.build currently returns a runtime::Module instance meant to be “ready-to-run.” This means that tvm.relay.build (and its backends) is responsible for code generation, code serialization, and code loading. As these processes become more and more complex, runtime::Module must be overloaded to contain:
- Generated kernels to implement TIR tasks
- Metadata useful when loading and running the models (e.g. type of accelerator to target)
- When Module contains source code:
  - Source code for any downstream compilers
  - Metadata to configure any downstream compilers, e.g. gcc, cuda, etc.
In some cases, things are simple enough that we haven’t observed many ill effects from this overload. In extreme cases, such as when targeting exotic runtime scenarios e.g. microTVM, this overloaded structure makes it difficult to expand the compiler while staying within the “ready-to-run” output expected of tvm.relay.build.
What is Artifact?
Artifact is a new TVM Object subclass that replaces runtime::Module as the return value from TVM codegen, including BYOC. To bridge the gap between Artifact and Module, this RFC proposes a new load_artifact process which makes code loading explicit in the TVM C++ runtime. That is to say, tvm.relay.build will stop returning runtime::Module and start returning ArtifactSet (i.e. a collection of Artifact) instead. Artifact is proposed to be defined as follows:
class Artifact : public ::tvm::runtime::Object {
public:
// Identifies the codegen that produced this artifact.
std::string codegen_id;
// Identifies the loader to use when loading this module
std::string loader;
// A file name unique within codegen_id.
std::string file_name;
// Binary content of this Artifact.
std::string content;
};
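For concreteness, here is a minimal sketch of what ArtifactSet could look like. The vector layout and the ByLoader helper are illustrative assumptions, not part of the proposal:

```cpp
#include <string>
#include <vector>
#include <tvm/runtime/object.h>

// Hypothetical companion container; the proposal only specifies that
// tvm.relay.build returns a collection of Artifact.
class ArtifactSet : public ::tvm::runtime::Object {
 public:
  // All artifacts produced by a single tvm.relay.build invocation.
  std::vector<Artifact> artifacts;

  // Illustrative helper: select the artifacts consumed by one loader.
  std::vector<Artifact> ByLoader(const std::string& loader) const {
    std::vector<Artifact> out;
    for (const Artifact& a : artifacts) {
      if (a.loader == loader) out.push_back(a);
    }
    return out;
  }
};
```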
Why do we need Artifact?
There are several different ways to motivate Artifact. Given the scale of this refactor, you might not immediately choose to do it based on any one of them. However, taken together, these issues suggest (in my opinion) an anti-pattern developing in the TVM compiler data structures.
Defining “Code Loading”
One motivator of Artifact is microTVM and the C runtime, where TVM is not necessarily producing binary code. Even when the llvm backend is used (producing a .o), the microTVM workflow requires the user to “load” the generated code by compiling the .o into a firmware binary image. The question is: what is that “load” process?
What we want to describe is the equivalent of tvm.runtime.load_module(mod.export_library()), just with the load_module call being done at firmware compile time (targeting a very tiny µC with no RAM). It’s particularly hard to explain this for a few reasons:
- TVM actually has two code-loading processes now in the normal Python-based C++ TVM runtime. When you load_module(mod.export_library()) today, you get something very different from what you got from tvm.relay.build, because load_module uses a different code path to construct Modules than is necessarily used in codegen. Neither of these processes makes sense in a world where load_module is handled at firmware compile time.
- export_library is designed to produce a .so for the C++ runtime. The .so contains a bunch of pieces (a few per codegen), and none of these pieces are named outside of the Module type_key. At present, we produce at least 2 c and 1-2 llvm modules already. It’s difficult to explain “move this Module to here” or “run a downstream translator on this Module” when they don’t have names (see the example below).
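For illustration, here is what one named piece of c-backend output could look like under this proposal (the field values here are hypothetical):

```cpp
// Hypothetical example: one of the c-codegen outputs, now carrying a name
// and loader metadata instead of being an anonymous piece inside a .so.
Artifact MakeExampleArtifact() {
  Artifact a;
  a.codegen_id = "c";       // produced by the c backend
  a.loader = "native";      // linked into the target_host DSO at load time
  a.file_name = "lib0.c";   // unique within codegen_id
  a.content = "/* generated C source */";
  return a;
}
```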
Composite Targets
In some cases, code loading is straightforward (e.g. for the llvm backend, link directly into the binary). There are plenty of other cases, particularly with BYOC, where this is not true:
- A system with two CPUs: a low-power simple CPU and a high-power DSP. llvm must be configured twice and the output sorted into the different code memories for each CPU.
- A system with many reconfigurable accelerators, e.g. FPGAs or programmable DSPs. Each accelerator instance would correspond to a DLDevice, but among those accelerators, configuration differences could complicate the code loading process.
None of these cases are a primary use case of TVM today, but the lack of metadata on TVM’s codegen outputs is a key obstacle to targeting systems such as these. And, none of these examples are rare or particularly strange designs.
Debugging
Modules that are saved using SaveToBinary (i.e. type_key ≠ llvm or c) each implement their own serialization format. When a codegen produces multiple artifacts (e.g. ROCm, CUDA, Vitis-AI), the pattern has been to concatenate them and return a single Module containing all of them. This is very difficult for a user to debug from outside TVM.
TVM could provide a standard facility to write the generated code to disk in a human-readable way, such as the sketch below, but this is hard today because there is no metadata attached to each individual piece of the binary. This identifying metadata is the same metadata that users need to consume each piece separately, i.e. when doing the code loading yourself.
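A minimal sketch of such a facility, assuming the Artifact fields above (DumpArtifacts is not a proposed API, just an illustration of what per-piece metadata enables):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Write each generated piece to a path a user can inspect, e.g.
// <dir>/c/lib0.c or <dir>/cuda/devc.cu. Directory creation elided.
void DumpArtifacts(const std::vector<Artifact>& artifacts,
                   const std::string& dir) {
  for (const Artifact& a : artifacts) {
    std::ofstream f(dir + "/" + a.codegen_id + "/" + a.file_name,
                    std::ios::binary);
    f << a.content;
  }
}
```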
Module re-use from codegen
Codegen must return generated code in a Module, and the rules around which Module to use are confusing. When viewed from a runtime perspective, it seems perfectly clear that a Module implementation could be re-used whenever generated functions are executed the same way. But when viewed from the codegen perspective, it’s unclear why e.g. a CSourceModule should not be usable by any codegen producing C, particularly when it runs on a DLDevice other than the target_host.
The root issue is runtime.Module’s dual roles as code container and runtime interface (see the sketch below). Its metadata is limited to a single field, type_key, which is essentially used during load_module to decide how to produce a Module instance from the on-disk representation. It would be possible to add more metadata fields, but because of Module’s dual roles, doing so may overcomplicate implementations, which essentially sit at the cross-product of (output_format, runtime_method).
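A sketch of the dual roles, using the existing (pre-refactor) ModuleNode interface; treat the exact signatures as approximate:

```cpp
#include <tvm/runtime/module.h>

// The same class must implement the runtime interface (GetFunction) and
// the code-container/serialization interface (type_key, SaveToBinary).
class MyCodegenModuleNode : public tvm::runtime::ModuleNode {
 public:
  // Runtime role: look up a generated function by name.
  tvm::runtime::PackedFunc GetFunction(
      const std::string& name,
      const tvm::runtime::ObjectPtr<tvm::runtime::Object>& sptr_to_self) final {
    return tvm::runtime::PackedFunc();  // lookup elided
  }
  // Code-container role: the single metadata field discussed above.
  const char* type_key() const final { return "mycodegen"; }
  // Code-container role: a serialization format private to this Module.
  void SaveToBinary(dmlc::Stream* stream) final { /* elided */ }
};
```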
Code Loading Today—Two Paths to a GraphExecutor
This section further explores the two different codepaths TVM uses to load kernel code.
Steps to tvm.relay.build()
+----------+      +------------+   export_library()
| TIR Task | ---> |   Module   |----------------------+
+----------+      +------------+                      |
                        ^                             v
                        |                      +-------------+
                        +----------------------|   lib.so    |
                            load_module()      +-------------+
Right now, tvm.relay.build does the following (roughly—don’t hold me to this 100%):
- Relay Scheduling. Each Relay operator is implemented in TE with a template schedule.
- Optimization. TE is optimized, converted to TIR, and optimized again (e.g. operator fusion). A set of TIR tasks is produced (a task is one group of fused operators).
- Graph Memory Planning. TIR Task inputs and outputs are assigned to buffers.
- Code Generation. TIR Tasks are passed to a code-generator PackedFunc named target.build.<kind> (sketched after this list).
  - IR Transformation. The code-generator walks TIR and emits source code (c, cuda, etc. backends) or another IR (llvm backend).
  - Compilation. In most cases, a model is compiled to bytecode. In some cases (e.g. c, cuda), compilation is skipped and done either at load time or when Module#GetFunction is called.
  - Module construction. A runtime::Module is created to hold the compiled artifact.
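For reference, the codegen entry point follows TVM’s global-function registration pattern. A sketch for the c backend (BuildCHost is the real entry point in upstream TVM, but treat the exact signature as approximate):

```cpp
#include <tvm/runtime/registry.h>

using namespace tvm;

// tvm.relay.build looks up the PackedFunc registered as target.build.<kind>
// for each Target kind and calls it with the lowered IRModule. Under this
// RFC the return value would become Artifact(s) rather than runtime::Module.
TVM_REGISTER_GLOBAL("target.build.c")
    .set_body_typed([](IRModule mod, Target target) -> runtime::Module {
      return BuildCHost(mod, target);  // existing c-backend entry point
    });
```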
Output of tvm.relay.build()
The output of this process is a tree of Module. There are two possible topologies for this tree (* indicates the actual Module returned from tvm.relay.build):
- Topology 1: with only DSO-Exportable Modules (type_key in ("c", "llvm")):

                                       +----------------------------------+
                                       | BYOC output 2 (llvm or c module) |
                                       +----------------------------------+
                                                      ^
                                                      | (imports)
                                                      |
    +--------------------+  (imports)  +--------------------------------+
    | * llvm or c output |------------>| BYOC output (llvm or c module) |
    +--------------------+             +--------------------------------+

- Topology 2: with some non-DSO-Exportable Modules:

                                       +----------------------------------+
                                       | BYOC output 2 (e.g. llvm module) |
                                       +----------------------------------+
                                                      ^
                                                      | (imports)
                                                      |
    +------------------+               +--------------------------------+
    | llvm or c output |               | BYOC output (e.g. cuda module) |
    +------------------+               +--------------------------------+
             |                                        |
             +--------------------+-------------------+
                      (imports)   |
            +---------------------------------------+
            | * Metadata Module (type_key=metadata) |
            +---------------------------------------+

  (Suppose in this toy example that CUDA produces some code that runs on the CUDA device and some that runs on the target_host CPU.)
How the output is consumed
Now you can do two things:
- Run inference straight away, by instantiating GraphExecutor. GraphExecutor uses precisely this Module structure.
- Export the library and reload it into a later instance. In that case, you actually do the following:
  - Build a shared library:
    - Reorganize the tree into DSO-Exportable and non-DSO-Exportable modules:

            DSO-Exportable modules
          + - - - - - - - - - - - - - - - - - - - - +
          |  +----------------------------------+   |
          |  | BYOC output 2 (e.g. llvm module) |   |
          |  +----------------------------------+   |
          |                 ^                       |
          |                 | (imports)   + - - - - +
          |                 |             |
          |  +------------------+         |   +--------------------------------+
          |  | llvm or c output |         |   | BYOC output (e.g. cuda module) |
          |  +------------------+         |   +--------------------------------+
          + - - - - - - - - -|- - - - - - +                   |
                             |                                |
                             +-------------+------------------+
                                           |
                     +---------------------------------------+
                     | * Metadata Module (type_key=metadata) |
                     +---------------------------------------+

    - Write each DSO-Exportable module to disk as e.g. libN.o.
    - Call PackImportsToLLVM, which serializes the remainder of the tree by calling Module::SaveToBinary on each non-DSO-Exportable module, and then writes the resulting blob to devc.o.
    - Link libN.o and devc.o into a shared library .so.
  - Load the shared library:
    - dlopen the shared library to attach it to the TVM process. Place it inside a LibraryModule.
    - Look for a special symbol __tvm_dev_mblob, which was inside devc.o. If it exists, use ProcessModuleBlob to reconstruct the non-DSO-Exportable tree. Each Module is reconstructed using the PackedFunc runtime.module.loadbinary_<type_key> (see the sketch after this list).
    - You are left with a new Module tree:

          +------------------+             +--------------------------------+
          |  LibraryModule   |             | BYOC output (e.g. cuda module) |
          +------------------+             +--------------------------------+
                   |                                        |
                   +--------------------+-------------------+
                            (imports)   |
              +---------------------------------------+
              | * Metadata Module (type_key=metadata) |
              +---------------------------------------+
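The __tvm_dev_mblob probe just described looks roughly like this in the C++ runtime. This is simplified from TVM’s library module loading code; treat the helper names and signatures as approximate:

```cpp
#include <tvm/runtime/module.h>

// After dlopen, probe the library for the packed-imports blob written by
// PackImportsToLLVM and, if found, rebuild the non-DSO-Exportable tree.
tvm::runtime::Module LoadFromLibrary(
    tvm::runtime::ObjectPtr<tvm::runtime::Library> lib) {
  const char* blob = reinterpret_cast<const char*>(
      lib->GetSymbol(tvm::runtime::symbol::tvm_dev_mblob));  // "__tvm_dev_mblob"
  if (blob != nullptr) {
    // Dispatches on each type_key to runtime.module.loadbinary_<type_key>.
    return ProcessModuleBlob(blob, lib);
  }
  return CreateModuleFromLibrary(lib);  // plain LibraryModule, no imports
}
```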
Why is this bad?
- When the .so is built, the DSO-Exportable modules are linked together. Any weak symbols, extern symbols, etc. may get resolved into a different DSO-Exportable module. It’s really hard to test that this never happens in a bad way.
- Any loadbinary function has to behave exactly inversely to the SaveToBinary function that produced its input. It’s really hard to test this in all cases.
- Because of these, inference may run differently between when a Module is first generated and when it’s deployed later on. This is also part of the process that we need to convey to microTVM developers. It’s incredibly complex and, in the µTVM case, the side effects of the first point are impossible to avoid.
Proposed Changes
Broadly
- TVM codegens (i.e. the builtin ones plus any BYOC relay.ext.) will produce Artifact, not Module.
- tvm.relay.build will return ArtifactSet in place of Module.
- Define functions to store and load Artifact.
- Rework export_library to use Artifact and to fit the load process discussed in the next bullet point.
- Define an explicit load process that converts ArtifactSet to Module. All code loading will be done in this way.
- When you instantiate a GraphExecutor from GraphExecutorFactory, run the explicit code loading process to link a DSO and produce the Module tree.
  - Exception: when intending to use LLVM JIT, you can specify a new target llvmjit. When building against this target, you cannot export_library and you can only construct the GraphExecutor in memory. This may be useful when TVM is used as e.g. an in-memory PyTorch backend.
The target_host Link Step
Some Artifact (llvm and c) contain code that should be executed directly by the same CPU used to operate the GraphExecutor. The main functional change this RFC proposes is:
All exportable llvm and c Artifact executed directly by the target_host CPU need to first be linked into an .so before being loaded.
This step ensures that:
- The compiler artifact can be written to disk and reloaded without changing it; specifically, any linker side effects have occurred before the user tests the compiler artifacts.
- Our unit tests actually test what we could deploy, instead of an in-memory representation.
Handling LLVM JIT
Requiring that all llvm Modules be linked before being executed could unnecessarily penalize the case where TVM serves as a backend to other frameworks. In this case, TVM is expected to compile and immediately run a function. Any export is done for debug purposes only, not because TVM’s export format is being used to restore the artifact for execution later on.
To handle this case, a new target llvmjit will be introduced. llvmjit produces a special Artifact which also retains an in-memory representation which, during code loading, can be directly transferred to a Module. This Artifact can still be saved to disk, but is loaded through loadbinary_llvmjit, which reconstructs the in-memory representation from LLVM bitcode (sketched below).
- Artifact#export and Artifact#load will directly translate between binary blob and Artifact with no processing done.
- ArtifactSet#export_library will behave similarly to before, but:
  - the “link” step will be made explicit in docs
- tvm.runtime.load_module will behave similarly to before, but ProcessModuleBlob will perform the Code Loading process identified below.
- GraphExecutorFactory#export_model_library_format (not to be implemented in this RFC) would store ArtifactSet in a directory tree directly.
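A sketch of how the proposed llvmjit loader could slot into the existing loadbinary convention. Existing loadbinary functions receive a dmlc::Stream* and return a Module; LLVMJITModuleLoadBinary and the bitcode helper below are hypothetical:

```cpp
#include <dmlc/io.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/registry.h>

// Hypothetical loader for the proposed llvmjit target: rebuild the
// in-memory LLVM module from serialized bitcode rather than dlopen-ing a
// linked .so.
tvm::runtime::Module LLVMJITModuleLoadBinary(void* strm) {
  auto* stream = static_cast<dmlc::Stream*>(strm);
  std::string bitcode;
  stream->Read(&bitcode);
  return CreateLLVMJITModuleFromBitcode(bitcode);  // assumed helper
}

TVM_REGISTER_GLOBAL("runtime.module.loadbinary_llvmjit")
    .set_body_typed(LLVMJITModuleLoadBinary);
```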
The Code Loading Process
For the C++ runtime, the code loading process is explicitly defined to be:
1. Partition the Artifacts contained in the ArtifactSet into groups according to the loader attribute.
2. Create a top-level MetadataModule from the Artifact with loader="metadata".
3. Create a “host” LibraryModule implementing the target_host functions, which, for this initial RFC, are identified as Artifact with loader="native".
   1. If present, artifacts with loader="native" are linked into a DSO, native.so, and loaded with LibraryModule.
   2. If no artifacts with loader="native" are present, this process assumes load_module is called from a DSO. Create a LibraryModule that wraps the DSO and use this in place of the native.so module from step 1.
4. Iterate over the other Artifact groups in order sorted by loader name, running the loader function defined as runtime.module.loadbinary_<loader>. This function produces a Module tree. Import the Module tree into the host LibraryModule (see the sketch after this list).
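Putting the steps together, a minimal sketch of the loader. The ArtifactSet helpers (ByLoader, Loaders) and the CreateMetadataModule/LinkAndLoadNative routines are assumed names; only the runtime.module.loadbinary_<loader> registry convention is taken from the text above:

```cpp
#include <tvm/runtime/module.h>
#include <tvm/runtime/registry.h>

// Hypothetical top-level load routine implementing steps 1-4.
tvm::runtime::Module LoadArtifacts(const ArtifactSet& artifacts) {
  using tvm::runtime::Module;
  using tvm::runtime::Registry;
  // Step 2: the metadata artifact becomes the root MetadataModule.
  Module root = CreateMetadataModule(artifacts.ByLoader("metadata"));  // assumed helper
  // Step 3: link loader="native" artifacts into native.so and wrap in a
  // LibraryModule (or wrap the current DSO if no native artifacts exist).
  Module host = LinkAndLoadNative(artifacts.ByLoader("native"));       // assumed helper
  // Step 4: run each remaining loader group through its registered loader
  // and import the resulting Module tree into the host LibraryModule.
  for (const std::string& loader : artifacts.Loaders()) {              // assumed helper
    if (loader == "metadata" || loader == "native") continue;
    const auto* f = Registry::Get("runtime.module.loadbinary_" + loader);
    host.Import((*f)(artifacts.ByLoader(loader)));  // schematic call
  }
  root.Import(host);
  return root;
}
```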
Changes to export_library
export_library will be changed as follows:
- The target_host link step is already performed; no changes here other than for docs.
- PackImportsTo* will accept a binary blob produced by concatenating Artifact#export together.
Inspecting Artifacts
Exporting Artifact, and load_module Changes
Speaking of the export process, here is what will change:
- The DSO-Exportable portion will be largely the same: any Artifact with loader == "dso" will be written to its file_name and become a part of the link process.
- The Metadata part will change a bit, though really not much is actually changing (yet):
  - Artifact from the same codegen with the same loader will be grouped together in __tvm_dev_mblob. No tree will exist. Each field of Artifact is written to the file (a serialization sketch follows this list).
  - At load time, each group of Artifact in __tvm_dev_mblob will be reconstructed into Artifact instances. The group of Artifact will be passed to loadbinary_<loader>, which will return a Module.
  - Returned Module will be imported into the MetadataModule.
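The per-field write could be as simple as the following sketch, assuming a length-prefixed dmlc::Stream encoding (the text above only states that each field of Artifact is written to the file):

```cpp
#include <dmlc/io.h>

// Hypothetical encoding of one Artifact inside a __tvm_dev_mblob group;
// dmlc::Stream length-prefixes each string. The matching loadbinary_<loader>
// side reads the fields back in the same order.
void SaveArtifact(dmlc::Stream* strm, const Artifact& a) {
  strm->Write(a.codegen_id);
  strm->Write(a.loader);
  strm->Write(a.file_name);
  strm->Write(a.content);
}
```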
Future Directions
This is the first of two changes which enable BYOC with the C runtime to fit into Model Library Format. In part 2, we will:
- Introduce target_key, a shorthand that identifies a sub-target. Multiple target_key may have the same sub-target string but represent distinct targets. For instance, there could be two identical FPGAs which can be programmed differently to accelerate various workloads.
- Add target_key to Artifact.
- Further group Artifact by target_key at export and load time. In Model Library Format, prefix filenames with target_key so that e.g. FPGA bitfiles can be distinguished.
- Modify the runtime API to specify TVMDevice as target_key: List[TVMDevice] (see the sketch below). You can always have multiple device instances that implement the target_key design.
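For concreteness, the part-2 device table might look like this (purely illustrative; the keys and device kinds are made up):

```cpp
#include <dlpack/dlpack.h>
#include <map>
#include <string>
#include <vector>

// Hypothetical part-2 device table: each target_key names a sub-target
// design and maps to however many physical instances implement it.
std::map<std::string, std::vector<DLDevice>> devices = {
    {"fpga-accel", {DLDevice{kDLExtDev, 0}, DLDevice{kDLExtDev, 1}}},
    {"cpu", {DLDevice{kDLCPU, 0}}},
};
```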