Introduce Artifact, a container for generated code

areusch · September 22, 2021, 5:18pm

Posting up this pre-RFC for comments. For context it was written a few months back so apologies if things are slightly out of date.

Background

TVM’s high-level build function tvm.relay.build currently returns a runtime::Module instance meant to be “ready-to-run.” This means that tvm.relay.build (and its backends) is responsible for code generation, code serialization, and code loading. As these processes become more complex, runtime::Module must be overloaded to contain:

Generated kernels to implement TIR tasks
Metadata useful when loading and running the models (e.g. type of accelerator to target)
When Module contains source code:
- Source code for any downstream compilers
- Metadata to configure any downstream compilers e.g. gcc, cuda, etc

In some cases, things are simple enough that we haven’t observed many ill effects from this overload. In extreme cases, such as when targeting exotic runtime scenarios e.g. microTVM, this overloaded structure makes it difficult to expand the compiler while staying within the “ready-to-run” output expected of tvm.relay.build.

What is `Artifact`?

Artifact is a new TVM Object subclass that replaces runtime::Module as the return value from TVM codegen, including BYOC. To bridge the gap between Artifact and Module, this RFC proposes a new load_artifact process which makes the code loading process explicit in the TVM C++ runtime. That is to say, tvm.relay.build will stop returning runtime::Module and start returning ArtifactSet (i.e. a collection of Artifact) instead. Artifact is proposed to be defined as follows:

class Artifact : public ::tvm::runtime::Object {
 public:
  // Identifies the codegen that produced this artifact.
  std::string codegen_id;

  // Identifies the loader to use when loading this artifact.
  std::string loader;

  // A file name unique within codegen_id.
  std::string file_name;

  // Binary content of this Artifact.
  std::string content;
};
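
To make the shape of the data concrete, here is a minimal Python sketch that mirrors the four proposed fields. The `ArtifactSet` layout, its `add` and `by_loader` helpers, and the sample loader names are assumptions made for illustration; the RFC only says that ArtifactSet is a collection of Artifact.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Artifact:
    """Illustrative Python mirror of the proposed C++ Artifact."""
    codegen_id: str  # identifies the codegen that produced this artifact
    loader: str      # identifies the loader to use at load time
    file_name: str   # unique within codegen_id
    content: bytes   # binary content


@dataclass
class ArtifactSet:
    """Hypothetical container; the RFC does not define its layout."""
    artifacts: List[Artifact] = field(default_factory=list)

    def add(self, artifact: Artifact) -> None:
        # Enforce the "file_name is unique within codegen_id" rule.
        key = (artifact.codegen_id, artifact.file_name)
        if any((a.codegen_id, a.file_name) == key for a in self.artifacts):
            raise ValueError(f"duplicate artifact name: {key}")
        self.artifacts.append(artifact)

    def by_loader(self) -> Dict[str, List[Artifact]]:
        # Group artifacts by the loader that should consume them.
        groups: Dict[str, List[Artifact]] = {}
        for a in self.artifacts:
            groups.setdefault(a.loader, []).append(a)
        return groups


artifacts = ArtifactSet()
artifacts.add(Artifact("llvm", "native", "lib0.o", b"\x7fELF"))
artifacts.add(Artifact("cuda", "cuda", "kernels.cu", b"// cuda source"))
groups = artifacts.by_loader()
```

Because every piece carries its own name and loader, a caller can refer to "the cuda/kernels.cu artifact" without knowing which Module it would eventually be loaded into.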

Why do we need `Artifact`?

There are several different ways to motivate Artifact. Given the scale of this refactor, you might not immediately choose to do this based on any one of them. However, when taken together, my opinion is that these issues stem from an anti-pattern developing in the TVM compiler data structures.

Defining “Code Loading”

One motivator of Artifact is microTVM and the C runtime, where TVM is not necessarily producing binary code. Even when the llvm backend is used (producing a .o) the microTVM workflow is that the user must “load” the generated code by compiling the .o into a firmware binary image. The question is: what is that “load” process?

What we want to describe is the equivalent of tvm.runtime.load_module(mod.export_library()), except that the load_module call is done at firmware compile time (targeting a very tiny µC with no RAM). It’s particularly hard to explain this for a few reasons:

TVM actually has two code-loading processes today in the normal Python-based C++ TVM runtime. When you load_module(mod.export_library()) today, you get something very different from what you got from tvm.relay.build, because load_module uses a different code path to construct Modules than is necessarily used in codegen. Neither of these processes makes sense in a world where load_module is handled at firmware compile time.
export_library is designed to produce a .so for the C++ runtime. The .so contains a bunch of pieces (a few per codegen), and none of these pieces are named outside of the Module type_key. At present, we produce at least 2 c and 1-2 llvm modules already. It’s difficult to explain “move this Module to here” or “run a downstream translator on this Module” when they don’t have names.

Composite Targets

In some cases, code loading is straightforward (e.g. for llvm backend, link directly into binary). There are plenty of others, particularly with BYOC, where this is not true:

A system with two CPUs, a low-power simple CPU and high-power DSP. llvm must be configured twice and the output sorted into the different code memories for each CPU.
A system with many reconfigurable accelerators e.g. FPGAs or programmable DSP. Each accelerator instance would correspond to a DLDevice, but among those accelerators, configuration differences could complicate the code loading process.

None of these cases are a primary use case of TVM today, but the lack of metadata on TVM’s codegen outputs is a key obstacle to targeting systems such as these. And, none of these examples are rare or particularly strange designs.

Debugging

Modules that are saved using SaveToBinary (e.g. type_key ≠ llvm or c) each implement their own serialization format. When a codegen produces multiple artifacts (e.g. ROCm, CUDA, Vitis-AI), the pattern has been to return a single Module containing all the artifacts and concatenate them. This is very difficult for a user to debug from outside TVM.

TVM could provide a standard facility to write the generated code to disk in a human-readable way, but this is hard because there is no metadata attached to each individual piece of the binary. This identifying metadata is the same metadata that users need to consume each piece separately, i.e. when doing the code loading yourself.
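
As a rough illustration of what such a facility could look like once each piece carries a codegen_id and file_name, the sketch below writes every artifact to its own file. The `<codegen_id>/<file_name>` directory layout and the `dump_artifacts` helper are assumptions for this example, not something the RFC specifies.

```python
import tempfile
from pathlib import Path
from typing import List, NamedTuple


class Artifact(NamedTuple):
    codegen_id: str
    loader: str
    file_name: str
    content: bytes


def dump_artifacts(artifacts: List[Artifact], out_dir: Path) -> List[Path]:
    """Write each artifact to <out_dir>/<codegen_id>/<file_name>."""
    written = []
    for a in artifacts:
        path = out_dir / a.codegen_id / a.file_name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(a.content)
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    paths = dump_artifacts(
        [
            Artifact("cuda", "cuda", "kernels.cu", b"// cuda source\n"),
            Artifact("cuda", "cuda", "fmap.json", b"{}"),
        ],
        Path(tmp),
    )
    # Paths relative to the output directory, for display.
    relative = sorted(p.relative_to(tmp).as_posix() for p in paths)
```

No per-Module serialization code is involved here: the metadata on each artifact is enough to decide where its bytes go.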

`Module` re-use from codegen

Codegen must return generated code in a Module. The rules around which Module to use are confusing. When viewed from a runtime perspective, it seems perfectly clear that a Module implementation could be re-used whenever generated functions are executed the same way. But, when viewed from the codegen perspective, it’s unclear why, for example, a CSourceModule should not be usable by any codegen producing C, particularly when it runs on a DLDevice other than the target_host.

The root issue is runtime.Module's dual roles as code container and runtime interface. Its metadata is limited to a single field type_key, which is essentially used during load_module to decide how to produce a Module instance from the on-disk representation. It could be possible to add more metadata fields, but because of the dual roles of Module, it may overcomplicate implementations which essentially sit at the cross-product of (output_format, runtime_method).

Code Loading Today—Two Paths to a GraphExecutor

This section further explores the two different codepaths TVM uses to load kernel code.

Steps to `tvm.relay.build()`

  +----------+    +------------+   export_library()
  | TIR Task | -> |     Module |---------------|
  +----------+    +------------+               ↓
                        ↑                +-------------+
                        +----------------|    lib.so   |
                          load_module()  +-------------+

Right now, tvm.relay.build does the following (roughly—don’t hold me to this 100%):

Relay Scheduling. Each Relay operator is implemented into TE with a template schedule.
Optimization. TE is optimized, converted to TIR, and optimized again. e.g. operator fusion. A set of TIR tasks are produced (a task is one group of fused operators).
Graph Memory Planning. TIR Task inputs and outputs are assigned to buffers
Code Generation. TIR Tasks are passed to a code-generator PackedFunc named target.build.<kind>.
1. IR Transformation. The code-generator walks TIR and emits source code (c, cuda, etc backends) or another IR (llvm backend)
2. Compilation. In most cases, a model is compiled to bytecode. In some cases (e.g. c, cuda), compilation is skipped and done either at load time or when Module#GetFunction is called.
3. Module construction. A runtime::Module is created to hold the compiled artifact.

Output of `tvm.relay.build()`

The output of this process is a tree of Module. There are two possible topologies for this tree (* indicates the actual Module returned from tvm.relay.build):

Topology 1: With only DSO-Exportable Module (type_key in ("c", "llvm")):

                                   +----------------------------------+
                                   | BYOC output 2 (llvm or c module) |
                                   +----------------------------------+
                                                   ^
                                                   | (imports)
                                                   |
  +--------------------+  (imports)  +--------------------------------+
  | * llvm or c output |------------>| BYOC output (llvm or c module) |
  +--------------------+             +--------------------------------+

Topology 2: With some non-DSO-Exportable Module:

                                   +----------------------------------+
                                   | BYOC output 2 (e.g. llvm module) |
                                   +----------------------------------+
                                                   ^
                                                   | (imports)
                                                   |
  +------------------+  (imports)  +--------------------------------+
  | llvm or c output |             | BYOC output (e.g. cuda module) |
  +------------------+             +--------------------------------+
                 |                          |
             +---------------------------------------+
             | * Metadata Module (type_key=metadata) |
             +---------------------------------------+

(Suppose in this toy example that CUDA produces some code that runs on CUDA device and some that runs on the target_host CPU)

How the output is consumed

Now you can do two things:

Run inference straight away, by instantiating GraphExecutor. GraphExecutor uses precisely this Module structure.
Export the library and reload it into a later instance. In that case, you actually do the following:
1. Build a shared library:
  1. Reorganize the tree into DSO-Exportable and non-DSO-Exportable modules:
```
        DSO-Exportable modules
+ - - - - - - - - - - - - - - - - - - - +
  +----------------------------------+   |
| | BYOC output 2 (e.g. llvm module) |
  +----------------------------------+   |
|            ^                           
             | (imports)     + - - - - - +
|            |               |
   +------------------+  (imports)  +--------------------------------+
|  | llvm or c output |      |      | BYOC output (e.g. cuda module) |
   +------------------+             +--------------------------------+
+ - - - - - - - - -|- - - -  +              |
             +---------------------------------------+
             | * Metadata Module (type_key=metadata) |
             +---------------------------------------+
```
  2. Write each DSO-Exportable module to disk as e.g. libN.o
  3. Call PackImportsToLLVM, which serializes the remainder of the tree by calling Module::SaveToBinary on each non-DSO-Exportable module and then writes the resulting blob to devc.o
  4. Link libN.o and devc.o into a shared library .so.
2. Load the shared library
  1. dlopen the shared library to attach it to the TVM process. Place it inside LibraryModule
  2. Look for a special symbol __tvm_dev_mblob, which was inside devc.o. If it exists, use ProcessModuleBlob to reconstruct the non-DSO-Exportable tree. Each Module is reconstructed using the PackedFunc runtime.loadbinary_<type_key>.
  3. You are left with a new Module tree:
```
   +------------------+  (imports)  +--------------------------------+
   | LibraryModule    |             | BYOC output (e.g. cuda module) |
   +------------------+             +--------------------------------+
                   |                       |
             +---------------------------------------+
             | * Metadata Module (type_key=metadata) |
             +---------------------------------------+
```
Why is this bad?
1. When the .so is built, the DSO-Exportable modules are linked together. Any weak symbols, extern symbols, etc may get resolved into a different DSO-Exportable module. It’s really hard to test that this may never happen in a bad way.
2. Any loadbinary function has to behave exactly inversely to the SaveToBinary function which called it. It’s really hard to test this in all cases.
3. Because of these, inference may run differently between when a Module is first generated and when it’s deployed later on. Also, this is part of the process that we need to convey to microTVM developers. It’s incredibly complex and, in the µTVM case, impossible to avoid the side effects of (1).
Proposed Changes

Broadly
1. TVM codegens (i.e. the builtin plus any BYOC relay.ext.) will produce Artifact, not Module.
2. tvm.relay.build will return ArtifactSet in place of Module.
3. Define functions to store and load Artifact
4. Rework export_library to use Artifact and around the load format discussed in the next bullet point.
5. Define an explicit load process that converts ArtifactSet to Module. All code loading will be done in this way.
6. When you instantiate a GraphExecutor from GraphExecutorFactory, run the explicit code loading process to link a DSO, produce the Module tree.
  1. Exception: when intending to use LLVM JIT, you can specify a new target llvmjit. When building against this target, you cannot export_library and you can only construct GraphExecutor in memory. This may be useful when TVM is e.g. a PyTorch in-memory backend.
The target_host Link Step

Some Artifact (llvm and c) contain code that should be executed directly by the same CPU used to operate the GraphExecutor. The main functional change this RFC proposes is:

All exportable llvm and c Artifact executed directly by the target_host CPU need to be first linked into an .so before being loaded.

This step ensures that:
1. The compiler artifact can be written to disk and reloaded without changing it; specifically, any linker side effects have occurred before the user tests the compiler artifacts.
2. Our unit tests actually test what we could deploy, instead of an in-memory representation.
Handling LLVM JIT

Requiring all llvm Modules be linked before being executed could unnecessarily penalize the case where TVM serves as a backend to other frameworks. In this case, TVM is expected to compile and immediately run a function. Any export is done for debug purposes only—not because TVM’s export format is being used to restore the artifact for execution later on.

To handle this case, a new target llvmjit will be introduced. llvmjit produces a special Artifact which also retains an in-memory representation, which, during code loading, can be directly transferred to a Module. This Artifact can still be saved to disk, but is loaded through loadbinary_llvmjit, which reconstructs the in-memory representation through LLVM bitcode.
- Artifact#export and Artifact#load will directly translate between binary blob and Artifact with no processing done.
- ArtifactSet#export_library will behave similarly to before, but:
  - the “link” step will be made explicit in docs
- tvm.runtime.load_module will behave similarly to before, but ProcessModuleBlob will perform the Code Loading process identified below.
- GraphExecutorFactory#export_model_library_format (not to be implemented in this RFC) would store ArtifactSet in a directory tree directly.
The Code Loading Process

For the C++ runtime, the code loading process is explicitly defined to be:
1. Partition the Artifacts contained in the ArtifactSet into groups according to the loader attribute.
2. Create a top-level MetadataModule from the Artifact with loader="metadata".
3. Create a “host” LibraryModule implementing the target_host functions, which, for this initial RFC, are identified as Artifact with loader="native".
  1. If present, artifacts with loader="native" are linked into a DSO, native.so, and loaded with LibraryModule.
  2. If no artifacts with loader="native" are present, this process assumes load_module is called from a DSO. Create a LibraryModule that wraps the DSO and use this in place of the native.so module from step 1.
4. Iterate over the other Artifact groups in order sorted by loader name, running the loader function defined as runtime.module.loadbinary_. This function produces a Module tree. Import the Module tree into the host LibraryModule.
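
The four steps above can be sketched in Python as follows. This is a toy model under stated assumptions: `Module` is a stand-in for runtime::Module, the `LOADERS` table stands in for the per-loader function lookup, the "ethosu" loader name is hypothetical, no real linking happens, and attaching the host module under the metadata module is inferred from the Module trees shown earlier rather than stated in the steps.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Artifact(NamedTuple):
    codegen_id: str
    loader: str
    file_name: str
    content: bytes


class Module:
    """Toy stand-in for runtime::Module: a label plus imported modules."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.imports: List["Module"] = []

    def import_module(self, other: "Module") -> None:
        self.imports.append(other)


def _toy_loader(kind: str) -> Callable[[List[Artifact]], Module]:
    # Hypothetical loader: a real one would parse the artifact contents.
    return lambda group: Module(
        f"{kind}({', '.join(a.file_name for a in group)})"
    )


# Stand-in for looking up a loader function by loader name.
LOADERS: Dict[str, Callable[[List[Artifact]], Module]] = {
    "cuda": _toy_loader("cuda"),
    "ethosu": _toy_loader("ethosu"),
}


def load_artifacts(artifacts: List[Artifact]) -> Module:
    # Step 1: partition artifacts by their loader attribute.
    groups: Dict[str, List[Artifact]] = defaultdict(list)
    for artifact in artifacts:
        groups[artifact.loader].append(artifact)

    # Step 2: top-level metadata module (its contents are ignored here).
    groups.pop("metadata", None)
    root = Module("metadata")

    # Step 3: host library module; linking native.so is only simulated.
    native = groups.pop("native", [])
    host = Module("library(native.so)" if native else "library(enclosing DSO)")
    root.import_module(host)  # assumption: mirrors the tree shown earlier

    # Step 4: remaining groups, in order sorted by loader name.
    for loader in sorted(groups):
        host.import_module(LOADERS[loader](groups[loader]))
    return root


root = load_artifacts([
    Artifact("tvm", "metadata", "metadata.bin", b""),
    Artifact("llvm", "native", "lib0.o", b""),
    Artifact("ethosu", "ethosu", "cmd.bin", b""),
    Artifact("cuda", "cuda", "kernels.ptx", b""),
    Artifact("cuda", "cuda", "fmap.json", b"{}"),
])
```

Sorting by loader name makes the import order deterministic regardless of the order in which codegens emitted their artifacts.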
Changes to export_library

export_library will be changed as follows:
- The target_host link step is already performed—no changes here other than for docs
- PackImportsTo* will accept a binary blob produced by concatenating Artifact#export together.
Inspecting Artifacts

Exporting Artifact, and load_module Changes

Speaking of the export process, here is what will change:
1. The DSO-Exportable portion will be largely the same: any Artifact with loader == "dso" will be written to its filename and become a part of the link process.
2. The Metadata part will change a bit, though really not much is actually changing (yet):
  1. Artifact from the same codegen with the same loader will be grouped together in __tvm_dev_mblob. No tree will exist. Each field of Artifact is written to the file.
  2. At load time, each group of Artifact in __tvm_dev_mblob will be reconstructed into Artifact instances. The group of Artifact will be passed to loadbinary_<loader>, which will return Module.
  3. Returned Module will be imported into the MetadataModule.
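
A minimal sketch of the "each field of Artifact is written to the file" idea follows. The byte layout (little-endian 64-bit counts and length prefixes) and the `pack_artifacts` / `unpack_artifacts` names are assumptions for illustration; they are not the actual __tvm_dev_mblob format.

```python
import io
import struct
from typing import BinaryIO, List, NamedTuple


class Artifact(NamedTuple):
    codegen_id: str
    loader: str
    file_name: str
    content: bytes


def _write_bytes(out: BinaryIO, data: bytes) -> None:
    # Length-prefixed field: 8-byte little-endian size, then the payload.
    out.write(struct.pack("<Q", len(data)))
    out.write(data)


def _read_bytes(inp: BinaryIO) -> bytes:
    (size,) = struct.unpack("<Q", inp.read(8))
    return inp.read(size)


def pack_artifacts(artifacts: List[Artifact]) -> bytes:
    """Serialize a flat list of artifacts; no Module tree is encoded."""
    out = io.BytesIO()
    out.write(struct.pack("<Q", len(artifacts)))
    for a in artifacts:
        for text in (a.codegen_id, a.loader, a.file_name):
            _write_bytes(out, text.encode("utf-8"))
        _write_bytes(out, a.content)
    return out.getvalue()


def unpack_artifacts(blob: bytes) -> List[Artifact]:
    """Inverse of pack_artifacts: rebuild Artifact instances from the blob."""
    inp = io.BytesIO(blob)
    (count,) = struct.unpack("<Q", inp.read(8))
    result = []
    for _ in range(count):
        codegen_id = _read_bytes(inp).decode("utf-8")
        loader = _read_bytes(inp).decode("utf-8")
        file_name = _read_bytes(inp).decode("utf-8")
        content = _read_bytes(inp)
        result.append(Artifact(codegen_id, loader, file_name, content))
    return result


original = [
    Artifact("cuda", "cuda", "kernels.ptx", b"\x00\x01binary"),
    Artifact("cuda", "cuda", "fmap.json", b"{}"),
]
restored = unpack_artifacts(pack_artifacts(original))
```

With one generic format for all artifacts, the save/load round trip only needs to be tested once, instead of once per Module type.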
Future Directions

This is the first of two changes which enable BYOC with the C runtime to fit into Model Library Format. In part 2, we will:
1. Introduce target_key, a shorthand that identifies a sub-target. Multiple target_key may have the same sub-target string, but represent distinct targets. For instance, there could be two identical FPGAs which can be programmed differently to accelerate various workloads.
2. Add target_key to Artifact.
3. Further group Artifact by target_key at export and load time. In Model Library Format, prefix filenames with target_key so e.g. FPGA bitfiles can be distinguished.
4. Modify the runtime API to specify TVMDevice as: target_key: List[TVMDevice]. You can always have more device instances that implement the target_key design.

areusch · September 22, 2021, 5:21pm

cc @tqchen @jroesch @junrushao @mbs-octoml @comaniac @ramana-arm @manupa-arm @mjs @mousius @kparzysz @leandron

tqchen · September 22, 2021, 8:59pm

also cc @zhiics @yzhliu @Laurawly @giuseros

tqchen · September 22, 2021, 9:37pm

Trying to capture some of the past discussions. I think it is useful to introduce an Artifact as an intermediate stage between compile and runtime.

There are two possible design choices in general

C0: Artifact as the plain data. This is the choice being proposed. The advantage being making the memory.
C1: Artifact as an abstract interface. The advantage being offering a bit more flexibility, especially for things that would require in-memory representation.

To give a rough demonstration of what a C1 type interface might look like:

class Artifact {
  public:
     virtual runtime::Module JIT() = 0;
     virtual void Export(String name) = 0;
     virtual String type_key() const = 0;
};

Note that not every kind of artifact could support every method (e.g. LLVMJIT is not exportable). The main advantage of having an abstract class is that it will allow us to enable in-memory representations that are not necessarily serializable. For example, we could decide to add a function that calls into a runtime PackedFunc which is backed by a Python function, and there is no effective way to do such serialization.

From the API’s pov, the overall flow becomes

artifacts = build(...)
runtime_mod = artifacts.jit()
artifacts.export_library("xyz.so")
runtime_mod = tvm.runtime.load_module("xyz.so")

areusch · September 22, 2021, 10:33pm

@tqchen a challenge here is that currently many runtime::Module maintain multiple Artifact internally which, for debugging purposes, belong in 3 separate files. Consider CUDAModule as an example:

class CUDAModuleNode : public runtime::ModuleNode {
  // the binary data
  std::string data_;
  // The format
  std::string fmt_;
  // function information table.
  std::unordered_map<std::string, FunctionInfo> fmap_;

This is typical of most non-DSO-exportable runtime::Module. In this case, data_ contains CUDA source. If someone wanted to inspect that today, they would have to manually deserialize the binary format created by SaveToBinary:

  void SaveToBinary(dmlc::Stream* stream) final {
    stream->Write(fmt_);
    stream->Write(fmap_);
    stream->Write(data_);
  }

This is quite painful. Artifact proposes we split the module into 3 Artifacts with the same cuda loader. Then, they can be serialized/deserialized using a mechanism which is orthogonal to the loading process.

In doing this, placing the JIT call on Artifact doesn’t make sense, as multiple such Artifact may be required to produce a runtime::Module.
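
To illustrate the split, here is a small sketch that emits one artifact per field that SaveToBinary concatenates today, and a toy loader that needs the whole group to rebuild the module state. The file names, the JSON encoding of the function table, and the `split_cuda_module` / `load_cuda` helpers are hypothetical.

```python
import json
from typing import Dict, List, NamedTuple


class Artifact(NamedTuple):
    codegen_id: str
    loader: str
    file_name: str
    content: bytes


def split_cuda_module(fmt: str, fmap: Dict[str, dict], data: bytes) -> List[Artifact]:
    """Emit three artifacts, all handled by the same "cuda" loader."""
    return [
        Artifact("cuda", "cuda", "format.txt", fmt.encode("utf-8")),
        Artifact("cuda", "cuda", "fmap.json",
                 json.dumps(fmap, sort_keys=True).encode("utf-8")),
        Artifact("cuda", "cuda", "kernels." + fmt, data),
    ]


def load_cuda(group: List[Artifact]) -> Dict[str, object]:
    """Toy loader: no single artifact is enough to build the module."""
    by_name = {a.file_name: a.content for a in group}
    fmt = by_name["format.txt"].decode("utf-8")
    return {
        "fmt": fmt,
        "fmap": json.loads(by_name["fmap.json"]),
        "data": by_name["kernels." + fmt],
    }


group = split_cuda_module(
    "ptx", {"add": {"arg_types": ["handle", "handle"]}}, b"// ptx ..."
)
state = load_cuda(group)
```

Each of the three files can be opened with ordinary tools, while the loader still receives them together.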

Consider also the case of BYOC which may produce an executor-side launcher function using llvm or c backend plus some additional source code or binary data (e.g. considering the case of Ethos-U, the “command stream”) meant to be loaded and interpreted by a PE other than the CPU. In this case, two different loading processes are needed: one to load the executor-side code, and one to load and configure the accelerator.

The current C++ runtime approach works around this by pre-linking the accelerator driver into libtvm.so in the form of e.g. CUDAModule. In the static deployment case, it may be necessary to model this loading process or at least indicate to the user which aspects of the BYOC codegen belong to the same domain as the executor and which should be supplied to the accelerator loader.

“The main advantage of having an abstract class is that it will allow us to enable in-memory representations that are not necessarily serializable”

Could you give an example? I think we discussed two cases before:

llvm, when we only want to JIT. In this case LLVM bitcode is codegen’d and the LLVM APIs are invoked to compile that code to machine code. This bypasses the typical link process and is therefore faster. Arguably there is a concern that the link process may be separate, but depending on the actual bitcode generated (e.g. if there are no external references), it may be ok. To support this, we introduced llvmjit. However, I didn’t update the doc above in light of that conversation to properly describe that.
The PyTorch backend. In this case, a Python function is produced by the BYOC codegen and such a function is never serializable. You are right that we cannot handle this case with the current Artifact proposal.

Perhaps then we need to introduce another class InMemoryArtifact. If a codegen produces InMemoryArtifact, the entire result of compilation cannot be serialized.

class InMemoryArtifact {
  public:
   // Identifies the codegen that produced this artifact.
   std::string codegen_id;

   // A file name unique within codegen_id.
   std::string file_name;

   virtual runtime::Module Load() = 0;
};

During the load process, InMemoryArtifact is considered to be its own Loader and the load is accomplished by calling Load. The resulting runtime::Module is inserted into the tree as would the result of any other load. No additional arguments or Artifacts can be supplied to Load().
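
A rough Python analogue of this idea, using the PyTorch-style case of a wrapped Python callable, might look like the following. The `Module` class is a toy stand-in for runtime::Module, and `PythonFunctionArtifact` is a hypothetical subclass invented for this example.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class Module:
    """Toy stand-in for runtime::Module: a table of named functions."""

    def __init__(self, functions: Dict[str, Callable]) -> None:
        self._functions = functions

    def get_function(self, name: str) -> Callable:
        return self._functions[name]


class InMemoryArtifact(ABC):
    """Artifact that acts as its own loader and cannot be serialized."""

    def __init__(self, codegen_id: str, file_name: str) -> None:
        self.codegen_id = codegen_id
        self.file_name = file_name

    @abstractmethod
    def load(self) -> Module:
        """Produce the Module; takes no extra arguments or artifacts."""


class PythonFunctionArtifact(InMemoryArtifact):
    """Hypothetical artifact wrapping a Python callable."""

    def __init__(self, codegen_id: str, file_name: str, func: Callable) -> None:
        super().__init__(codegen_id, file_name)
        self._func = func

    def load(self) -> Module:
        # Expose the wrapped callable under the artifact's name.
        return Module({self.file_name: self._func})


artifact = PythonFunctionArtifact("pytorch", "add_one", lambda x: x + 1)
module = artifact.load()
result = module.get_function("add_one")(41)
```

The identifying fields stay the same as on Artifact, so such an object can still be named and listed even though its content never reaches disk.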

Debuggability/Inspection

I believe we also discussed introspection on Artifacts e.g. how should someone debug a generated Artifact? I have actually not resolved this question. Here is motivation and perhaps the community has ideas:

Suppose someone generates JSON. It should be easy to pretty-print said JSON.
Suppose someone generates LLVM bitcode. It should be easy to pretty-print either the .ll or native assembly, no matter which backend codegen’d it.
Suppose someone generates a binary format of their own design and wants to contribute a visualizer for debug purposes.

It seems like such an “inspector” should be able to be written in either C++ or a frontend language. For example, JSON pretty-printing is a one-liner in Python but a much more complicated proposition in C++. This sort of motivates creating yet another PackedFunc prefix table…

tvm.artifact.inspect_json_prettyprint → inspect artifacts whose filename ends with “.json” by pretty-printing. Signature (Artifact) -> str.
tvm.artifact.inspect_pytorch_repr → inspect pytorch InMemoryArtifact using repr. Signature: (InMemoryArtifact) -> str
tvm.artifact.inspect_mycodegenid_bin_default → inspect artifacts whose file_name ends with .bin created by codegen_id using default inspection method. Signature: (Artifact) -> str

I don’t love this tbh, so far just using it to think through the possibilities.
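
To think it through a bit further, here is a minimal sketch of such a prefix table in Python. The plain dict registry, the `register_inspector` decorator, and the lookup order in `find_inspector` (codegen-specific name first, then a suffix-wide pretty-printer) are assumptions for illustration, not an existing TVM API.

```python
import json
from typing import Callable, Dict, NamedTuple, Optional


class Artifact(NamedTuple):
    codegen_id: str
    loader: str
    file_name: str
    content: bytes


Inspector = Callable[[Artifact], str]
INSPECTORS: Dict[str, Inspector] = {}  # stand-in for a PackedFunc prefix table


def register_inspector(name: str) -> Callable[[Inspector], Inspector]:
    def decorator(func: Inspector) -> Inspector:
        INSPECTORS[name] = func
        return func
    return decorator


@register_inspector("tvm.artifact.inspect_json_prettyprint")
def inspect_json(artifact: Artifact) -> str:
    # JSON pretty-printing really is close to a one-liner in Python.
    return json.dumps(json.loads(artifact.content), indent=2, sort_keys=True)


def find_inspector(artifact: Artifact) -> Optional[Inspector]:
    suffix = artifact.file_name.rsplit(".", 1)[-1]
    candidates = (
        f"tvm.artifact.inspect_{artifact.codegen_id}_{suffix}_default",
        f"tvm.artifact.inspect_{suffix}_prettyprint",
    )
    for name in candidates:
        if name in INSPECTORS:
            return INSPECTORS[name]
    return None


graph = Artifact("mycodegenid", "json", "graph.json", b'{"nodes": []}')
inspector = find_inspector(graph)
pretty = inspector(graph) if inspector else ""
```

A codegen that wants a custom view of its own .bin files would register under its codegen_id and take precedence over any generic inspector.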

Also it may be true that a loader or codegen could define a top-level inspector which consumes all the artifacts generated by a particular codegen or loaded by a particular loader.

tvm.codegen.inspect_llvm_list-functions - list all functions produced by LLVM codegen in this ArtifactSet
tvm.loader.inspect_cuda_print-signatures - list all function signatures loaded with CUDA loader in this ArtifactSet

I would welcome community input on the whole proposal and also on these points.
