Model Library Format
Background
TVM’s build process for imported models centers around `tvm.relay.build`, a function which produces a 3-tuple `(graph_json, lib, params)`. The inference workflow then diverges depending on how the user wants to use the compiled artifacts:
- If the build targets the c++ runtime and uses the `llvm` backend…
  - and the user wants to run in the same Python instance used to compile: the user can directly instantiate a GraphRuntime instance.
  - and the user wants to transfer the model to another Python runtime instance without cross-compiling: the user can call `lib.export_library()`, and store `graph_json` and `params` in some ad-hoc way. Then, `tvm.runtime.load_module()` can recreate `lib` in the new runtime instance.
  - and the user wants to transfer the model to another Python runtime instance with cross-compiling: the same procedure as above, but pass `fcompile` to `export_library` to specify the cross-compiler.
- If the build targets the c++ runtime and uses the `c` backend…
  - and the user wants to run the model with Python on a similar architecture: the user must compile the produced `c` files to produce an artifact similar to the one produced by `lib.export_library()`. Then, they can load and run the library following the procedure above. When saving and loading from the same instance (so `graph_json` and `params` are not a consideration), this process is handled invisibly by `loadfile_tar`.
  - and the user wants to run the model with Python on a different architecture: the same procedure as above, but with a cross-compiler.
  - and the user wants to run the model with a different frontend language: the same procedure as above, but the user must translate `graph_json` and `params` to a format suitable for the other language.
- If the build targets the c runtime…
  - and the user wants to run the model with TVM in Python: not supported; Python supports the C++ runtime only.
  - and the user wants to run standalone: compile with `--system-lib`, store the library in a `.tar` with `export_library()`, store `params` and `graph_json` to disk in an ad-hoc way, unpack the tar, and integrate all pieces into a standalone project. A small `main` is needed to launch the C runtime, load the model and parameters, and run inference. See `apps/bundle_deploy`.
In all cases except the first (compile and run in the same TVM instance), the user needs to serialize the `tvm.relay.build` 3-tuple before doing anything else. However, TVM provides no common function to handle this; it only directly handles serializing the compiled library. The user is left to store the parameters and runtime configuration (e.g. `graph_json`) in whatever way suits the task at hand. This discrepancy means that any automation which consumes TVM artifacts from disk must be hand-written and specific to the situation.
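For concreteness, here is a hedged sketch of that ad-hoc workflow for the c++ runtime with the `llvm` backend, following the 3-tuple description above. The file names are illustrative, and `mod` and `params` are placeholders assumed to come from a frontend importer.

```python
import tvm
from tvm import relay

# `mod` and `params` are assumed to come from a frontend importer
# (e.g. a Relay frontend); they are placeholders here.
graph_json, lib, params = relay.build(mod, target="llvm", params=params)

# The compiled library has a standard serialization path...
lib.export_library("model.so")  # pass fcompile=... to cross-compile

# ...but graph_json and params must be stored in some ad-hoc way.
with open("graph.json", "w") as f:
    f.write(graph_json)
with open("model.params", "wb") as f:
    f.write(relay.save_param_dict(params))

# In another Python instance, only the library is recreated automatically:
loaded_lib = tvm.runtime.load_module("model.so")
```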
On microTVM, we are preparing to introduce a Project-level API, implementations of which a) live in separate codebases from `tvm` and b) build firmware images from the `tvm.relay.build` artifacts. Because those implementations live outside the TVM codebase, the API needs to specify how all artifacts from `tvm.relay.build` are placed on disk.
To prepare for this API, we propose Model Library Format, a standard on-disk format for microTVM artifacts. microTVM primarily expects users to use the `c` or `llvm` backends with a cross-compiler, and build results may contain BYOC artifacts as well. As a secondary goal of this RFC, we make some considerations such that Model Library Format could be re-used as the standard on-disk format produced by `tvmc`.
Goals
- Describe a standard way to serialize microTVM artifacts for use in downstream automation that compiles them into firmware.
- Describe how to implement a load API such as `tvm.runtime.load_module() -> GraphRuntimeFactory`.
- Make considerations to accommodate other runtimes such as AOT and VM.
Non-Goals
- Immediately change the `tvmc` output format to Model Library Format for non-µTVM uses. The initial implementation is focused on microTVM only.
- Decide how to serialize compilation flows unrelated to microTVM.
Model Library Format
Model Library Format is a tar-archived directory tree. A sketch is as follows:
/
  README.md          - A short standardized README for new users plus human-readable metadata.json
  metadata-<n>.json  - Overall metadata describing this artifact; version <n>
  crt/               - The content of standalone_crt from the TVM build/ directory
    Makefile
    include/
      ...
    src/
      ...
  codegen/           - Stores generated libraries in source or binary form
    host/            - Generated code for target_host
      lib/           - Generated binary object files
        aot.o        - Future home of AOT runtime generated code
        devc.o       - C++ MetadataModule artifact, unused in µTVM. Should get deleted.
        lib0.o       - LLVM module
        lib1.o       - LLVM CRT Metadata Module
      src/           - Generated C source
        devc.c       - C++ MetadataModule artifact, unused in µTVM. Should get deleted.
        lib0.c       - C module
        lib1.c       - C CRT Metadata Module
    <target_key>/    - Additional directories for code which should get compiled for use on a target
  parameters/        - Stores simplified parameters
    <model_name>.bson   - BSON-serialized runtime parameters (optional)
    <model_name>.params - tvm.relay._save_params format (always present)
    <model_name>.json   - JSON-serialized parameters (optional)
  relay.txt          - Text representation of the compiled Relay model, if built from Relay
  runtime-config/    - Stores runtime configuration
    aot/             - AOT runtime config
      (tbd)
    graph/           - Graph runtime config
      graph.json     - Graph runtime JSON
metadata.json
The metadata file contains machine-parseable data describing the build. It also contains model-level information that is easier (right now) to parse as a single JSON document rather than split into many smaller purpose-specific files.
Following is a proposed schema:
{
  "version": 1,                                  // version of this document
  "model_name": "<model_name>",                  // model name (passed as mod_name= to tvm.relay.build)
  "export_datetime_utc": "%Y-%m-%d %H:%M:%SZ",   // time of export, in UTC
  "memory": {},                                  // configured memory map (see Memory Map)
  "target": "",                                  // TVM target string used to compile this artifact
  "runtimes": ["graph"]                          // the runtimes that can launch this model
}
Memory Map
In v1, the Memory Map will describe the buffers allocated by the GraphRuntime. As the memory planner is improved, this data structure will be expanded. Following is the schema for the “memory” key in v1:
[
  {
    "storage_id": <n>,     // storage_id of the buffer, allocated by GraphRuntime
    "size_bytes": <n>,     // size of this buffer, in bytes
    "input_binding": ""    // when bound to a model input, the name of that input
  },
  // Additional entries
]
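As an illustration of how downstream firmware automation might consume this map, here is a hedged Python sketch. The archive name, extraction path, and metadata file name are assumptions based on the layout proposed above, and the memory map is read in the list form described in this section.

```python
import json
import tarfile

# Unpack a Model Library Format archive (name is illustrative).
with tarfile.open("model.model-lib") as tf:
    tf.extractall("model_lib")

# Read the v1 metadata document described above.
with open("model_lib/metadata-1.json") as f:
    metadata = json.load(f)

# Sum the GraphRuntime buffer sizes and list buffers bound to model inputs,
# e.g. to size a static memory pool in generated firmware.
entries = metadata["memory"]
total_bytes = sum(e["size_bytes"] for e in entries)
input_bindings = [e["input_binding"] for e in entries if e.get("input_binding")]

print("GraphRuntime buffers:", len(entries))
print("Total buffer bytes:", total_bytes)
print("Input-bound buffers:", input_bindings)
```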
Building a Model Library Format
Here is the process by which TVM creates a Model Library Format tree from the `tvm.relay.build` artifacts. Here, `graph_json`, `lib`, and `params` are the returned 3-tuple and `target` is the TVM target. Creation of intermediate directories (`mkdir`) is assumed.
- If `target` contains `--runtime=crt`, copy `$tvm_root/build/standalone_crt` to `./crt`.
- Populate `./codegen` by calling `lib.export_library()`, which should:
  - Collect all Modules that execute on the host and pass them to `fcompile`. At present, these are those with `type_key()` of `c` or `llvm`. When the `c` target is used, `fcompile` should copy the generated files into `./codegen/host/src` instead of generating a `.tar`.
  - (TODO, but not as a result of this RFC) Group the non-host modules by target_type (except that ext_dev target_types should be expanded to a unique key per BYOC). Save each generated module into a file underneath `./codegen/<target_type>`.
- Populate `./parameters`:
  - Produce `<model_name>.params` with `tvm.relay._save_params`.
  - Produce `<model_name>.json` with TBD (there doesn’t seem to be a standard in TVM, so I guess we’ll have to propose one).
- Produce `relay.txt` with `IRModule.get_source`.
- Produce `./runtime-config` as follows:
  - for GraphRuntime: save `graph.json` to `./runtime-config/graph/graph.json`
  - for VM: TBD
  - for AOT: TBD
- Produce `metadata-<n>.json` by building the required data structure and serializing it to JSON.
Finally, the entire directory tree should be packaged into a TAR file with a `.model-lib` extension for easy transmission.
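The following Python sketch condenses the procedure above for a GraphRuntime build using the `c` backend. It is illustrative only: the helper name, the omission of the CRT copy and BYOC grouping, and the way `lib.export_library()` would populate `codegen/` are assumptions, not the final implementation.

```python
import datetime
import json
import os
import tarfile

from tvm import relay


def export_model_library_format_sketch(graph_json, lib, params, target, model_name, out_dir):
    """Illustrative only: lay out a Model Library Format tree and tar it."""
    # codegen/: for the c backend, lib.export_library() is proposed to copy
    # generated sources into ./codegen/host/src rather than emitting a .tar.
    os.makedirs(os.path.join(out_dir, "codegen", "host", "src"), exist_ok=True)

    # parameters/: the tvm.relay._save_params format is always present.
    os.makedirs(os.path.join(out_dir, "parameters"), exist_ok=True)
    with open(os.path.join(out_dir, "parameters", f"{model_name}.params"), "wb") as f:
        f.write(relay.save_param_dict(params))

    # runtime-config/graph/graph.json for GraphRuntime.
    os.makedirs(os.path.join(out_dir, "runtime-config", "graph"), exist_ok=True)
    with open(os.path.join(out_dir, "runtime-config", "graph", "graph.json"), "w") as f:
        f.write(graph_json)

    # metadata-<n>.json, following the v1 schema above.
    metadata = {
        "version": 1,
        "model_name": model_name,
        "export_datetime_utc": datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%SZ"),
        "memory": [],  # to be filled from the GraphRuntime memory plan
        "target": str(target),
        "runtimes": ["graph"],
    }
    with open(os.path.join(out_dir, "metadata-1.json"), "w") as f:
        json.dump(metadata, f, indent=2)

    # Finally, package the tree as a .model-lib tar archive.
    with tarfile.open(out_dir + ".model-lib", "w") as tf:
        tf.add(out_dir, arcname=".")
```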
Implementation in TVM
The implementation of this RFC will initially consist of the following:
- Adding a new function, `tvm.runtime.Module#export_model_library_format`. This function implements the above procedure for runtimes which use the `c` backend.
- Placing the state necessary to implement `export_model_library_format` into GraphRuntimeCodegenModule, and making it accessible from Python.
- Adding `loadfile_model_lib`, which allows loading a `tvm.runtime.GraphRuntimeFactoryModule` from the file produced by `export_model_library_format`.
- Adding unit tests and changing `apps/bundle_deploy` to use this format as an example.
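A hedged sketch of how a user might exercise the proposed API is shown below; the method spelling and file name are assumptions, since the exact signatures are only being proposed here.

```python
import tvm

# `lib` is assumed to be the module returned by tvm.relay.build (see Background).
lib.export_model_library_format("model.model-lib")  # proposed method on tvm.runtime.Module

# Proposed: loadfile_model_lib lets tvm.runtime.load_module recreate a
# GraphRuntimeFactoryModule from the exported file.
factory = tvm.runtime.load_module("model.model-lib")
```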
Following implementation of this RFC, another RFC (Project-level API for µTVM projects) will be submitted explaining how we intend to refactor the current interaction between TVM and µTVM runtime projects to allow for better portability. Also, `tvmc` will begin creating Model Library Format for `--runtime=c` targets.
µTVM Use Cases
Here I briefly walk through some µTVM use cases of Model Library Format to consider whether it’s a net improvement.
Building Host-Driven Firmware (µTVM)
At present, µTVM builds host-driven firmware (GraphRuntime instantiated on the host) as follows:
- The user instantiates an implementation of `tvm.micro.Compiler`.
- TVM invokes `tvm.micro.Compiler#library` to compile each CRT sub-library and the code in `./codegen/host`.
- TVM invokes `tvm.micro.Compiler#binary` to build a binary firmware image including each library.
Following implementation of this change, the compilation flow will remain the same, but the CRT sources used will be taken from the Model Library Format tree.
Host-Driven Inference
At present, this is done from within the same Python script that called `tvm.relay.build`, since it’s easier to keep all of the state in memory. It can be done from a separate `python` invocation, but there is no standard function to load all of the necessary state, so the process is ad-hoc. Following this change, the GraphRuntimeFactoryModule can be loaded using `tvm.runtime.load_module`, so it will be much easier to reconstruct the state needed for host-driven inference.
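As an illustration, a host-driven inference session might then look like the hedged sketch below. The archive name, input name, and input shape are placeholders, and the `factory["default"]` pattern assumes the loader returns a GraphRuntimeFactoryModule usable like one produced for the C++ runtime.

```python
import numpy as np
import tvm
from tvm.contrib import graph_runtime

# Load the exported Model Library Format artifact via the proposed
# loadfile_model_lib hook (file name is illustrative).
factory = tvm.runtime.load_module("model.model-lib")

# Instantiate a GraphRuntime from the factory and run one inference.
module = graph_runtime.GraphModule(factory["default"](tvm.cpu(0)))
module.set_input("input", np.zeros((1, 3, 224, 224), dtype="float32"))
module.run()
output = module.get_output(0).asnumpy()
```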
Building Standalone Firmware (e.g. apps/bundle_deploy)
Currently, `apps/bundle_deploy` invokes a custom Python script which produces artifacts in `apps/bundle_deploy/build`. After this RFC, `apps/bundle_deploy/build_model.py` will produce Model Library Format artifacts for the C-runtime-compatible builds.
For `apps/bundle_deploy`, the Makefile will be updated to reference the artifacts in standard locations. In the future, it will be possible to write a standard script to ingest generated code as a library into project build systems.
Future Work
We expect to make changes as future considerations are made in Model Library Format. Each time a change is made, the version number will be incremented. Here are some sketches of future topics that could be tackled.
Contexts
In heterogeneous execution, this key will describe the various DLContexts that TVM expects to be configured on the device. This RFC doesn’t seek to fully describe this key; heterogeneous execution is a future goal of µTVM, and until something more concrete is proposed there, this key will just contain an entry for `DLContext(kDLCPU, 0)`.
Here is a strawman:
"contexts": [
{
"device_type": "cpu",
"device_id": 0,
},
{
"device_type": "ext_dev",
"device_id": 0,
"compiler": "accel_compiler_key",
"config": {
// device-specific config, populated by BYOC
},
},
], // configured DLContext (see DLContext configuration)
Models Targeted to the C++ Runtime
Models targeted to the C++ runtime have a very similar structure to those targeted at the C runtime. The main difference is in how non-`c` and non-`llvm` (“non-DSO-Exportable”) modules are packaged.
The C++ runtime places all modules in a single shared library like a “fat binary.” At load time, it expects to find a constant `__tvm_dev_mblob`, which contains the concatenated `Module#save` output from all of these modules. It then invokes a `runtime.module.loadbinary_<type_key>` function for each Module in `__tvm_dev_mblob`.
In the C runtime, non-DSO-Exportable modules are typically created from BYOC flows and are meant to be executed by accelerators. Because RAM is typically quite precious on microcontrollers, the C runtime intends to make such generated BYOC code available to the downstream firmware build at compile time. Modules are grouped by `target_type`, and one file is generated per Module containing that Module’s `Module#save` output.
It’s possible that both approaches could be taken for the C++ runtime to allow pre-compilation of Modules. However, the simplest and most likely way forward would be to create `./codegen/<model_name>.so` and avoid creating subdirectories. When the `c` backend is used with the C++ runtime, `./codegen/host/src` could still be created, or the `.tar` could be placed in `./codegen/<model_name>.tar`.
@tqchen @gromero @leandron @manupa-arm @mdw-octoml @jroesch @mjs @liangfu