Bring your own codegen to TVM + Graph Partitioning
The goal is to design the right Relay subgraph data structure/abstraction so that third-party library and hardware vendors can more conveniently bring their own codegen tools to TVM.
This RFC involves design and implementation in at least the following aspects.
- Graph coloring
  - Provide HW vendors with an infra to customize where they want to execute an op.
- Graph partitioning
  - A Relay pass that partitions a program into segments that could be executed on various hardware platforms.
- Code generation
  - Generate code for each segment of a partitioned Relay program.
- Artifact serialization
  - Provide functionality to support save/load of the compiled artifacts.
- Runtime
  - Integrate other runtimes/execution engines or invoke the external library code/subgraph through both the graph runtime and the VM (the current POC implementation uses the VM).
Model Coverage
- CNN: MLP, VGG, ResNet, SqueezeNet, Inception V3, etc.
- CV: SSD with ResNet 50, MobileNet, VGG-16, etc.
- NLP models are not supported well yet in Relay so we will revisit them in the future.
- And more…
Coloring - Group annotated nodes into a minimum number of subgraphs.
- Problem Formulation
  - Input
    - Given a Relay graph with extern op annotations (added by users or by some internal mechanisms).
    - The I/O of each node (op) may or may not have annotations to indicate if this node is suggested to be offloaded.
  - Output
    - A graph with minimum annotations on edges indicating the boundary of subgraphs.
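To make the formulation concrete, here is a minimal sketch in plain Python. It is not the RFC's Algorithm 1 and does not use Relay: the toy graph format, the function name `group_annotated_nodes`, and the node names are all assumptions made for illustration. It merges connected nodes that share the same annotation using union-find, so each resulting group is one candidate subgraph.

```python
from collections import defaultdict


def group_annotated_nodes(inputs, annotation):
    """Group connected nodes that share the same external-compiler annotation.

    inputs: dict mapping node name -> list of producer node names (toy graph).
    annotation: dict mapping node name -> compiler name, or None for TVM.
    Returns a list of sets, one per candidate subgraph.
    """
    parent = {n: n for n in inputs}

    def find(n):
        # Follow parent links to the representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for node, producers in inputs.items():
        for prod in producers:
            label = annotation.get(node)
            # Merge only when both ends of the edge target the same compiler.
            if label is not None and label == annotation.get(prod):
                parent[find(node)] = find(prod)

    groups = defaultdict(set)
    for node in inputs:
        if annotation.get(node) is not None:
            groups[find(node)].add(node)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Toy graph (hypothetical): data -> conv1 -> relu1 -> add <- conv2 <- data.
toy_inputs = {
    "data": [],
    "conv1": ["data"],
    "relu1": ["conv1"],
    "conv2": ["data"],
    "add": ["relu1", "conv2"],
}
toy_annotation = {
    "data": None, "conv1": "dnnl", "relu1": "dnnl", "conv2": "dnnl", "add": None,
}
subgraphs = group_annotated_nodes(toy_inputs, toy_annotation)
```

In this toy case `conv1` and `relu1` end up in one group and `conv2` in another, because they are only connected through unannotated nodes. Note that this naive merging does not check whether a merged group would create a cyclic dependency through nodes outside the group; a real pass would have to guard against that.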
- Implementation 1: Op-level annotation
  - For each op, we register a corresponding check function. The checker is invoked at compilation time to indicate whether the op should be annotated for offloading to the 3rd party accelerator. For example, the following shows a checker for `conv2d`:

        @reg.register_extern_op("nn.conv2d")
        def conv2d(attrs, args, comp):
            return get_extern_op(comp, "conv2d")(attrs, args)

    - Note that `comp` is a string that represents the 3rd party compiler name; `get_extern_op` uses `hasattr` and `getattr` to obtain the checkers specified by the 3rd party.
  - HW partners/3rd party libraries only need to implement simple checker functions for each op to specify whether they can support the op under certain conditions. The following example shows a case where the accelerator only supports `conv2d` with floating-point types:

        def conv2d(attrs, args):
            dtype = args[0].output_type_.dtype
            return dtype in ('float32', 'float64')

    - Note that HW partners do not need to register this function; they just need to implement it under Relay backend/contrib/compiler_name so that the function can be discovered and imported dynamically.
  - A Relay IR pass in Python will invoke the above functions, insert annotations into the graph, and run Algorithm 1 for coloring.
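The `hasattr`/`getattr` lookup described for `get_extern_op` could look roughly like the sketch below. This is an assumption-laden illustration, not the POC code: the in-memory `_VENDOR_MODULES` table stands in for modules that would be imported dynamically from a contrib/compiler_name directory, `toy_compiler` is a made-up compiler name, args are plain dicts rather than Relay expressions, and the fallback that declines unknown ops is my own choice.

```python
import types


def _conv2d_float_only(attrs, args):
    # Hypothetical vendor checker: accept only floating-point inputs.
    return args[0]["dtype"] in ("float32", "float64")


# Stand-in for vendor modules that would be discovered and imported
# dynamically; here they are simply built in memory.
_VENDOR_MODULES = {
    "toy_compiler": types.SimpleNamespace(conv2d=_conv2d_float_only),
}


def get_extern_op(comp, op_name):
    """Return the vendor checker for op_name, or a checker that always declines."""
    module = _VENDOR_MODULES.get(comp)
    if module is not None and hasattr(module, op_name):
        return getattr(module, op_name)
    # Assumption: an op without a vendor checker is never offloaded.
    return lambda attrs, args: False


supported = get_extern_op("toy_compiler", "conv2d")({}, [{"dtype": "float32"}])
unsupported_dtype = get_extern_op("toy_compiler", "conv2d")({}, [{"dtype": "int8"}])
missing_op = get_extern_op("toy_compiler", "dense")({}, [{"dtype": "float32"}])
```

With this shape, a vendor only adds a function named after the op to its module, and the lookup picks it up without any explicit registration on the vendor side.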
- Implementation 2: Subgraph-level annotation
  - We also provide an option for HW partners to annotate the graph directly. In this case, they have to implement a Relay IR pass that uses our APIs to insert boundary annotations (i.e., `subgraph_start` and `subgraph_end`).
Partitioning - Check/Validate the graph and process graph I/Os.
- Problem Formulation
  - Input
    - Given a Relay program with boundary annotations (i.e., `subgraph_start` and `subgraph_end`).
      - The boundary annotations can be added by the coloring stage. In this case, the boundary annotations are always valid.
      - Users can directly add boundary annotations to their Relay programs. In this case, we need to validate the annotations before partitioning.
  - Output
    - The updated Relay program with subgraphs replaced by sub functions. All annotations should be removed, and calls should be inserted to invoke the sub functions.
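The validate-then-extract step can be sketched on a toy representation. The code below assumes a flat list of op names with `subgraph_start` and `subgraph_end` markers in place of a real Relay program; the `partition` function, the `subgraph_N` naming, and the rule that regions may not nest are all assumptions for illustration.

```python
def partition(ops):
    """Split a flat op sequence into a main body and extracted sub functions.

    ops: list of op names interleaved with the markers "subgraph_start"
    and "subgraph_end". Raises ValueError on invalid annotations.
    """
    main, functions, current = [], {}, None
    for op in ops:
        if op == "subgraph_start":
            if current is not None:
                raise ValueError("nested subgraph_start")
            current = []
        elif op == "subgraph_end":
            if current is None:
                raise ValueError("subgraph_end without subgraph_start")
            name = "subgraph_%d" % len(functions)
            functions[name] = current
            main.append("call " + name)  # a call replaces the annotated region
            current = None
        elif current is not None:
            current.append(op)  # op belongs to the open region
        else:
            main.append(op)  # op stays in the main program
    if current is not None:
        raise ValueError("unterminated subgraph_start")
    return main, functions


annotated = [
    "subgraph_start", "conv2d", "relu", "subgraph_end",
    "conv2d_1x1",
    "subgraph_start", "conv2d", "subgraph_end",
    "add",
]
main_body, sub_functions = partition(annotated)
```

After the call, `main_body` contains no markers, only the remaining ops and the inserted calls, which mirrors the output described above. A real pass works on a graph with annotations on edges, so its validation is more involved than this linear scan.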
Codegen - To tell the Relay backend to use external codegen instead of TVM.
- Invoke different codegen tools from TVM directly. This needs HW partners to register their codegen tool to TVM as a runtime module.
- During compilation, we can traverse the graph and check the attributes of different subgraphs. For example, an external codegen tool has to be invoked once we find that a subgraph is annotated with an external compiler. For the example above, we can generate a runtime module for 1x1 conv, but we have to invoke external compilers to generate code for the two subgraphs.
- How to register?
  - HW vendors need to register their compiler as a runtime module and at least be able to deal with the following tasks:
    - Ingest a Relay function/module and compile it.
    - Ingest TVM input data structures, e.g. NDArray. TVM feeds data in the NDArray format to the subgraph and expects the external accelerator to execute it and return the output as an NDArray as well. Therefore, HW vendors will need to consider how to convert TVM data into a format that is compatible with their compiler.
    - Implement the virtual functions of a `runtime::ModuleNode`, i.e. `SaveToFile`, `SaveToBinary`, `GetSource`, `GetFunction`, etc. `GetFunction` is particularly important because that’s how we get the function pointer of a subgraph and invoke it at runtime. An example of registering the CUDA runtime module is here: https://github.com/dmlc/tvm/blob/master/src/runtime/cuda/cuda_module.cc
- What APIs do we need to expose?
  - The major APIs would be similar to those of other codegen tools that are currently baked into TVM, e.g. LLVM and CUDA.
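The three responsibilities listed above (ingest a compiled subgraph, convert host data, expose a function lookup plus serialization) can be sketched in Python as a toy. This is not the TVM `runtime::ModuleNode` interface, which is C++; `ToyExternModule`, its method names, the list-based stand-in for NDArray, and the two supported ops are all hypothetical.

```python
import json


class ToyExternModule:
    """Hypothetical stand-in for a vendor runtime module (not the TVM API)."""

    def __init__(self, functions):
        # functions: dict mapping subgraph name -> list of op names.
        self._functions = dict(functions)

    def get_function(self, name):
        """Return a callable for the named subgraph, or None if unknown."""
        ops = self._functions.get(name)
        if ops is None:
            return None

        def run(data):
            # Convert the host array (a list here) to the "vendor format"
            # (a tuple), execute each op, and convert back on return.
            vendor_data = tuple(data)
            for op in ops:
                if op == "relu":
                    vendor_data = tuple(max(x, 0.0) for x in vendor_data)
                elif op == "negate":
                    vendor_data = tuple(-x for x in vendor_data)
                else:
                    raise NotImplementedError(op)
            return list(vendor_data)

        return run

    def save_to_binary(self):
        """Serialize the compiled artifact; JSON bytes in this toy."""
        return json.dumps(self._functions, sort_keys=True).encode("utf-8")

    @classmethod
    def load_from_binary(cls, blob):
        """Rebuild a module from bytes produced by save_to_binary."""
        return cls(json.loads(blob.decode("utf-8")))


module = ToyExternModule({"subgraph_0": ["negate", "relu"]})
restored = ToyExternModule.load_from_binary(module.save_to_binary())
output = restored.get_function("subgraph_0")([1.0, -2.0, 3.0])
```

The point of the sketch is the calling convention: the host runtime only needs a name-to-callable lookup and a byte-level save/load, and treats everything inside `run` as the vendor's concern.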
Serialization - Save the subgraphs and load them back
- TVM serializes the built artifact into json, params, and library. What do the subgraphs bring us? Each HW vendor has their own artifacts. For example, they may encode the structure of the subgraph into the library, and they may need, or even modify, the params.
- Serialize the partitioned subgraphs into a form to save on disk.
- Do we need to let HW partners know what ops are in the subgraph? We should treat a subgraph as a black box: just feed it input data and expect to get the correct output from external libraries.
- How many libraries? We may generate multiple libraries, one for each backend.
- How to load multiple libraries and let the subgraph invoke the correct library?
- Can we combine them into a fat library if the external codegen tool is registered to TVM as a runtime module?
Runtime - Invoke different subgraphs using different libraries
- Graph runtime and VM runtime.
- Offload a subgraph to the 3rd party library
- How to invoke the library and let it take control of the subgraph?
- Two cases
  - HW vendors have their own runtime.
    - How to coordinate two runtimes?
  - HW vendors don’t have their own runtime.
    - Only use the TVM runtime. We still need the library generated by the external compiler to be able to ingest TVM runtime data and finish the execution of a subgraph.
We have an initial implementation here: https://github.com/zhiics/tvm/tree/partitioning, where we provide support for MKLDNN using the DNNL execution engine, plus a simple experimental version that allows GCC to ingest NDArray and compile a simple graph. Thanks @jroesch for providing many suggestions. Part of the credit should also go to @comaniac for working on this together.
Any comments and thoughts are welcome:)
@tqchen @wweic @haichen @thierry @ajtulloch @jonso @janimesh @ciphr