# RFC: TE Compiler
- Feature Name: te_compiler
- Start Date: 2020-02-24
- RFC PR: apache/tvm-rfcs#7518
- GitHub Issue: apache/tvm#0000
## Summary
The goal of this RFC is to remove the existing interface between Relay and TIR, the CompileEngine class. Removing this class will enable the entire program to be compiled as a unified IRModule, allowing users to transform multiple kernels and Relay functions simultaneously, and giving end users more control over the lowering of TE into TIR.
## Motivation
Our motivation is to unify TIR compilation and Relay compilation so that we can uniformly transform and analyze the entire program from graph level to kernel level. The old CompileEngine was designed in a completely different era of TVM and is not well suited for current efforts and incoming refactors such as AutoTIR and TensorIR.
The current design compiles Relay primitive functions via a callback into the compile engine. The compile engine compiles each primitive function in complete isolation, limiting the ability to analyze or optimize across them. By replacing Relay primitive function calls with TIR primitive function calls that contain the lowered TIR, we enable users to customize the compilation flow after lowering, instead of relying on the fixed compilation pipeline exposed by CompileEngine. Previously the code was lowered from Relay primitive functions directly to packed functions, limiting the user's ability to customize what happens after lowering but before runtime.
## Guide-level explanation
The high-level change means that instead of the compile engine being an invisible piece of machinery invoked by backends such as the graph runtime, VM, or AoT, it will function as an IRModule-to-IRModule pass.
For example, in the current PR you can lower all Relay "primitive functions" (functions marked with the "Primitive" attribute) directly into TIR by invoking the LowerTE pass on an IRModule.
This means you can simply do:

```cpp
auto lowered_mod = LowerTE()(module);
```
This enables:

- An intermediate stage in the lowering process where Relay and TIR coexist.
- The ability to add passes at this intermediate stage:
  - For example, memory planning, which can use user-provided information from TE and the resulting TIR.
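To make the pass-composition idea concrete, here is a minimal, hypothetical sketch in plain Python. None of the names below (`IRModule`, `lower_te`, `annotate_memory`) are the real TVM API; they only illustrate how a lowering pass such as LowerTE can compose with user-inserted passes once everything is an IRModule-to-IRModule function.

```python
# Hypothetical mock of an IRModule-to-IRModule pass pipeline; the class and
# function names are illustrative stand-ins, not the real TVM API.
from dataclasses import dataclass, field


@dataclass
class IRModule:
    # Map of global function name -> (kind, body placeholder), where kind is
    # "relay", "relay-primitive", or "tir".
    functions: dict = field(default_factory=dict)


def lower_te(mod: IRModule) -> IRModule:
    """Mock LowerTE: rewrite every Relay primitive function into TIR."""
    out = IRModule(dict(mod.functions))
    for name, (kind, body) in mod.functions.items():
        if kind == "relay-primitive":
            out.functions[name] = ("tir", f"lowered({body})")
    return out


def annotate_memory(mod: IRModule) -> IRModule:
    """Mock user pass running at the intermediate stage where Relay and TIR
    coexist in the same module (e.g. memory planning)."""
    out = IRModule(dict(mod.functions))
    for name, (kind, body) in mod.functions.items():
        if kind == "tir":
            out.functions[name] = (kind, body + " @ planned")
    return out


mod = IRModule({
    "main": ("relay", "call(add_prim)"),
    "add_prim": ("relay-primitive", "add(x, x)"),
})

# Passes compose because each one is just IRModule -> IRModule.
lowered = annotate_memory(lower_te(mod))
print(lowered.functions["add_prim"])  # ('tir', 'lowered(add(x, x)) @ planned')
```

Note that `main` stays a Relay function while `add_prim` becomes TIR, which is exactly the mixed intermediate stage described above.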
The current implementation is a bit more complex than this, as we are incrementally refactoring the code; this is a large change and will affect most compilation and runtime flows.
## Reference-level explanation
Currently the compile engine is consumed by all Relay runtimes, including the interpreter, graph runtime, VM, and any AoT efforts going forward.
Our proposed design is to take all current uses of the CompileEngine and replace them with a new pass-based wrapper that simply generates all the lowered functions, which can then be added back to the module and compiled as a single unit.
This is a complex refactor and requires a few steps. First we will introduce a temporary state where the new API exists alongside the existing API, which is left in place. We will then migrate each current client of the CompileEngine to the new API before deleting the old code.
We are starting with a proof of concept by refactoring GraphRuntimeCodegen to use the newly introduced TE compiler instead of the compile engine directly. In the new flow:

- The TE/TIR compiler lowers TE in the LowerTensorExpr pass.
- Calls to relay.Function(attr: primitive) are replaced with calls to a GlobalVar pointing to the lowered TIR function.
- GraphPlanMemory runs as usual.
- Finally, GraphRuntimeCodegen::VisitExpr lowers the result to graph JSON.
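The second step above, replacing an inline primitive function with a GlobalVar call, can be sketched as follows. This is a simplified, hypothetical mock in plain Python; the AST classes and the `lower_tensor_expr` rewrite are illustrative stand-ins, not the real TVM implementation, and the name `my_add` is just an example.

```python
# Hypothetical sketch of rewriting Call(primitive Function) into
# Call(GlobalVar) while registering the lowered TIR in the module.
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalVar:
    name: str


@dataclass
class PrimFunc:
    """Stand-in for a lowered TIR function."""
    name: str


@dataclass
class Function:
    """Stand-in for relay.Function; `primitive` mirrors the attribute."""
    body: str
    primitive: bool = False


@dataclass
class Call:
    op: object      # an inline Function before lowering, a GlobalVar after
    args: tuple


def lower_tensor_expr(expr, module):
    """Mock of the LowerTensorExpr rewrite: calls to primitive Relay
    functions become calls to a GlobalVar, and the lowered PrimFunc is
    added to the module."""
    if isinstance(expr, Call) and isinstance(expr.op, Function) and expr.op.primitive:
        gv = GlobalVar("my_add")             # name chosen by the compiler
        module[gv.name] = PrimFunc(gv.name)  # lowered TIR lives in the module
        return Call(gv, expr.args)
    return expr


module = {}
call = Call(Function(body="add(x, x)", primitive=True), args=("x", "x"))
rewritten = lower_tensor_expr(call, module)
print(rewritten.op)     # GlobalVar(name='my_add')
print(sorted(module))   # ['my_add']
```

After the rewrite, both the graph-level caller and the kernel-level PrimFunc live in one module, which is what lets later passes see the whole program.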
One remaining challenge is that the BYOC flow currently produces runtime modules inside the compile engine; it might make sense to split these out into a secondary pass which generates the runtime::Modules directly.
The process will lower a function like:

```
def @relay_fn(%x: Tensor[(10, 10), f32]) {
  add(%x, %x)
}
```

into:

```
primfn @my_add(a: handle, b: handle, c: handle) {
  ...
}

def @relay_fn(%x: Tensor[(10, 10), f32]) {
  @my_add(%x, %x)
}
```
This doesn't account for the secondary need to track output buffers, which is something we still need to decide how to rectify; the current VM design makes an explicit change of calling convention from call nodes to a specialized pseudo-op.
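The output-buffer issue comes from the calling-convention mismatch: lowered TIR functions take a destination buffer as an argument (note the third `c: handle` parameter of `@my_add` above), whereas a Relay call returns a fresh value. A tiny, hypothetical Python illustration of that destination-passing convention, with plain lists standing in for tensors:

```python
# Sketch of the destination-passing calling convention of lowered TIR:
# the caller (runtime/VM) must plan and allocate the output buffer and pass
# it explicitly, rather than receiving a returned value. Plain Python
# stand-ins; not the real TVM runtime API.

def my_add(a, b, out):
    """Mock lowered PrimFunc: writes its result into the caller-provided
    output buffer instead of returning a new value."""
    for i in range(len(out)):
        out[i] = a[i] + b[i]


x = [1.0, 2.0, 3.0]
out = [0.0] * 3   # the runtime must plan and allocate this buffer up front
my_add(x, x, out) # call site rewritten to pass `out` explicitly
print(out)        # [2.0, 4.0, 6.0]
```

Tracking which call sites need such buffers, and who allocates them, is exactly what the specialized pseudo-op in the VM design encodes.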
## Drawbacks
This is a large refactor; it may require a series of incremental refactors and will churn existing code that people understand.
## Rationale and alternatives
I believe nearly everyone who has to work on the CompileEngine is unhappy with the current design. Due to a lack of clear ownership, this piece of code has become a bit of a dumping ground for any complexity around lowering from Relay to TIR. Splitting it up will enable new features and simplify the code, at the cost of some churn and dev cycles.
## Prior art
Most compilers allow you to see the entire program during compilation. The CompileEngine is an idiosyncratic system that was designed to wrap TVM's compilation API back when there was a hard split between Relay and TIR.
## Unresolved questions
- How many PRs do we split this into?
- How much do we refactor at once?
## Future possibilities
This should enable us to unify much of the compilation flow, meaning we can share more code across the graph runtime, VM, AoT, etc. These unifications are out of scope for this RFC, but are worth considering once we break the hard boundary between the layers.