# RFC: TE Compiler
- Feature Name: te_compiler
- Start Date: 2020-02-24
- RFC PR: apache/tvm-rfcs#7518
- GitHub Issue: apache/tvm#0000
## Summary
The goal of this RFC is to remove the existing interface between Relay and TIR, the CompileEngine class. Removing this class will enable the entire program to be compiled as a unified IRModule, allowing users to transform multiple kernels and Relay functions simultaneously, and giving end users more control over the lowering of TE into TIR.
## Motivation
Our motivation is to unify TIR compilation and Relay compilation so that we can uniformly transform and analyze the entire program from graph level to kernel level. The old CompileEngine was designed in a completely different era of TVM and is not well suited for current efforts and incoming refactors such as AutoTIR and TensorIR.
The current design compiles Relay primitive functions via a callback into the compile engine. The compile engine compiles each primitive function in complete isolation, limiting the ability to analyze or optimize across them. By replacing Relay primitive function calls with TIR primitive function calls that contain the lowered TIR, we enable users to customize the compilation flow after lowering, instead of relying on the fixed compilation pipeline exposed by CompileEngine. Previously the code was lowered from Relay primitive functions directly to packed functions, limiting the user's ability to customize what happens after lowering but before runtime.
## Guide-level explanation
The high-level change means that instead of the compile engine being an invisible piece of machinery invoked by backends such as the graph runtime, VM, or AoT, it will function as an IRModule-to-IRModule pass.
For example, in the current PR you can lower all Relay "primitive functions" (functions marked with the "Primitive" attribute) directly into TIR by invoking the LowerTE pass on an IRModule.
This means you can simply do:

```cpp
auto lowered_mod = LowerTE()(module);
```
This enables:

- An intermediate stage in the lowering process where Relay and TIR coexist.
- The ability to add passes at this intermediate stage:
  - For example, memory planning, which can use user-provided information from TE and the resulting TIR.
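To make the pass-composition idea concrete, here is a minimal, hypothetical sketch in plain Python. None of the names below (`IRModule`, `lower_te`, `annotate_memory`) are the real TVM API; they only illustrate how a lowering pass such as LowerTE can compose with user-inserted passes once everything is an IRModule-to-IRModule function.

```python
# Hypothetical mock of an IRModule-to-IRModule pass pipeline; the class and
# function names are illustrative stand-ins, not the real TVM API.
from dataclasses import dataclass, field


@dataclass
class IRModule:
    # Map of global function name -> (kind, body placeholder), where kind is
    # "relay", "relay-primitive", or "tir".
    functions: dict = field(default_factory=dict)


def lower_te(mod: IRModule) -> IRModule:
    """Mock LowerTE: rewrite every Relay primitive function into TIR."""
    out = IRModule(dict(mod.functions))
    for name, (kind, body) in mod.functions.items():
        if kind == "relay-primitive":
            out.functions[name] = ("tir", f"lowered({body})")
    return out


def annotate_memory(mod: IRModule) -> IRModule:
    """Mock user pass running at the intermediate stage where Relay and TIR
    coexist in the same module (e.g. memory planning)."""
    out = IRModule(dict(mod.functions))
    for name, (kind, body) in mod.functions.items():
        if kind == "tir":
            out.functions[name] = (kind, body + " @ planned")
    return out


mod = IRModule({
    "main": ("relay", "call(add_prim)"),
    "add_prim": ("relay-primitive", "add(x, x)"),
})

# Passes compose because each one is just IRModule -> IRModule.
lowered = annotate_memory(lower_te(mod))
print(lowered.functions["add_prim"])  # ('tir', 'lowered(add(x, x)) @ planned')
```

Note that `main` stays a Relay function while `add_prim` becomes TIR, which is exactly the mixed intermediate stage described above.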
The current implementation is a bit more complex than this, as we are incrementally refactoring the code; this is a large change and will affect most compilation and runtime flows.
## Reference-level explanation
Currently the compile engine is consumed by all Relay runtimes, including the interpreter, graph runtime, VM, and any AoT efforts going forward.
Our proposed design is to take all current uses of the CompileEngine and replace them with a new pass-based wrapper that simply generates all the lowered functions, which can then be added back to the module and compiled as a single unit.
This is a complex refactor and requires a few steps. First we will introduce a temporary state where the new API exists alongside the existing API, which is left in place. We will then migrate each current client of the CompileEngine to the new API before deleting the old code.
We are starting with a proof of concept by refactoring GraphRuntimeCodegen to use the newly introduced TE compiler instead of the compile engine directly. In the new flow:

- The TE/TIR compiler lowers TE in the LowerTensorExpr pass.
- Calls to relay.Function(attr: primitive) are replaced with calls to a GlobalVar pointing to the lowered TIR function.
- GraphPlanMemory runs as usual.
- Finally, GraphRuntimeCodegen::VisitExpr lowers the result to graph JSON.
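The second step above, replacing an inline primitive function with a GlobalVar call, can be sketched as follows. This is a simplified, hypothetical mock in plain Python; the AST classes and the `lower_tensor_expr` rewrite are illustrative stand-ins, not the real TVM implementation, and the name `my_add` is just an example.

```python
# Hypothetical sketch of rewriting Call(primitive Function) into
# Call(GlobalVar) while registering the lowered TIR in the module.
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalVar:
    name: str


@dataclass
class PrimFunc:
    """Stand-in for a lowered TIR function."""
    name: str


@dataclass
class Function:
    """Stand-in for relay.Function; `primitive` mirrors the attribute."""
    body: str
    primitive: bool = False


@dataclass
class Call:
    op: object      # an inline Function before lowering, a GlobalVar after
    args: tuple


def lower_tensor_expr(expr, module):
    """Mock of the LowerTensorExpr rewrite: calls to primitive Relay
    functions become calls to a GlobalVar, and the lowered PrimFunc is
    added to the module."""
    if isinstance(expr, Call) and isinstance(expr.op, Function) and expr.op.primitive:
        gv = GlobalVar("my_add")             # name chosen by the compiler
        module[gv.name] = PrimFunc(gv.name)  # lowered TIR lives in the module
        return Call(gv, expr.args)
    return expr


module = {}
call = Call(Function(body="add(x, x)", primitive=True), args=("x", "x"))
rewritten = lower_tensor_expr(call, module)
print(rewritten.op)     # GlobalVar(name='my_add')
print(sorted(module))   # ['my_add']
```

After the rewrite, both the graph-level caller and the kernel-level PrimFunc live in one module, which is what lets later passes see the whole program.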
One remaining challenge is that the BYOC flow currently produces runtime modules inside the compile engine; it might make sense to split these out into a secondary pass which generates the runtime::Modules directly.
The process will lower a function like:

```
def @relay_fn(%x: Tensor[(10, 10), f32]) {
  add(%x, %x)
}
```

into:

```
primfn @my_add(a: handle, b: handle, c: handle) {
  ...
}

def @relay_fn(%x: Tensor[(10, 10), f32]) {
  @my_add(%x, %x)
}
```
This doesn't account for the secondary need to track output buffers, which is something we still need to decide how to rectify; the current VM design makes an explicit change of calling convention from call nodes to a specialized pseudo-op.
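The output-buffer issue comes from the calling-convention mismatch: lowered TIR functions take a destination buffer as an argument (note the third `c: handle` parameter of `@my_add` above), whereas a Relay call returns a fresh value. A tiny, hypothetical Python illustration of that destination-passing convention, with plain lists standing in for tensors:

```python
# Sketch of the destination-passing calling convention of lowered TIR:
# the caller (runtime/VM) must plan and allocate the output buffer and pass
# it explicitly, rather than receiving a returned value. Plain Python
# stand-ins; not the real TVM runtime API.

def my_add(a, b, out):
    """Mock lowered PrimFunc: writes its result into the caller-provided
    output buffer instead of returning a new value."""
    for i in range(len(out)):
        out[i] = a[i] + b[i]


x = [1.0, 2.0, 3.0]
out = [0.0] * 3   # the runtime must plan and allocate this buffer up front
my_add(x, x, out) # call site rewritten to pass `out` explicitly
print(out)        # [2.0, 4.0, 6.0]
```

Tracking which call sites need such buffers, and who allocates them, is exactly what the specialized pseudo-op in the VM design encodes.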
## Drawbacks
This is a large refactor; it may require a series of incremental refactors and will churn existing code that people understand.
## Rationale and alternatives
I believe nearly everyone who has to work on the CompileEngine is unhappy with the current design. Due to a lack of clear ownership, this piece of code has become a bit of a dumping ground for any complexity around lowering from Relay to TIR. Splitting it up will enable new features and simplify the code, at the cost of some churn and dev cycles.
## Prior art
Most compilers allow you to see the entire program during compilation. The CompileEngine is an idiosyncratic system that was designed to wrap TVM's compilation API back when there was a hard split between Relay and TIR.
## Unresolved questions
- How many PRs do we split this into?
- How much do we refactor at once?
## Future possibilities
This should enable us to unify much of the compilation flow, meaning we can share more code across the graph runtime, VM, AoT, etc. These unifications are out of scope for this RFC, but are worth considering once we break the hard boundary between the layers.