[Discussion] Refactoring the build module process

Motivation

Currently, our build function has three stages (https://github.com/apache/tvm/blob/fb64be3f7807df18c2df6ebf5e68178e564ab0b4/python/tvm/driver/build_module.py#L140-L302):

  1. Lower the module (which may be a te.Schedule, tir.PrimFunc, or IRModule) before it is built for the target.
  2. Annotate the module with targets.
  3. Run the target-aware lowering process (TIRToRuntime). A usage sketch of this flow follows below.
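
From the user side, the three stages look roughly like this (a minimal sketch; the workload, schedule, and names are illustrative):

```python
import tvm
from tvm import te

# A toy workload; the schedule and shapes are only for illustration.
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)

# Stage 1: target-agnostic lowering into an IRModule of tir.PrimFuncs
# (tir.PrimFunc and IRModule inputs are accepted as well).
mod = tvm.lower(s, [A, B], name="add_one")

# Stages 2-3: tvm.build annotates the module with the target, then runs
# the target-aware lowering (TIRToRuntime) to produce a runtime module.
rt_mod = tvm.build(mod, target="llvm")
```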

Limitations

However, this design has limitations that hinder some of our development:

  1. Some of the passes in the lowering process have to be hardware-aware (e.g. BF16LegalizeCompute and FP8LegalizeCompute need to know whether the target has native bf16/fp8 support).
  2. We cannot bind different PrimFuncs in an IRModule to different targets (see the sketch after this list).
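
To make limitation 2 concrete, here is a minimal sketch (the function names and workload are made up): an IRModule holding two PrimFuncs can still only be handed a single target at build time.

```python
import tvm
from tvm import te

def make_func(name):
    # A trivial PrimFunc so the module contains two candidate kernels.
    A = te.placeholder((128,), name="A")
    B = te.compute((128,), lambda i: A[i] * 2.0, name="B")
    return te.create_prim_func([A, B]).with_attr("global_symbol", name)

mod = tvm.IRModule({"f_cpu": make_func("f_cpu"), "f_gpu": make_func("f_gpu")})

# Limitation 2: the build entry point takes one target for the whole
# module, so f_cpu and f_gpu cannot be bound to llvm and cuda separately.
rt_mod = tvm.build(mod, target="llvm")
```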

Ongoing efforts

  1. Since most of the passes in the lowering process are target-agnostic, it should be safe to make the lowering function target-aware. https://github.com/apache/tvm/pull/15183 implements this refactor; a rough illustration follows this list.
  2. Refactor the lowering flow to make it more flexible (https://github.com/apache/tvm/pull/14985 may be related).
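
As a rough illustration of the mechanism behind the first item (this is only the existing Target-context API, not the PR's actual interface): once lowering runs under a Target context, target-dependent passes can query the target directly.

```python
import tvm

target = tvm.target.Target("llvm")

with target:
    # Inside a Target context, a pass can recover the current target and
    # branch on its capabilities, e.g. to decide whether bf16 compute
    # must be legalized into fp32 on this hardware.
    current = tvm.target.Target.current()
    print(current.kind.name)  # "llvm"
```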

A more fundamental question is whether we still need lowering as a standalone process. Once the lowering function is target-aware, it no longer seems necessary to decouple it from the later passes in TIRToRuntime. I think the only reason we kept the lowering function is compatibility with te.Schedule, which needs to be converted to an IRModule first.

Possible next steps could be:

  1. Merge the passes in the lowering function and the TIRToRuntime function.
  2. In the build function, refactor the three-stage lowering process into a single flow (a sketch follows this list).
  3. Mark the lower function as deprecated and keep it only for legacy te.Schedule support.
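
A hypothetical single-flow build could look like the sketch below; the helper name build_single_flow and the (heavily abbreviated) pass list are illustrative, not a committed API:

```python
import tvm
from tvm import tir

def build_single_flow(mod: tvm.IRModule, target: tvm.target.Target):
    # One Sequential holds both the formerly target-agnostic passes (from
    # the lowering function) and the target-aware ones (from TIRToRuntime),
    # all run under a single Target context.
    pipeline = tvm.transform.Sequential(
        [
            tir.transform.Simplify(),     # target-agnostic (formerly in tvm.lower)
            tir.transform.UnrollLoop(),   # target-agnostic
            tir.transform.LowerIntrin(),  # target-aware (formerly in TIRToRuntime)
        ]
    )
    with target:
        return pipeline(mod)
```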

cc @Lunderberg @tqchen @junrushao @masahi


Note that, as a versatile compiler, TVM is routinely used for multiple different tasks, including building for different targets with different sets of passes. In this case, I believe a Unity-style pipeline API would be quite helpful.

Ref: https://github.com/mlc-ai/relax/blob/mlc/python/tvm/relax/pipeline.py#L59
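
For the TIR flow, such a pipeline could be a registered module pass built from tvm.transform.Sequential, in the spirit of the file referenced above; the pipeline name and pass list below are illustrative:

```python
import tvm
from tvm import tir

@tvm.transform.module_pass(opt_level=0, name="illustrative_tir_pipeline")
def illustrative_tir_pipeline(mod: tvm.IRModule, ctx: tvm.transform.PassContext):
    # Each task/target could register its own pipeline variant.
    seq = tvm.transform.Sequential(
        [
            tir.transform.Simplify(),
            tir.transform.UnrollLoop(),
        ]
    )
    return seq(mod)

# Applying the pipeline to an IRModule:
# mod = illustrative_tir_pipeline(mod)
```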


Yup, this PR is definitely related, and it primarily exists to solve Limitation #2. The long-term goal is to enable scheduling that decides which regions of a PrimFunc to delegate to different devices (e.g. annotate a statement as being executed on a specific device, then lower it into a compute kernel), with the short-term gain of simplifying the lowering flow.

Regarding tvm.lower, I do like having the separation of target-agnostic optimizations (e.g. algebraic simplification, loop unrolling) from target-specific optimizations (e.g. bf16 legalization). It can be useful to inspect the lowered TIR, as it is often the simplest form of the program: it lacks both the high-level scheduling information and the low-level target-specific details.
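
For instance, one can stop after tvm.lower and print the module to inspect exactly this intermediate form (toy workload for illustration):

```python
import tvm
from tvm import te

A = te.placeholder((128,), name="A")
B = te.compute((128,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)

# Lowered TIR: scheduling decisions are already baked in, but no
# target-specific legalization (e.g. bf16) has run yet.
mod = tvm.lower(s, [A, B], name="add_one")
print(mod.script())
```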