This is a discussion post proposing directions we can work on in the relax-based pipeline. As we move toward the new pipeline for models, we will start to adopt an IRModule-centric approach:
- Bring models into an IRModule that contains both relax and tensor program functions.
- Enable customizable pipelines that rewrite the graph and partially lower operators.
- Empower downstream projects to add necessary customizations to any part of the compiler pipeline while maintaining performance.
To make this flow work, one main requirement is a default pipeline that is reasonably simple while remaining performant in most cases. Such a default pipeline would serve as a basis for further iterations. Having a performant default pipeline reduces the burden of optimizing the end-to-end workflow and lets customization development focus on any region of the program.
Admittedly, due to the rich growth of hardware-specific optimizations, such a default pipeline will very likely need to depend on the target. The main goals of the target-specific pipeline are as follows:
- G0: Directly readable and copy-paste-editable on a per-target basis, without having to worry too much about disruption from another target-specific pipeline.
- G1: Pick up the common performant ops on the platform
- G2: Ensure coverage of most operators
Proposed Structure
Based on these goals, let us reorganize the relax/backend folder into the following structure:
- In the backend folder, add a subfolder for each target of interest; this enables G0. Developers can copy-paste pipeline.py from this folder.
- Include key library dispatches (when available) to meet G1
- Leverage TIR dispatches to ensure G2
relax/backend/cuda
  # exposes relax.backend.cuda.default_pipeline(), which returns the default
  # pass pipeline for cuda; developers can copy-paste-edit pipeline.py
  - pipeline.py
  - cudnn.py
  - cutlass.py
  - flashinfer.py
relax/backend/gpu_generic
  # gpu-generic pipelines for less common targets like vulkan/webgpu
  # can contain code from relax/backend_tir
  - pipeline.py
relax/backend/opencl
  - pipeline.py
  - clml.py
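For illustration, here is a rough sketch of what one of the library dispatch files could contain. CUDNNDispatch is a name from the proposed layout rather than an existing API; the sketch assumes cuDNN BYOC patterns are registered under a "cudnn" prefix in the pattern registry and reuses the existing FuseOpsByPattern/RunCodegen passes:

```python
# relax/backend/cuda/cudnn.py (sketch, not an existing API)
import tvm
from tvm import relax
from tvm.relax.backend.pattern_registry import get_patterns_with_prefix


@tvm.transform.module_pass(opt_level=0, name="CUDNNDispatch")
class CUDNNDispatch:
    """Offload matched patterns (e.g. conv2d) to cuDNN via BYOC."""

    def transform_module(self, mod, _ctx):
        # Assumes cuDNN patterns are registered with the "cudnn" prefix.
        patterns = get_patterns_with_prefix("cudnn")
        seq = tvm.transform.Sequential([
            relax.transform.FuseOpsByPattern(
                patterns, bind_constants=False, annotate_codegen=True
            ),
            relax.transform.RunCodegen(),
        ])
        return seq(mod)
```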
Here is example code for relax/backend/cuda/pipeline.py:
# relax/backend/cuda/pipeline.py
import tvm
import tvm.dlight
from tvm.relax import backend, transform

from . import cudnn
from . import cutlass
from . import flashinfer


def library_dispatch_passes():
    # Dispatch key ops to vendor libraries; performance is the primary goal.
    return [
        backend.DispatchSampling(),
        backend.DispatchSortScan(),
        cudnn.CUDNNDispatch(),
        cutlass.CutlassDispatch(),
        flashinfer.FlashInferDispatch(),
    ]


def legalize_passes():
    # Legalize the remaining ops to TIR and schedule them with dlight;
    # coverage is the primary goal.
    return [
        transform.LegalizeOps(),
        transform.AnnotateTIROpPattern(),
        transform.FoldConstant(),
        transform.FuseOps(),
        transform.FuseTIR(),
        tvm.dlight.ApplyDefaultSchedule(
            tvm.dlight.gpu.Matmul(),
            tvm.dlight.gpu.GEMV(),
            tvm.dlight.gpu.Reduction(),
            tvm.dlight.gpu.GeneralReduction(),
            tvm.dlight.gpu.Fallback(),
        ),
    ]


def finalize_passes():
    # It is not necessary to define finalize passes for each target;
    # a default one can be found in backend.generic.finalize_passes().
    return [
        transform.RewriteDataflowReshape(),
        transform.ToNonDataflow(),
        transform.RemovePurityChecking(),
        transform.CallTIRRewrite(),
        transform.StaticPlanBlockMemory(),
        transform.RewriteCUDAGraph(),
        transform.LowerAllocTensor(),
        transform.KillAfterLastUse(),
        transform.LowerRuntimeBuiltin(),
        transform.ComputePrimValue(),
        transform.VMShapeLower(),
        transform.AttachGlobalSymbol(),
    ]


def default_pipeline():
    @tvm.transform.module_pass(opt_level=0)
    def _pipeline(mod: tvm.ir.IRModule, _ctx: tvm.transform.PassContext):
        seq = tvm.transform.Sequential(
            library_dispatch_passes()
            + legalize_passes()
            + finalize_passes()
        )
        mod = seq(mod)
        return mod

    return _pipeline
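For reference, here is a minimal sketch of how the default pipeline could be invoked, assuming the proposed layout above is in place; the toy model and the relax.backend.cuda import path are assumptions for illustration:

```python
import tvm
from tvm.script import ir as I, relax as R
from tvm.relax.backend import cuda  # assumes the proposed layout above


# A toy model purely for illustration.
@I.ir_module
class ToyModel:
    @R.function
    def main(x: R.Tensor((16, 16), "float32")) -> R.Tensor((16, 16), "float32"):
        return R.add(x, x)


# dlight scheduling picks up the target from the surrounding scope.
with tvm.target.Target("cuda"):
    mod = cuda.pipeline.default_pipeline()(ToyModel)
```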
While it is possible to arbitrarily customize the relax compilation pipeline, based on our previous experience, we structure it into three stages for the most common use cases:
- library_dispatch_passes: Dispatch key operators (gemm, conv2d) to performant library implementations and leverage libraries with domain-specific implementations (sort, scan). Performance is the primary goal here.
- legalize_passes: Leverage codegen to cover everything when possible. Coverage is the primary goal here.
  - Note that certain ops like sort/scan still depend on whether the target is GPU or CPU; in such cases, they are still categorized under library_dispatch_passes.
  - This still allows us to customize an op-specific legalize strategy if needed.
- finalize_passes: Run all the finalization steps, such as shape lowering and memory planning.
The three pass stages allow customizations to be easily inserted in between. We also do not need to implement all the passes for every target. For example, most common platforms can leverage backend.generic.finalize_passes; however, we might bring domain-specific passes to platforms like cuda. Some examples of possible customizations:
- Insert BYOC before library_dispatch to try out new patterns
- Insert parallel sharding and pipeline partition before library_dispatch
- Replace library_dispatch_passes with our own version
- Replace legalize_passes with a tuning-based pipeline
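To make this concrete, here is a minimal sketch of the first customization above, assuming the staged cuda layout exists; MyBYOCPartition is a hypothetical placeholder pass:

```python
import tvm
from tvm.relax.backend import cuda  # assumes the proposed layout above


@tvm.transform.module_pass(opt_level=0, name="MyBYOCPartition")
class MyBYOCPartition:
    """Hypothetical downstream BYOC pass; an identity here for illustration."""

    def transform_module(self, mod, _ctx):
        return mod


def my_default_pipeline():
    # Reuse the target's stages as-is, inserting the custom pass before
    # library dispatch; any stage could likewise be swapped out wholesale.
    @tvm.transform.module_pass(opt_level=0)
    def _pipeline(mod, _ctx):
        seq = tvm.transform.Sequential(
            [MyBYOCPartition()]
            + cuda.pipeline.library_dispatch_passes()
            + cuda.pipeline.legalize_passes()
            + cuda.pipeline.finalize_passes()
        )
        return seq(mod)

    return _pipeline
```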
Admittedly, having this structure can introduce some duplication, but it also effectively allows downstream developers to copy-paste-edit and play with the structure either in-tree or out of tree. It also allows each backend to be developed at its own pace and to introduce vendor-specific dispatches when needed.
Notes for Actions
- A0: relax/backend/cuda:
  - Establish the cuda backend pipeline and ensure library dispatch along with TIR codegen
- A1: relax/backend/gpu_generic:
  - Establish the folder for generic GPU targets (e.g., vulkan/webgpu)
  - relax/backend_tir can likely move here
- Validate whether the pipeline setup makes sense:
  - Ensure key downstream applications (e.g., MLC LLM) can reuse most of the pipeline
  - Cross-check a broader set of models
- Other backends of interest