Hi all, at the moment our Collage (https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md) implementation is focussing on standard CUDA ‘backends’: TensorRT, cuDNN, cuBLAS and CUTLASS. We’ve noticed three styles for these backends:
- Custom runtime::Module flavor: At build time the external codegen function produces a JSON runtime module which is serialized into the overall .so file by export_library. At runtime the deserialized runtime module’s Init sets up the execution environment (which may involve compilation, tuning, caching, etc.), and Run actually triggers execution.
  E.g. src/relay/backend/contrib/tensorrt/codegen.cc::TensorRTCompiler and src/runtime/contrib/tensorrt/tensorrt_runtime.cc::Run.
- Built-in flavor: At build time the external codegen function produces TE consisting of an extern call to a shim function built into the TVM runtime. The choice of which extern to call and any additional configuration arguments may be made after tuning. The standard TVM lowering flow is then invoked to compile this to an llvm runtime module which is linked by export_library. At runtime the shim is executed without any further overhead (see the sketch after this list).
  E.g. src/runtime/contrib/tensorrt/tensorrt_runtime.cc registers relay_to_runtime, and src/runtime/contrib/tensorrt/tensorrt_runtime.cc binds the corresponding shims.
- CSource flavor: At build time the external codegen function produces a CSource runtime module which is actually compiled by export_library along with all other ‘DSO exportable’ runtime modules (which currently are just the ‘llvm’ and ‘c’ modules). Every primitive function must implement the packed-func calling convention and be marked as publicly visible so that the final DSOLibrary can retrieve it at runtime. Any compilation options, and even which overall compiler should be used (gcc? nvcc?), must be provided to export_library, and it is assumed those options are also valid for any other CSource modules accumulated during compilation.
  E.g. src/relay/backend/contrib/cutlass/codegen.cc::CutlassCompiler and python/tvm/contrib/cutlass/build.py::build_cutlass_kernels.
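To make the ‘built-in’ flavor concrete, here’s a rough sketch of the extern-call-to-shim pattern (essentially what tvm.contrib.cublas.matmul in python/tvm/contrib/cublas.py does today): the codegen just emits a te.extern call to a packed-func shim which is already compiled into the TVM runtime, so no extra compilation steps or export_library options are needed.

```python
import tvm
from tvm import te


def dense_via_runtime_shim(A, B):
    """Sketch of the 'built-in' flavor: lower a matmul to an extern call to a
    shim (here the cuBLAS one from src/runtime/contrib/cublas/) which was
    registered with TVM_REGISTER_GLOBAL when the runtime was built."""
    n, _ = A.shape
    m, _ = B.shape
    return te.extern(
        (n, m),
        [A, B],
        lambda ins, outs: tvm.tir.call_packed(
            # Looked up by name in the runtime's global function registry.
            "tvm.contrib.cublas.matmul",
            ins[0], ins[1], outs[0],
            False, True,  # transa, transb
        ),
        name="dense_cublas",
        dtype="float32",
    )
```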
Please let me know if I’ve missed something there; I’m trying to infer the overall pattern from only a few examples.
Now from Collage’s pov we’d like each backend to be:
- Self-contained: no additional special build steps should be needed after the external codegen machinery is invoked.
- Independent: Different backends can be freely mixed, provided they agree on basic architecture, eg Target(“cuda”).
- Configurable: Any special compilation options should be conveyed via the existing Target machinery and its attributes.
- Maximize sharing: Any common library or runtime code should be shared between all primitive functions for the same backend.
- No run overhead: A call to a primitive function should not require any tuning, engine initialization or other overhead which can’t be cached between calls.
- Easy deployment: The result of export_library should be copy-deployable to another machine, i.e. it should not contain any implied dependencies on other .so files.
I’m sure you see where I’m going with this: for the CSource flavor of backend the user does not know which combination of backends Collage has chosen, and thus may not know which combination of compiler options must be provided to export_library. Since currently Collage’s only CSource-flavor backend is CUTLASS we can hack our way around this, but I’d like to be on firmer ground.
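To illustrate the problem, here is roughly what the final step looks like for CUTLASS today (the paths and flags below are purely illustrative; the real ones are assembled by the helpers in python/tvm/contrib/cutlass/build.py). If Collage also picked a second CSource-flavor backend with different requirements, there is no single cc/options combination the user could safely pass:

```python
import tvm
from tvm import relay

# Toy module standing in for a real Collage-partitioned module.
x = relay.var("x", shape=(16, 16), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda")

# The user has to know that CUTLASS's CSource modules must be compiled by
# nvcc with CUTLASS-specific flags (illustrative values shown here).
lib.export_library(
    "compiled.so",
    cc="nvcc",
    options=["-I/path/to/cutlass/include"],  # hypothetical include path
)
```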
Andrew Reusch’s ‘Artifact’ and ‘Dependent Library’ pre-RFCs touched on this issue: ‘Introduce Artifact, a container for generated code’ and ‘[µTVM] Capturing dependent libraries of code-generated TIR (initially for use in Model Library Format)’.
After more fumbling around than I’d care to admit, I came up with the following approach for CUTLASS in particular, which gives us what we need without taking on the full ‘Artifact’ issue. I’ll include pointers into the Collage mega-branch.
- I extend the ‘external codegen’ Target for “cutlass” with all the options which influence tuning and compilation: https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/src/relay/backend/contrib/cutlass/target.cc#L46
- I change the TECompiler and the RelayToTIRTargetHook machinery to ensure that before any custom codegen function or pass is invoked the corresponding ‘external codegen’ Target is current: https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/src/relay/transforms/target_hooks.cc#L126
- I require the user to supply both the usual Target(“cuda”) and the ‘external codegen’ Target (see the sketch after this list): https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/tests/python/contrib/test_cutlass.py#L268
- I change CUTLASS from function-at-a-time to IRModule-at-a-time custom codegen, using the RelayToTIRTargetHook machinery. The main compilation now tunes, produces C, and compiles it to a .o with options drawn from the current Target: https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/src/relay/backend/contrib/cutlass/target.cc#L43 https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/python/tvm/contrib/cutlass/build.py#L480
- I introduce a new StaticLibraryNode runtime::Module. Its job is simply to convey the contents of a generated .o file from the external codegen function to the final export_library compiler invocation via the existing IRModule “external_mods” attribute. https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/src/runtime/static_library.cc#L45
- There are additional shenanigans beyond that:
  - the VM and Graph/AOT build paths differ in their use of Inline;
  - only ‘main’ can pass through the Graph/AOT build keyhole;
  - finishing some refactoring to make relay.build list-of-targets friendly;
  - making target hooks support both the inlined and outlined conventions for Compiler=“foo” functions;
  - centralizing the ‘is dso module’ predicate as a method on runtime::Module;
  - supporting a little bit of metadata on static_library so we can at least confirm the expected primitive functions have been implemented;
  - and so on.
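Putting those pieces together, the intended user-facing flow looks roughly like the sketch below. The ‘cutlass’ target option names (sm, use_fast_math) follow my branch’s target definition and may still change; the point is that all backend-specific configuration travels on the Target, and export_library no longer needs any backend-specific arguments.

```python
import tvm
from tvm import relay

# Toy module standing in for a real Collage-partitioned module.
x = relay.var("x", shape=(16, 16), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

# The generic CUDA target plus the 'external codegen' target carrying the
# CUTLASS-specific tuning/compilation options (option names from my branch).
cuda = tvm.target.Target("cuda")
cutlass = tvm.target.Target({"kind": "cutlass", "sm": 80, "use_fast_math": False})

# relay.build accepts the list of targets; the 'cutlass' target is made
# current before the CUTLASS RelayToTIR hook runs, and the resulting .o
# rides along in the IRModule's "external_mods" attribute as a static
# library module.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=[cuda, cutlass])

# No backend-specific compiler or flags are needed here any more: the
# pre-compiled static library is simply linked into the final .so.
lib.export_library("compiled.so")
```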
Obviously these are way more changes than I’d like, but I think even without Collage it’s a Good Thing to make external codegen more compositional. I’m not sure how/if this intersects with or complements UMA. I can start to peel off PRs, but I wanted to give the Big Picture first.
Best, -m