[BYOC] Supporting CUTLASS BYOC with Collage

Hi all, at the moment our Collage (https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md) implementation is focussing on standard CUDA ‘backends’: TensorRT, cuDNN, cuBLAS and CUTLASS. We’ve noticed three styles for these backends:

  • Custom runtime::Module flavor: At build time the external codegen function produces a JSON runtime module which is serialized into the overall .so file by export_library. At runtime the deserialized runtime module Init sets up the execution environment (which may involve compilation, tuning, caching, etc), and Run actually triggers execution.

    Eg src/relay/backend/contrib/tensorrt/codegen.cc::TensorRTCompiler and src/runtime/contrib/tensorrt/tensorrt_runtime.cc::Run.

  • Built-in flavor: At build time the external codegen function produces TE consisting of an extern call to a shim function built into the TVM runtime. The choice of which extern to call and any additional configuration arguments may be made after tuning. The standard TVM lowering flow is then invoked to compile this to an llvm runtime module which is linked by export_library. At runtime the shim is executed without any further overhead.

    Eg src/runtime/contrib/tensorrt/tensorrt_runtime.cc registers relay_to_runtime, and src/runtime/contrib/tensorrt/tensorrt_runtime.cc binds the corresponding shims.

  • CSource flavor: At build time the external codegen function produces a CSource runtime module which is actually compiled by export_library along with all other ‘DSO exportable’ runtime modules (which currently are just the ‘llvm’ and ‘c’ modules). Every primitive function must implement the packed func calling convention and be marked as publicly visible so that the final DSOLibrary can retrieve it at runtime. Any compilation options, and even which overall compiler should be used (gcc? nvcc?), must be provided to export_library, and it is assumed those options are also valid for any other CSource modules accumulated during compilation (see the sketch after this list).

    Eg src/relay/backend/contrib/cutlass/codegen.cc::CutlassCompiler and python/tvm/contrib/cutlass/build.py::build_cutlass_kernels.
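To make that last constraint concrete, here is a rough sketch (not the exact CUTLASS flow) of what a CSource-flavor backend asks of the user today. `mod` is assumed to already contain Compiler=“cutlass” partitioned functions, and the compiler/flags passed to export_library are illustrative rather than the required set:

```python
import tvm
from tvm import relay


def export_with_csource_backend(mod, params=None):
    # 'mod' is assumed to already carry Compiler="cutlass" partitioned functions.
    lib = relay.build(mod, target="cuda", params=params)
    # The user has to know a CSource backend (CUTLASS) was chosen in order to
    # pick a compatible compiler and options for the *whole* exported library.
    lib.export_library(
        "compiled.so",
        cc="nvcc",  # compiles every accumulated CSource module
        options=["-I/path/to/cutlass/include", "-O3", "--std=c++17"],
    )
    return lib
```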

Please let me know if I’ve missed something there, I’m trying to see the overall pattern based only on a few examples.

Now from Collage’s pov we’d like each backend to be:

  • Self-contained: no additional special build steps should be needed after the external codegen machinery is invoked.
  • Independent: Different backends can be freely mixed, provided they agree on basic architecture, eg Target(“cuda”).
  • Configurable: Any special compilation options should be conveyed via the existing Target machinery and its attributes (see the sketch after this list).
  • Maximize sharing: Any common library or runtime code should be shared between all primitive functions for the same backend.
  • No run overhead: A call to a primitive function should not require any tuning, engine initialization or other overhead which can’t be cached between calls.
  • Easy deployment: The result of export_library should be copy-deployable to another machine, ie it should not contain any implied dependencies on other .so files.
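As a small illustration of the ‘Configurable’ and ‘Independent’ points, the hope is that each backend’s options ride on its own Target and that backends mix freely once they agree on the base Target(“cuda”). The target kinds and attribute names below are hypothetical, assuming each backend registers its own ‘external codegen’ target kind:

```python
import tvm

# Base architecture every chosen backend must agree on.
cuda = tvm.target.Target("cuda")

# Hypothetical external codegen Targets carrying their options as attributes
# instead of ad hoc export_library flags.
cutlass = tvm.target.Target({"kind": "cutlass", "sm": 80, "use_fast_math": False})
tensorrt = tvm.target.Target({"kind": "tensorrt", "use_fp16": True})

# Collage (or the user) can then mix backends simply by listing their Targets.
targets = [cuda, cutlass, tensorrt]
```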

I’m sure you see where I’m going with this: for the CSource flavor of backend the user does not know which combination of backends Collage has chosen, and thus may not know which combination of compiler options must be provided to export_library. Since currently Collage’s only CSource-flavor backend is CUTLASS we can hack our way around this, but I’d like to be on firmer ground.

Andrew Reusch’s ‘Artifact’ and ‘Dependent Library’ pre-RFCs touched on this issue: “Introduce Artifact, a container for generated code” and “[µTVM] Capturing dependent libraries of code-generated TIR (initially for use in Model Library Format)”.

After more fumbling around than I’d care to admit I came up with this approach for CUTLASS in particular which gives us what we need without taking on the full ‘Artifact’ issue. I’ll include pointers into the Collage mega-branch.

  1. I extend the ‘external codegen’ Target for “cutlass” with all the options which influence tuning and compilation (sketched after this list): https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/src/relay/backend/contrib/cutlass/target.cc#L46

  2. I change the TECompiler and the RelayToTIRTargetHook machinery to ensure that before any custom codegen function or pass is invoked the corresponding ‘external codegen’ Target is current: https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/src/relay/transforms/target_hooks.cc#L126

  3. I require the user to supply both the usual Target(“cuda”) as well as the ‘external codegen’ Target: https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/tests/python/contrib/test_cutlass.py#L268

  4. I change CUTLASS from function-at-a-time to IRModule-at-a-time custom codegen, using the RelayToTIRTargetHook machinery. The main compilation now tunes, produces C, and compiles to a .o with options drawn from the current Target: https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/src/relay/backend/contrib/cutlass/target.cc#L43 https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/python/tvm/contrib/cutlass/build.py#L480

  5. I introduce a new StaticLibraryNode runtime::Module. Its job is simply to convey the contents of a generated .o file from the external codegen function to the final export_library compiler invocation via the existing IRModule “external_mods” attribute. https://github.com/mbs-octoml/mbs-tvm/blob/aace6e4739ad044cbb7ebe5eda9ea72a04a6c644/src/runtime/static_library.cc#L45

  6. There are additional shenanigans to finish off:
     • the VM and Graph/AOT build paths differ in their use of Inline;
     • only ‘main’ can pass through the Graph/AOT build keyhole;
     • some refactoring needs finishing to make relay.build list-of-targets friendly;
     • target hooks must support both the inlined and outlined conventions for Compiler=“foo” functions;
     • the ‘is DSO module’ predicate is centralized as a method on runtime::Module;
     • static_library carries a little metadata so we can at least confirm the expected primitive functions have been implemented;
     • and so on.
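Putting steps 1–5 together, the user-facing flow on the mega-branch looks roughly like the sketch below. Helper and attribute names (partition_for_cutlass, sm, use_fast_math) are taken from the branch and may change; the point is that both Targets are supplied up front and export_library needs no backend-specific flags:

```python
import tvm
from tvm import relay
from tvm.relay.op.contrib.cutlass import partition_for_cutlass

# A toy model: a single fp16 dense that CUTLASS can offload.
x = relay.var("x", shape=(16, 64), dtype="float16")
w = relay.var("w", shape=(32, 64), dtype="float16")
mod = tvm.IRModule.from_expr(relay.nn.dense(x, w))
mod = partition_for_cutlass(mod)

# Step 3: both the usual CUDA Target and the 'external codegen' Target are given.
host = tvm.target.Target("llvm")
cuda = tvm.target.Target("cuda", host=host)
cutlass = tvm.target.Target(
    {"kind": "cutlass", "sm": 80, "use_fast_math": False}, host=host
)

# Steps 1, 2, 4 and 5 happen inside build: the "cutlass" Target is made current
# for the RelayToTIR hook, which tunes, emits C, compiles it to a .o with
# options drawn from that Target, and stashes the result as a StaticLibrary.
lib = relay.build(mod, target=[cuda, cutlass])

# No nvcc flags needed here any more; the .o already travels via "external_mods".
lib.export_library("compiled.so")
```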

Obviously these are way more changes than I’d like but I think even without Collage it’s a Good Thing to make external codegen more compositional. Not sure how/if this intersects/complements UMA. I can start to peel off PRs but wanted to give the Big Picture first.

Best, -m


cc @cgerum @MJKlaiber @PhilippvK @r.stahl and others on the UMA project for visibility

Thanks Andrew. Just noticed https://github.com/apache/tvm-rfcs/pull/60 has been updated so I’ll read again now. (I don’t do very well following all the comment back and forths, sorry.)

I gave this a proper read through now. I think initially I got distracted by enumerating all of the various runtime::Module arrangements, which I agree are numerous and somewhat addressed by Artifact.

However I don’t think this quite addresses your immediate need, which is to configure the downstream C compiler for CSource modules. That actually reminds me more of [µTVM] Capturing dependent libraries of code-generated TIR (initially for use in Model Library Format) where we prefer to have a way to signal to a downstream pre-compile script that “hey, we require you to link this module against library abc.”

We could consider extending that mechanism to support explicit compiler flags. At present, the consumer of such metadata would be the Project API. We would need to figure out how to consume those in export_library.

cc @mehrdadh @Mousius @alanmacd @leandron

Thanks for (another!) post reference.

I also implemented an approach where the CModuleNode could capture additional compiler flags, and I just naively appended them all in export_library in preparation for the final compile/link invocation. However it seemed very non-compositional to assume the flags would make sense when combined in that way, so I decided simply eagerly compiling to a .o and letting the linker figure it out was cleaner, just as for every other build system I’ve ever used.
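For comparison, here is the shape of the ‘compile eagerly, let the linker combine objects’ idea outside of TVM entirely; the file names and flags below are made up:

```python
import subprocess


def compile_to_object(source, output, flags):
    # Each backend compiles its own generated source with its own flags...
    subprocess.run(["nvcc", "-c", source, "-o", output, *flags], check=True)
    return output


cutlass_obj = compile_to_object(
    "cutlass_kernels.cu",
    "cutlass_kernels.o",
    ["-O3", "--std=c++17", "-Ithird_party/cutlass/include"],
)

# ...and the final link only ever sees object files, so nobody has to decide
# whether one backend's flags are safe to apply to another backend's sources.
subprocess.run(
    ["g++", "-shared", "-o", "deploy.so", "host_code.o", cutlass_obj, "-lcudart"],
    check=True,
)
```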

Note that I’ve verified TVM correctly carries the “external_mods” attribute through compilation all the way to the final ‘build metadata module’ stage and then on to export_library. So it was trivial for the target_hooks Pass for CUTLASS to include its built .o in the result. No additional data structure or plumbing was required; the only weird thing is that target_hooks uses the name “RelayToTIR” but there’s no TIR here.

Also note, in principle if CUTLASS also required linking to some .so then it could also include that in the “external_mods” attribute.
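For illustration, attaching a pre-built module via that attribute could look something like the sketch below; get_attr/with_attr are assumed to behave like the relay.Function attribute helpers, and load_static_library is the loader name on the mega-branch, so treat both as provisional:

```python
import tvm


def attach_external_module(ir_mod, ext_mod):
    # Append a runtime.Module to the IRModule's "external_mods" attribute so the
    # existing build flow carries it through to export_library unchanged.
    existing = list(ir_mod.get_attr("external_mods") or [])
    return ir_mod.with_attr("external_mods", existing + [ext_mod])


# Eg inside the CUTLASS "RelayToTIR" pass (provisional API from the mega-branch):
# static_lib = tvm.runtime.load_static_library("cutlass_kernels.o", ["fused_cutlass_dense"])
# ir_mod = attach_external_module(ir_mod, static_lib)
```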

(Last year Lily tried making “external_mods” a first class IRModule field since it is obviously so fundamental to the external codegen compilation flow, but it appeared others did not see the same thing. Perhaps we’re getting closer to convergence on that front?)

On the external module part, the remaining issue is the bi-directional serializability of runtime.Module (they are not bi-directionally serializable, since things like a DSO are only readable in one direction).

As a result the runtime.Modules are attached as optional attributes, which serves the same purpose but does not get things exactly right.

A better move would be an Artifact-style interface that captures the compiled artifacts while still being bi-directionally serializable. Then runtime.Module becomes the interface just for loading and execution (but not exporting).

The rough state transitions would look like:

IRModule (with extern Artifacts) => Build => Artifact
Artifact => Save => Load => Artifact
Artifact => Export => DSO
DSO => Load => RT.Module
Artifact (JITable) => JIT => RT.Module