Targets, CompilationConfig and Collage

Hi all, though our Collage work (https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md) is pretty self-contained, I have found that some changes to Target make both the control of the available ‘backends’ and the book-keeping for candidate partitions much easier. Since those are global changes, and given the discussion on CompilationConfig is still ongoing ([pre-RFC] Compilation Configuration Representation - #54 by areusch), I figured it’s best to check in here to make sure I’m heading in the right direction.

Roughly, Collage needs:

  • A way to convey which BYOC backends are available for implementing partitions.
  • A way to associate a BYOC backend with a candidate partition.

The approach I’ve taken in the prototype (https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch) is:

  • Allow all TargetKinds to have a “compiler” String attribute. (I include it in the TVM_REGISTER_TARGET_KIND macro). With that the user can express, say:
     Target("cuda -arch=sm_80 -compiler=tensorrt")
    
    which is distinct from:
     Target("cuda -arch=sm_80 -compiler=cutlass")
    
    which are both considered specializations of:
      Target("cuda -arch=sm_80")
    
    Collage itself can also just record the Target for candidate partitions, since that object is now sufficient to determine all downstream processing.
  • Allow the ‘target’ argument to the various build entry points to also be a list (in addition to a dict for the legacy heterogeneous case, or a single target for the homogeneous case); see the sketch after this list.
  • Centralize all target & target_host handling in the existing CompilationConfig, using Array<Target> as the generic representation of ‘bag-o-targets’ which the CompilationConfig class is responsible for validating and canonicalizing.
  • When PlanDevices needs to know the Target to associate with a particular DLDeviceType it defers to the CompilationConfig. That class finds the least-specialized available Target. So in the above example, kDLCUDA would map to Target("cuda -arch=sm_80").
  • Some cleanup of the Python target handling code then falls out naturally.
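
To make the user-facing flow concrete, here is a minimal sketch of a build call under the prototype. The -compiler attribute and the list-of-targets support only exist in the prototype branch, and mod is assumed to be an existing Relay module.

import tvm
from tvm import relay

# Bag-o-targets in the prototype's representation: two BYOC specializations of
# the same CUDA target, the plain CUDA target, and a CPU/host target.
targets = [
    tvm.target.Target("cuda -arch=sm_80 -compiler=tensorrt"),
    tvm.target.Target("cuda -arch=sm_80 -compiler=cutlass"),
    tvm.target.Target("cuda -arch=sm_80"),
    tvm.target.Target("llvm"),
]

# CompilationConfig validates and canonicalizes the list; PlanDevices maps
# kDLCUDA to the least-specialized entry, i.e. Target("cuda -arch=sm_80").
exe = relay.build(mod, target=targets)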

An alternative design is to layer the Collage notion of ‘backend’ on top of targets, and introduce some new entry point or convention by which the user can convey that. However, I went with the above approach because it seemed a graceful extension of the existing heterogeneous target handling, and it elegantly ties targets and BYOC backends together. After all, it does not make sense to try tensorrt on a non-cuda target, and so on.

Let me know what you think. I can peel out a PR from the prototype if that would help, but honestly I don’t think the actual code changes will be very informative.

Best, -Mark

@Mousius, flagging to you since I want to make sure I help rather than hinder your efforts.

Looking at what you’ve posted, I think this isn’t too far from how we’ve implemented our Targets @mbs-octoml; there are a few differences I can highlight.

For our newer Targets we’ve gone the route of having a List[Target] in priority order, which we’re exercising via tvmc. In Python that’s roughly equivalent to this:

import tvm
from tvm import relay

with tvm.transform.PassContext():
    host_target = tvm.target.Target("c")
    lib_target = tvm.target.Target("cmsis-nn", host_target)
    accel_target = tvm.target.Target("ethos-u", host_target)
    # Priority order: most specialized first, generic host target last.
    targets = [accel_target, lib_target, host_target]
    exe = relay.build(mod, target=targets)

The order specified is the order in which we greedily partition the graph for the various Targets, so if you want to find the least special Target for a device you look through the priority list for a given device_type. Target is the “end result” of the TVM compilation process rather than a composite of different layers of outputs. That seems similar to the goals you’re trying to achieve with -compiler, just through a slightly different set of semantics?
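
One way to read that lookup, as a rough sketch (this is a hypothetical helper, not an existing API, and the device_type check is simplified to a set of kind names):

# Hypothetical helper: walk the priority list from the end (least specialized)
# and return the first Target whose kind belongs to the requested device.
CPU_KINDS = {"c", "llvm", "cmsis-nn", "ethos-u"}  # stand-in for a kDLCPU check

def least_special_cpu_target(targets):
    for target in reversed(targets):
        if target.kind.name in CPU_KINDS:
            return target
    return None

# With targets = [accel_target, lib_target, host_target] from above, this
# returns host_target, i.e. Target("c").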

Following from the above, currently we have to start off representing the additional two Targets (cmsis-nn and ethos-u) as external compilers and use a custom registry in tvmc to partition for them:
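
Roughly, a sketch of that current flow using the existing partition_for_* helpers (options omitted; mod and params are assumed to already exist):

from tvm import relay
from tvm.relay.op.contrib.cmsisnn import partition_for_cmsisnn
from tvm.relay.op.contrib.ethosu import partition_for_ethosu

# Each external compiler has its own bespoke partitioning entry point, which
# has to be invoked before relay.build in the order we want to partition.
mod = partition_for_ethosu(mod, params)
mod = partition_for_cmsisnn(mod, params)
exe = relay.build(mod, target="c", params=params)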

After partitioning they’re kCompiler attributes on graph nodes until RelayToTIR finally makes them into Targets, which is a bit of a journey and limits what we can do with them. Ideally there’d be no more kCompiler and we could use the Targets directly. Introducing List[Target] would therefore help solve this, as the Target can have a registered FTVMPartitioner function added alongside FTVMRelayToTIR, such as:

TVM_REGISTER_TARGET_KIND("cmsis-nn", kDLCPU)
    .set_attr<FTVMRelayToTIR>("RelayToTIR", RelayToTIR())
    .set_attr<FTVMTIRToRuntime>("TIRToRuntime", TIRToRuntime)
    .set_attr<FTVMPartitioner>("Partitioner", Partitioner);

I assume you could re-use this as a hook for Collage if you want to ask a Target which nodes it can partition?

I think we run into the same issues with MetaScheduler serialization if we use either a List[Target] or a CompilationConfig here rather than a composite Target? We’d need to unstick that in isolation to move forwards here; I’m in favour of an explicit List[Target] as a solution rather than composite Targets, though.

Thanks so much Chris, sounds like we’re pulling in the same direction.

I much prefer your already-established convention Target(“tensorrt”) to my Target(“cuda -compiler=tensorrt”). I’ll switch to that in the prototype branch and come back with what I’ve learned. At first blush the challenges are:

  • Before partitioning, I need to know when the target’s name corresponds to a kCompiler label, since Collage uses that to directly retrieve the pattern table from the global registry (see the sketch after this list).
  • After partitioning, I’m hazy on how we convey target options to a BYOC build. E.g. what options are used to build generated DNNL code, or generated CUTLASS code? Both of those may become obvious once I try making the change.
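
For the first point, here is a minimal sketch of the lookup as Collage does it today, assuming Target(“tensorrt”) can be constructed (i.e. a tensorrt TargetKind has been registered) and that its kind name doubles as the kCompiler / pattern-table name, which is exactly the assumption in question:

import tvm
from tvm.relay.op.contrib import get_pattern_table

target = tvm.target.Target("tensorrt")
# Assumption: the TargetKind name is also the name the BYOC pattern table was
# registered under via register_pattern_table.
pattern_table = get_pattern_table(target.kind.name)
if pattern_table is None:
    # Not an external codegen target; fall back to regular TVM lowering.
    pass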

I kindly request the community to try to reach consensus on the naming and representation for what is currently in CompilationConfig so that my changes in there are not accidentally controversial.

Best wishes, -Mark

Good to hear we’re aligned @mbs-octoml :smile_cat: I’ll try to answer your questions on our implementation and illuminate the direction I was heading with this pre-Collage.

This is something we’ve wrapped in tvmc by creating the aforementioned registry and then invoking the partitioners ahead of relay.build (partition_function applies the partitioner to mod):
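
A simplified, hypothetical sketch of that registry idea (the dict and function names here are made up for illustration and are not the actual tvmc code):

from tvm.relay.op.contrib.cmsisnn import partition_for_cmsisnn
from tvm.relay.op.contrib.ethosu import partition_for_ethosu

# Map each external compiler name to its bespoke partition function.
REGISTERED_CODEGEN = {
    "cmsis-nn": partition_for_cmsisnn,
    "ethos-u": partition_for_ethosu,
}

def partition_for_codegens(mod, params, codegen_names):
    # Apply each requested partitioner to the module ahead of relay.build.
    for name in codegen_names:
        partition_function = REGISTERED_CODEGEN[name]
        mod = partition_function(mod, params)
    return mod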

I was hoping we could use a List[Target] of some variety (CompilationConfig would also work) in relay.build with a partitioner registered on the Target such as:

TVM_REGISTER_TARGET_KIND("cmsis-nn", kDLCPU)
    .set_attr<FTVMPartitioner>("Partitioner", PartitionForCMSISNN)
    .set_attr<FTVMRelayToTIR>("RelayToTIR", RelayToTIR())
    .set_attr<FTVMTIRToRuntime>("TIRToRuntime", TIRToRuntime);

Then simply looping over them in BuildRelay.

This is a symptom of the same architectural issue as above, in the same portion of the tvmc code: the arguments for the external code generators are collected, and if there’s a PassContext key defined (codegen[“config_key”]) we populate that key with the acquired values. This means PassContext contains the Target options until we eventually pick them back up in our implementation.
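
A rough user-level sketch of how those options travel today; the relay.ext.cmsisnn.options key and the mcpu option follow the CMSIS-NN integration as I understand it, so treat them as illustrative (mod is assumed to be an already-partitioned module):

import tvm
from tvm import relay

# External-codegen options live in the PassContext, keyed per external
# compiler, rather than on a Target.
with tvm.transform.PassContext(
    opt_level=3,
    config={"relay.ext.cmsisnn.options": {"mcpu": "cortex-m55"}},
):
    exe = relay.build(mod, target="c")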

I was hoping we could remove kCompiler entirely in favour of Target so we can just use cmsis_nn_target->GetAttr<String>("mcpu").
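
In Python the hoped-for flow would be something like the following, assuming the cmsis-nn TargetKind grows a registered mcpu option:

import tvm

# The option is carried by the Target itself; no PassContext key needed.
cmsis_nn_target = tvm.target.Target("cmsis-nn -mcpu=cortex-m55")
mcpu = cmsis_nn_target.attrs["mcpu"]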

Hope that helps! :smile_cat:

One fly in the ointment is that Collage cannot use the ‘partition_for_toolchain’ functions. Instead it must work from get_pattern_table(“toolchain”) directly so that it can explore all the mixing and matching between multiple active backends. But I still think your Target(“toolchain”) convention can work; I may just have to do some fiddling with the TargetKind. Please stand by for further transmission.

The draft PR is https://github.com/apache/tvm/pull/11173. @Mousius, it would be very helpful if you could take a quick look before I go too far writing new unit tests and fixing the existing ones. I’ve very likely broken things on the Python side, since it’s hard to tell the difference between internal and external APIs. I couldn’t figure out how to do it without adding a new “is_external_codegen” attribute, which is analogous to the new functional attributes you added.

Thanks for the code @mbs-octoml, I had a look and it seems similar to the tvmc logic, which we can likely replace with Target.canonicalize_target_and_host and hopefully continue to bring alignment to the various interfaces :smile_cat:

What’s missing from my understanding is how the tensorrt Target interacts with the cuda Target inside their individual compilations; perhaps I missed it? I think this is where all these new methods come into their own?

As a follow-up to this, if we follow the Target Hooks RFC to its conclusion, am I correct in thinking we should be able to infer isExternalCodegen from the presence of the RelayToRuntime hook?

Good good. The crossover from, e.g., Target(“tensorrt”) to Target(“cuda -all -my -flags”) is hard to see because it’s not there :slight_smile: There are two gaps. First, in te_compiler.cc LowerExternalFunctions I think we are missing a With<Target> or other plumbing to ensure the target already implied by the function’s virtual device is available to the external pass or codegen function. Secondly, CollagePartition uses Target::IsExternalCodegenFor to check that a partition for Target(“tensorrt”) is compatible with the Target(“cuda -all -my -flags”) already assigned to the sub-graph by PlanDevices. However, I’ve not yet moved PlanDevices to run before CollagePartition.

But if those were addressed then the crossover would be via the regular handling of external codegen within the TECompiler machinery, and I don’t think we need anything more explicit than that.

[edit] Oh, I forgot your second question. Yes, I went through all the pass hooks and realized a) we can probably look for one of the more specific kind attributes, b) Collage is currently compatible with target hooks since they trigger based on “Compiler” attributes anyway, and c) when we are ready to directly assign targets (or virtual devices) to functions instead of “Compiler” names it will be trivial to do so inside Collage. So I think we’re all nicely lined up.

https://github.com/apache/tvm/pull/11173 is now ready for review, though I’m still working through ci failures.