[pre-RFC] Additional Target Hooks

Mousius · July 8, 2021, 3:38pm

Summary

In order to enable flexibility in how individual targets are lowered and built within TVM, this RFC proposes supporting additional hooks on the Target and that the target becomes the central place for such hooks, for example:

TVM_REGISTER_TARGET_KIND("cmsisnn", kDLCPU)
    .set_attr<String>("relay_to_tir", "target.cmsisnn.lower")
    .set_attr<String>("tir_to_runtime", "target.cmsisnn.build");

This defines two new hooks as attributes on the target, referencing functions registered into the central TVM registry. In similar fashion, external generators (currently accessed directly in the compile engine) would be grouped with an appropriate Target as well:

TVM_REGISTER_TARGET_KIND("ethos-n", kDLCPU)
    .set_attr<String>("relay_to_runtime", "relay.ext.ethos-n")
    .set_attr<String>("constant_updater", "relay.ext.ethos-n.constant_updater");

Collecting all targets under the Target functionality and making it clearer which hooks apply to each target.

Motivation

Currently to introduce an external code generator, the entire compilation pipeline must be recreated; this is necessary for some targets but in the case of simply re-using existing libraries or introducing a function call to use for an operator it can become more than is necessary. It also exists outside of the main PrimFunc, meaning it can’t be inspected as part of the entire main graph; this limits the effectiveness of techniques such as memory planning. By introducing the hook relay_to_tir, which is similar to the default lower pass in that it returns TIR, it can be inspected by the memory planner and other analysis passes that only work at the TIR level. If all that is necessary is transforming into a flat call_extern (such is the case for the CMSIS NN Softmax function) then this can be left represented as TIR and be collected by the host code generation.

In the more complex case, we still want to take advantage of memory planning by using relay_to_tir and inspecting the liveness within the TIR graph, but instead want to generate out more complex calls (such as using the CMSIS NN Structures); the tir_to_runtime hook can be used to build our intermediary TIR into a Runtime module similarly to how the existing external code generation works. This allows writing of external code generators that also get the benefits of any intermediary analysis or transformation that TVM offers. Alongside being able to use the analysis passes, code generators can extend from existing host code generators, customising only the generation which is relevant to them and gaining maximum benefit from the existing optimisations made in TVM.

Guide-level explanation

As a user, you can pick from additional hooks to bypass certain behaviours of the Target:

relay_to_tir - Custom lowering direct to TIR
tir_to_runtime - Custom code generation into a runtime module from TIR
relay_to_runtime - Full compilation flow from Relay to a runtime module

To illustrate where the hooks are placed, please refer to the following diagram:

These can be registered on targets using set_attr:

TVM_REGISTER_TARGET_KIND("cmsisnn", kDLCPU)
    .set_attr<String>("relay_to_tir", "target.cmsisnn.lower")
    .set_attr<String>("tir_to_runtime", "target.cmsisnn.build");

TVM_REGISTER_TARGET_KIND("ethos-n", kDLCPU)
    .set_attr<String>("relay_to_runtime", "relay.ext.ethos-n")
    .set_attr<String>("constant_updater", "relay.ext.ethos-n.constant_updater");

Relay → TIR

With this change, this path splits, depending on whether you wanted to generate a full Module or introduce some specific TIR nodes into the code generation flow; the addition of the relay_to_tir hook allows you to write trivial external TIR generators such as calling out to a third party library:

@tvm.register_func("target.woofles.lowering")
def tir_generator(relay_func):
    """A simple TIR generator for testing"""
    ib = tvm.tir.ir_builder.create()
    A = tvm.tir.decl_buffer(shape=(8,8,), dtype=relay_func.params[0].checked_type.dtype)
    B = tvm.tir.decl_buffer(shape=(8,8,), dtype=relay_func.params[0].checked_type.dtype)
    C = tvm.tir.decl_buffer(shape=(8,8,), dtype=relay_func.params[0].checked_type.dtype)
    ib.emit(
        tvm.tir.call_extern('int32', 'woofles', A.data, B.data, 8, 8, C.data)
    )

    prim_func = tvm.tir.PrimFunc([A, B, C], ib.get())
    ir = tvm.lower(prim_func, name=relay_func.attrs["global_symbol"])

    return ir

This is then registered on a target:

TVM_REGISTER_TARGET_KIND("woofles", kDLCPU)
    .set_attr<String>("relay_to_tir", "target.woofles.lowering");

TIR → Runtime

Extending from the above, a second hook is introduced to do further transformations from TIR → Runtime named tir_to_runtime, this bypasses the default target.build.X and instead uses the registered tir_to_runtime build:

runtime::Module BuildWooflesHost(IRModule mod, Target target) {
// ... Custom Code generation here
}

TVM_REGISTER_GLOBAL("target.build.woofles").set_body_typed(BuildWooflesHost);
TVM_REGISTER_TARGET_KIND("woofles", kDLCPU)
    .set_attr<String>("tir_to_runtime", "target.build.woofles");

Reference-level explanation

This functionality is an extension of the existing use of attr::kCompiler to provide a hint that we can use to lookup attached target attribute, the compile engine and code generation flows can choose to store TIR and/or generate runtime modules based on the registered hooks.

Relay to TIR Hook

This can be added into the compile_engine.cc by cross referencing the existing attr::kCompiler with the TargetKind registry:

auto code_gen_name = key->source_func->GetAttr<String>(attr::kCompiler).value();
auto target_kind = tvm::TargetKind::Get(code_gen_name).value();
if (target_kind.defined()) {
    auto map = tvm::TargetKind::GetAttrMap<String>("relay_to_tir");
    std::string custom_lowering = map[target_kind];
    auto lowering_function = tvm::runtime::Registry::Get(custom_lowering);
    cache_node->target = key->target;
    cache_node->funcs = (*lowering_function)(key->source_func, key->target);
    return CachedFunc(cache_node);
}

By placing this where lowering currently takes place, it means minimal changes to executor code generators as they call into Lower in CompileEngine.

TIR to Runtime Hook

Instead of replicating the current external code generation hook, it is proposed that this hook exists in build_module.cc:

auto target_built_mods = FindFuncsWithTargetBuild(lowered_funcs);
auto ext_mods = executor_codegen_->GetExternalModules();
auto extra_mods = ext_mods->Concat(target_built_mods);
ret_.mod = tvm::codegen::CreateMetadataModule(ret_.params, ret_.mod, extra_mods, GetTargetHost(),
                                                executor_codegen_->GetMetadata());

This means the hook is integrated at a higher level and included in the compile flow without executors having to be aware of how these modules exist. See Relay to TIR Hook for how the TargetKind registry would be used.

Relay to Runtime Hook

This would replace the existing relay.ext.<target> lookup in compile_engine.cc, essentially using the same logic as Relay to TIR Hook to cross reference with kCompiler.

Drawbacks

Different hooks are currently dealt with in quite disparate parts of the codebase which are being heavily refactored

Prior art

This is all based upon the existing external code generation infrastructure within TVM by placing additional hooks in the same areas as existing external generation. Instead of replicating this with named functions in the relay.ext. namespace of the function registry it instead begins to follow the pattern outlined as B1 in Target and Attributes by @tqchen and further included in [RFC] TVM Target Specification.

Future possibilities

In future, this approach enables rapid integration of anything that can be represented in TIR into the main compilation graph; this simplifies the transformation process for a multitude of external libraries.

Alongside this, adding further hooks means external code generation can gain benefits from the normal lower and build flow in TVM. This then expands to exposing more granular methods in the driver api to leverage the compiler passes in TVM, similar to how they’ve been exposed in https://github.com/apache/tvm/pull/8110 with lower_primfunc and lower_schedule. This can is then regulated by the normal Target mechanism to route as appropriate.

Refactoring the target splitting logic into build_module.cc alongside any external module generation makes this a first class series of hooks into a simplified compilation flow; this would enable the removal of external generators from executor code generators which currently proxy to compile_engine.cc. Eventually this could also be used for CPU/GPU split as a specialisation of a Target/Targets split.

cc: @jroesch @csullivan @electriclilies @tqchen @mbaret @manupa-arm @grant-arm @ramana-arm

Mousius · July 8, 2021, 3:40pm

Example PR implementing the first hook: Implementation of relay_to_tir target hook by Mousius · Pull Request #8423 · apache/tvm · GitHub

hogepodge · July 19, 2021, 8:34pm

@Mousius would you be willing to lead a discussion on this topic at the upcoming TVM Community meeting, this Thursday July 22 at 9 AM PT?

areusch · August 7, 2021, 12:55am

@Mousius thanks for this RFC and apologies for the long delay. I read this in conjunction with [RFC] Arm® Ethos™-U Integration to try to understand the initial application. I think that should be a sufficient example, but let me know if there are other use cases I should consider.

I discussed this with @jroesch and mbs-octoml at length a couple days ago. Documenting our discussion here.

Overall:

We agree there should be a way to leverage external codegen without recreating the entire compilation pipeline.
We want to ensure that this work is compatible with the ongoing TEcompiler refactor work–specifically, the TE-compiler refactor is now going to move towards unifying Relay → TIR lowering (and later unifying lower down the pipeline) across the Graph, AOT, and VM executors.
To that end, the case for a relay_to_tir hook and a tir_to_runtime hook seems clear. We’d like to clarify the interface of this hook, and propose:
```
relay_to_tir(const IRModule& ir_module, const relay::Function& function) -> (IRModule, GlobalVar)
```
The contract is TVM calls this interface with a read-only view of the IRModule containing function, plus the function in question to lower. The hook implementation should return an IRModule containing one or more functions implementing the lowered Relay function, plus a GlobalVar indicating the symbol name of the “top-level” function of that operator (in case multiple TIR functions are created to implement the operator).

At present, TVM keeps the returned IRModule separate from the remaining lowered code. In the future, as part of the TECompiler refactor, TVM will merge the returned IRModule in with all other TIR functions, handling name conflicts.
For the tir_to_runtime hook, we presume this will follow the existing relay.ext. interface, just it will be specific to the target rather than a compiler attribute marked onto the relay Function.
In terms of user interface: theoretically it should be possible to hand TVM an unannotated Relay function plus a Target which specifies the available CPUs/accelerators, and TVM should leverage its knowledge of schedules to assign functions to devices. Currently, we either specify a mostly-homogenous target or manually mark functions to be run externally. In the future, we’re pondering that the interface could be: either TVM will assign the each function call to a target; or you can override this and mark it manually using a per-call-site or per-function attribute. In this case, the target contained in that attribute is not a composite Target, but instead a shorthand descriptor for one of the pieces of the overall Target. For example, Target could be specified as low_power_cpu: c -mcpu=cortex-m0; inference_cpu: c -mcpu=cortex-m7f, and call sites could be assigned to either low_power_cpu or inference_cpu. Does this sketch of a direction align with how you’d like to enable these target hooks?

Mousius · August 8, 2021, 1:46pm

It’s important to consider it also in conjunction with [RFC] Use CMSIS-NN with TVM by ashutosh-arm · Pull Request #15 · apache/tvm-rfcs · GitHub, as both will provide allocated buffers for the memory planning to inspect. If you consider both at once, alongside default host TECompiler behaviour or in other combinations, it starts to motivate the importance of gathering the TIR nodes for the memory planning passes and likely future others.

This is sensible, one outcome for the hooks work could be that the TECompiler is marked as the default hook for relay_to_tir unless you explicitly override it per-target. By sharing the same interfaces it should mean the actual executors continue to be agnostic as to how the Target is lowered.

areusch:

To that end, the case for a relay_to_tir hook and a tir_to_runtime hook seems clear. We’d like to clarify the interface of this hook, and propose:
relay_to_tir(const IRModule& ir_module, const relay::Function& function) -> (IRModule, GlobalVar)

One of the things that occurs to me now, is that we need to pass the Target into this function as well. CMSIS-NN has specific buffer sizes for different configurations (see: CMSIS-NN Depthwise Buffer) which will need to be inferred from target options such as mattr. Which would make the function signature similar to:

relay_to_tir(const Target& target, const IRModule& ir_module, const relay::Function& function) -> (IRModule, GlobalVar)

This still holds the behaviours you’re interested in, with read-only values being passed in and a fresh return value created. Also, a few questions:

Given the hook already receives the Function which is within the IRModule, we don’t actually need it in many cases? I’m assuming this will be useful for checking if there’s already a similar lowered variant or ensuring name collisions don’t occur?
In the initial design, the function registry was used to refer to the function - do you envisage using a C++ function directly?

I think you meant relay_to_runtime, in which case yes, this will use the signature for the relay.ext. registered functions. The only change is that instead of the registry string concatenated lookup, Target becomes the source of truth.

I’d suggest the attribute isn’t necessary if the partitioning logic can provide a valid Target which is either hardcoded or TVM assigned. Other than that, I agree that the Target specified per-call-site or per-function would be subset of the overall TVM target string given by the user. In the example of a LLVM host with CMSIS-NN and Ethos-U additions, the Target string would look something like ethos-u,cmsisnn,llvm which TVM can use to selectively apply one of ethos-u, cmsisnn or llvm to a given call-site or function. I’d expect manually marking parts of the graph will be part of the TVM selection logic alongside selecting from known schedules.

Mousius · August 24, 2021, 11:35am

Just coming back to this thread, I believe there’s a way to introduce the hooks in a less intrusive way to the LowerTEPass, by placing it just above it in build_module.cc. This should mean each Target can register a relay_to_tir Pass which is ran there rather than having to wire it via LowerTEPass and any PrimFunc will be skipped as they’re not relay functions anymore.

I’ve updated the RFC PR to reflect this: Additional Target Hooks RFC by Mousius · Pull Request #10 · apache/tvm-rfcs · GitHub

What do you think @jroesch / @areusch ?