[pre-RFC] Compilation Configuration Representation

Thanks for the discussions. I think it is a good opportunity to discuss how we can flow target information through the compilation. Putting down some thoughts.

How to flow target information through compilation

One of the main design goals that we want to move towards is the ability to incrementally transform the code (some of the transformations may not be done in the official build pipeline). Take BYOC as an example: in the future we might invoke a custom pass that slices out a subgraph and generates a function that requires a specific target lowering (e.g. CUDA). The diagram below from the TensorIR blitz course shows one example of such a flow:

In summary, there can be two goals:

  • G0: Ability to configure a single standard compilation path.
  • G1: Ability to enable incremental customization (via the Python API), attach constraints (such as BYOC), and then send back to the build function for further lowering.

G0 is certainly sufficient for some of the use cases like tvmc. However, it is also important for us to take inspiration and think more about making G1 a first-class citizen. A natural consequence of G1 is that we will need to preserve certain “target-constraint” information in the IRModule (so previous transformations' decisions are self-contained), either as an attr of a function (e.g. this function has to be compiled for CUDA), or of the IRModule.
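
As a minimal sketch of what G1 could look like today, the snippet below attaches a target constraint as a function attribute; the attribute key "Target" and the way a BYOC-style pass would produce such a function are assumptions for illustration, not an agreed convention.

import tvm
from tvm import relay

# Hypothetical sketch: a custom pass has sliced out a subgraph and now marks the
# resulting function so that later lowering knows it must be compiled for CUDA.
func = relay.Function([], relay.const(0))
mod = tvm.IRModule.from_expr(func)
gvar = mod.get_global_vars()[0]
mod[gvar] = mod[gvar].with_attr("Target", tvm.target.Target("cuda"))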

It would be great for us to collectively think about how to standardize for G1 while still having the ability to support G0.

CompilationConfig and Composite Target

Back to the CompilationConfig itself. I agree with @zxybazh that it looks quite like a special case of composite target and it is useful to discuss whether or not we can simply merge it as a structured Target.

Coming back to the definition of target: if we look at LLVM's target triple, the <arch><sub>-<vendor>-<sys>-<env> format, we can find that it also contains runtime choice information like the ABI for libc, the OS type and so on. So one could argue that choices like the tvm_runtime type and the packed-function API can be part of a composite target (although they do not need to be in the leaf “c” target).

The advantages of having a CompilationOption class:

  • It is a structured class with explicit fields

The advantages of making CompilationOption a composite Target:

  • We still have structured fields with target configurations
  • We get the benefit of being able to tag and record targets
  • CompilationOption can appear as a field or sub-target of something else. Imagine that we need to offload a subgraph to another customized compilation, which may need its own specification of the heterogeneous “targets”.
  • The same API argument (Target) serves both graph-level compilation and operator-level compilation.
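
For illustration, here is a minimal sketch of both shapes: the first line uses the current Target API, while the dict below assumes a purely hypothetical “compilation-option” kind whose name and fields do not exist today.

import tvm

# What exists today: a leaf target with an optional host.
cuda = tvm.target.Target("cuda", host="llvm")

# Hypothetical: CompilationOption expressed as a composite target kind.
# The "compilation-option" kind and its field names are made up for illustration.
hypothetical_composite = {
    "kind": "compilation-option",
    "runtime": "crt",
    "executor": "aot",
    "targets": [{"kind": "llvm"}, {"kind": "cuda"}],
}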

Hi @tqchen and @zxybazh,

cc : @mbaret

What is a Composite Target?

TVM being a multi-target compiler, it would be a bit confusing to use an Array of Targets as another Composite Target – I think it is the terminology that is confusing here.

A composite target sounds like a target that is code-generated intimately in a single codegen path for different devices, rather than a structure that is used by TVM to trigger different codegen flows. I think we all agree we need a way to communicate these Options/Flags throughout the lowering, but I personally would not be in favor of attaching this to a (Composite) target – that results in overloading the term “Target”.

I believe we can still do it as the target is part of the CompilationConfig

CompilationConfig is supposed to contain attributes that are not specific to a target. Thus, they would still be accessible in the IRModule, won't they?

This could also be said with respect to CompilationConfig being available for both graph level compilation and operator level compilation – just that “targets” are part of CompilationConfig.

In summary, what we need to discuss is:

  • What is the best term for the structure that holds information/flags that are target independent and holds the set of targets at the same time?
  • Moreover, it would be great to reserve the term “Composite” target for a target that intimately code-generates to multiple devices without divergence in the compilation pathway.

Thanks for the discussions. To begin with, I am not that attached to the particular choice of name. We can, for example, decide to introduce another target kind (“hetero-target”, “myawesome-target”, “platform”, “CompilationOption”) whose attr fields match exactly those of CompilationOption.

I think our discussion boils down to the following question:

What can be called a “Target” in TVM

Intuitively, to many users, target refers to the “target platform” or environment that they want to run the program on. In a typical clang target triple, the following elements can be part of a target:

  • ISA (x86, arm, riscv)
  • runtime library (musl, libc)
  • operating system env (windows, linux)
  • vendor

Of course in most of the settings here target refers to a single device, usually with a single codegen path. These are targets at the leaf level.

However, as we start to build compilers for ML, the “target” in users' minds is different. For example, I want to run my program as fast as possible on aws/c4.4xlarge, or nvidia/jetson-nano. Some of these “targets” already involve multiple codegen paths (host code and device code). When we start to involve graph or vm for the high-level program driver, the vm/graph/aot choice is another codegen path on the driving path of the program.

As the field evolves, the concept of “target” can change further. Right now we are talking about a single SoC with multiple devices. What if we develop an interest in deploying onto the following distributed environment?

- machine0:
   - host: x86
   - vdevice0: cuda
- machine1:
   - host: arm
   - vdevice0: vulkan

We might also be interested in the following byoc customization where we offload part of the computation to the byoc-myawesome-cuda strategy, which needs a self-contained specification of host and library targets that makes use of the cuda-graph runtime. We want to embed it in a vm runtime that invokes byoc-myawesome-cuda as an opaque function.

- host: x86
- vdevice0: byoc-myawesome-cuda
    - host: x86
    - runtime: cuda-graph
    - vdevice0: cuda
    - library: tensor-rt
- vdevice1: cuda
- runtime: vm

Can we call the above descriptions “targets”? From a UX perspective they certainly can be called targets, since from the user's perspective they are specifications of the “target environment”. In the context of machine learning they usually go beyond a single codegen path.

Another thing to note here is that some of these examples require a level of compositionality that goes beyond two levels (target then compilation-option). In the multi-machine setting, the per-machine setting roughly maps to the CompilationOption being used here. Similarly, in the case of byoc-myawesome-cuda, vdevice0 itself would benefit from its own runtime specification. Another concept (another target kind) would need to be introduced to support the top-level composition.

UX Benefit of a Target – Tagging

Besides the benefit of compositionality, one major UX benefit of target is the ability to tag. It can be really complicated to manually specify a compositional compilation option. In most cases, we want users to directly leverage pre-built tags. For example, build for nvidia/jetson-nano:cuda, build for aws/c4.4xlarge, build for arm/soc-name:aot (which directly implies unpacked_api). These tags create shorthands for us to set up the compositional configurations.

The ability to let the build function take in tags that quickly map to codegen, runtime, and library configurations would greatly improve the overall user experience. Making CompilationOption (or whatever we decide to call it) a Target would allow us to reuse this feature effectively and recursively.
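
As a small illustration of the tagging mechanism that already exists (assuming the tag below is registered in the TVM build being used):

import tvm

# Expand a registered tag into a fully populated Target.
target = tvm.target.Target("nvidia/jetson-nano")

# List the tags known to this build of TVM (a dict of tag name -> Target).
tags = tvm.target.list_tags()
print(target.kind, target.attrs)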

Discussions

The main discussion point here is the scope of target. As we can see:

  • A0: On one hand, we can say that the configuration strictly follows a two-level structure, where target is at the leaf and specifies a single codegen path, while we use a separate name for the top-level compositions.
  • A1: On the other hand, we can see the need for:
    • More than two levels of composition
    • The UX need to reuse the tagging mechanism and simplify users' inputs to the compiler.

From a two-level compositional view, personally I think reusing Target for CompilationOption is not strictly more complicated, modulo the right kind naming, while the needs in ML can certainly go beyond that. This makes me think going for target compositionality is not a bad idea.

I agree with @tqchen that improving composite targets could be more beneficial and general. We (with @junrushao and @zhiics) previously attempted to improve the target system to allow more flexible attributes, such as a pass sequence / runtime / etc specifically for the target, which is very similar to what TQ illustrated and what this RFC proposed, but found that it’s not an easy task due to the current target system implementation.

Meanwhile, the concept of compilation configuration has been used for some BYOC backends already, but they are currently relying on PassContext. For example, TensorRT codegen takes the configuration from PassContext during relay.build:

from tvm import relay
from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt
import tvm

mod, config = partition_for_tensorrt(mod, params)
target = "cuda"
with tvm.transform.PassContext(opt_level=3, config={'relay.ext.tensorrt.options': config}):
    lib = relay.build(mod, target=target, params=params)

Although the config here is generated internally, I think this could still be a good driving example to see how we could make a composite target that incorporates the backend-specific config.
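
Purely as an illustration of that direction, the backend options could conceptually fold into the target itself. The structure below is hypothetical (the option names are ones partition_for_tensorrt accepts today, but the "byoc" field and its placement are not an existing target schema):

# Hypothetical composite-target shape if TensorRT options moved out of PassContext.
hypothetical_target = {
    "kind": "cuda",
    "host": {"kind": "llvm"},
    "byoc": {
        "kind": "tensorrt",
        "use_implicit_batch": False,
        "remove_no_mac_subgraphs": True,
    },
}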

Thank you @Mousius for the RFC! It's great to read about potential user experience issues of the current Target system, and I'm happy to discuss potential ways to improve it.

Proposed APIs in the RFC

CompilationConfig, as proposed in this RFC, aims to improve UX by wrapping a list of targets, runtime and execution information in an extra layer of abstraction.

The core API is demonstrated in the RFC as:

config = CompilationConfig(
    target_host=target,
    targets=[target],
    executor=Executor("aot", {"interface-api": "c", "unpacked-api": True}),
    runtime=Runtime("crt", {"system-lib": True})
)

To improve the developer experience, a few other APIs are proposed along with the data structure:

CompilationConfigNode::GetExecutor();
CompilationConfigNode::ShouldLinkParams();

The compilation workflow changes from building with Target to building with CompilationConfig, as demonstrated below:

// The current API
void Build(IRModule mod, const Target& target, ...);
// The proposed API
void Build(IRModule mod, const CompilationConfig& config, ...);

Existing Work

As proposed in the target specification and composite target RFCs, the existing effort converges to the following items.

First, host is folded into the Target object, and the target_host parameter in existing build APIs is, in fact, left for backward compatibility. The CheckAndUpdateHostConsistency API developed by @zxybazh is only used for backward compatibility reasons. Right now, the canonical way to specify targets with a customized host is as easy as:

target = tvm.target.Target("cuda", host="llvm")

Second, in terms of multi-target and heterogeneous support, composite target is adopted as the current approach. Comparing composite target, which is a target host plus a list of targets, with the proposed CompilationConfig, which is also a target host plus a list of targets, it seems to follow very much the same idea, while CompilationConfig is an extra layer of abstraction.

Third, the canonical form of a Target is a JSON object, not a plain string. The target implementation already supports hierarchical parsing, e.g. target inside target inside array, etc. To support executor and runtime with attributes, we could extend the parser to support converting a JSON sub-object to an Executor/Runtime object, which is very much doable.
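
For illustration, a minimal sketch of what such an extended JSON form might look like; the "executor"/"runtime" sub-objects are the hypothetical extension being described here, not something the parser accepts today.

# Hypothetical JSON target with executor/runtime sub-objects.
extended_target_json = {
    "kind": "llvm",
    "mtriple": "aarch64-linux-gnu",
    "host": {"kind": "llvm", "mtriple": "x86_64-linux-gnu"},
    "executor": {"kind": "aot", "interface-api": "c", "unpacked-api": True},
    "runtime": {"kind": "crt", "system-lib": True},
}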

Discussion on the RFC

Overall, the RFC brings a dramatic change to the compilation infrastructure. This effort enforces a new assumption that we only have a single executor and a single runtime. However, I could see clean alternatives with more expressiveness, less effort required, and no breaking change, which achieve the same goal.

First, under our unified IR efforts, the compilation in TVM is heading towards an IRModule-to-runtime::Module abstraction. The executor, to the best of my understanding, is a runtime object that executes the artifacts that some BaseFuncs lower to. For example, the VM executor interprets VM bytecode, while the AOT executor may run the binary directly. Right now, there are some leaky abstractions, but our goal should be aligned under the direction that we address those leaks instead of bringing in more.

Second, the proposed APIs seem possible to implement with straightforward helper functions under the current abstraction. To give a line-by-line example:

ir_mod->GetConfig() -> CompilationConfig; // proposed in the RFC
GetTarget(ir_mod) -> Target; // alternative

ir_mod->GetExecutor() -> Executor; // proposed in the RFC
GetExecutor(ir_mod) -> Executor; // alternative

ir_mod->GetConfig()->ShouldLinkParams() -> bool; // proposed in the RFC
ShouldLinkParams(ir_mod) -> bool; // alternative

In short, using the accessor pattern here doesn't bring actual benefits, and can be replaced by simple helper functions.

Third, the RFC text doesn’t mention how it could improve the UX in TVMC command line. However, I would argue that the UX could be improved simply with target tags. For example, on CUDA GPUs, our target tag system supports creating CUDA targets with a single short string:

target = tvm.target.Target("nvidia/geforce-rtx-3070")

This carries all the information needed for a device, as long as it is registered in our system, including compute version, shared memory size, local memory size, etc. This could perfectly solve the UX issue in TVMC by simply allowing target tags as arguments:

tvmc --target "nvidia/geforce-rtx-3070"

Last, there are cases where multiple executors work together. For example, if we want to offload some fragments to the TensorRT executor, some to CUDA graph, while keeping the rest in the VM, then the Relay function could potentially be partitioned into 3 Relay functions that target different executors. With composite target, we are able to attach different executors in the Target object in a more explicit way.

Conclusion

When designing the Target spec, it was intended to be considered a synonym for CompilationConfig. I may not have all the context here and my understanding could be limited, but as someone heavily involved in the Target design, from my PoV the benefit of the RFC currently seems limited to issues Target is already able to address. Happy to chat more!

Thanks for the interesting discussion.

@tqchen @junrushao ,

In terms of the definition of the target, I see two categories of arguments presented here:

C1 : The executor and runtime should belong to the target – even if it means duplication.

C2 : The targets should be hierarchical and recursive

For C1, I would rather use this argument to make runtime and executor attributes of the target than to support calling an Array of Targets another target. I can see this being true in the following scenario (as pointed out by @tqchen), if it is a scenario we want to target.

The following scenario is motivated by the fact that it is economical to run a single model inference across multiple machines considering the data transfer costs of intermediary tensors of the model. I just want to make sure this is something the community considers a compilation scenario that TVM should aim for.

For C2,

The examples presented so far do not seem to go beyond a mostly flat Array of Targets. Maybe in the multiple-machine scenario, an alternative could have been an Array of CompilationConfig (or whatever we decide to call it). However, this would not be viable if we have recursive targets (where the recursion depth > 1).

Do you guys see a likely scenario in which we will have a Composite Target that is composed of Composite Targets of Composite Targets? (i.e. where we can't express the target we want to compile to as an Array of Targets coupled with a host target – I believe the host target differs only in the multiple-machine scenario)

If that is the case, how would TVM establish codegen path divergence (partitioning at different levels of IR) for such a hierarchical target?

Thanks @manupa-arm , building on what you said.

  • I do not think there is a strong contention on C1; the main point is that the target can be recursive. So a target like the following is totally OK:
- kind: hetro-exec
- runtime : crt
- executor: vm
- devices: [ se_scope0, se_scope1 ]

So the argument is not about where or how these fields should be structured in a recursive data structure. Something that looks like a CompilationOption is OK from my pov. But the suggestion is that we make that one kind of (recursive) target, as from a UX pov it can be seen that way.

  • I want to add C3: The ability to leverage tagging in target and improve the overall user experience is a very important factor.

I am going to discuss C2 in a separate post since that warrants more examples.

Oh wow, I’ve been away for a few days and really appreciate the amount of discussion that’s arrived :smile_cat: Thanks @mbs-octoml, @zxybazh, @tqchen, @comaniac, @junrushao and @manupa-arm!

Firstly let’s address a few specifics which may help narrow the discussion slightly:

There's an unfortunate overloading of the terms Executor and Runtime, which is the inherent risk with a diverse heterogeneous compiler :smile_cat:. In this RFC, let's define the Executor and Runtime as specific to TVM's Executor and Runtime rather than the implementation of a Target. How a Target gets generated and linked is outside the immediate scope of the TVM Runtime and Executor as they're designed to invoke the generated Target code.

Thanks @mbs-octoml, I missed some Prior Art here! In tvmc, we have the concept of a configuration as defined in Command Line Configuration Files. CompilationConfig would allow this to be a standard way of defining such a configuration with Targets within it - this meets the needs of the Target tagging which @junrushao and @tqchen are discussing by instead wrapping them into a CompilationConfig that represents the system. Within the Command Line Configuration Files RFC, it defines the <TYPE> and indicates the use of --config for cloud instances. The terminology would shift from a tagged Target to a CompilationConfig here to represent that they exist at two different levels of the hierarchy?

As defined in Migrating Target Attributes to IRModule, splitting the TVM concepts of Target, Runtime and Executor means we can more clearly see what is most relevant to a specific Target, which means that call-site annotations for Target are limited to only options that are relevant to a specific Target rather than to an IRModule. By virtue of working on this RFC's implementation, although we should still land the implementation agreed in the RFC, it does illustrate how we can better manage the representation of this configuration internally to TVM.

One reason not to motivate this as purely a tvmc concern is that tvmc is the CLI interface to TVM, if a user attempts to use tvmc and then moves to a Python script they should not be re-learning the interface to TVM.

This sounds sensible, and also a feature of CompilationConfig is the ability to specify the complete picture of the system which TVM is being built for including all Targets which can be used by all Passes. Specific annotations of storage and execution make sense to be defined at call-sites within the IR rather than at the top level with the IRModule - what CompilationConfig provides is a frame of reference to do those annotations and pick from a variety of Targets and Devices which the IRModule is constrained to. As we continue with Target registered compiler flow customisation, annotating a call-site with a Target will become standardised with the BYOC flow whether partitioned or otherwise to match the expectation you described with partitioned Targets.

This doesn’t rule out the possibility of using a composite Target as a Target in the targets list as we’re not redefining how that works here - rather defining a bounding box for system level configuration within TVM.

The end state for this configuration update would be to run a single pass over the CompilationConfig early on to ensure the internal state is correct, using CheckAndUpdateHostConsistency, which guarantees that subsequent Passes such as device or memory planning are safe in making assumptions about the state of the used Targets. Hopefully that clarifies that it's less of a replacement and more of a consolidation of the logic to early in the compilation flow, if these checks are still required :smile_cat: We'd still need to have Target annotations within the IR and that Target will therefore have to be stable during compilation.

Where we’re at

Going over this thread a few times, the discussion revolves around:

M0. Split the CompilationConfig from Target

(CompilationConfig)
-> (Target), (Target), (Target)
-> (Executor)
-> (Runtime)

M1. Recursively allowing Target to represent any system

(Tagged Target)
-> (Target), (Target), (Target)
-> (Executor)
-> (Runtime)

It is my opinion, and the motivation behind this RFC, that better defining the CompilationConfig would relieve cognitive load on the user and provide definitions which can be bounded easily. By continuing to use M1, the term Target becomes increasingly overloaded and difficult for both developers of TVM and, more importantly, users of TVM. This hierarchical terminology has prior art in large-scale cloud frameworks, such as Kubernetes, which uses different terminology for Cluster, Deployment, Service, Pod and Container, which are all different levels of granularity of computing resources; the decision here is both a UX decision and a practical separation of concerns for both users and developers of Kubernetes.

To elaborate on C2: while it is desirable and recommended to have a consolidated runtime and executor choice when possible, naturally there are cases that require a bit of generalization. The multi-machine case is one example.

There are also other examples that can appear on a single SoC. Consider the following scenario, where there is an accelerator that comes with a CPU-like co-processor as controller.

- host: arm
- runtime: vm
- vdevice0: accelerator-with-coprocessor
    - host: risc-v
    - runtime: graph
    - device: my-accelerator

In this case, the host is an ARM chip that drives the overall computation (say through the VM). The co-processor, however, also comes with its own controller that is able to execute a sub-graph of computation, which in turn dispatches to my-accelerator. As a result, we will need to compile a TVM runtime (that may be different from the host one) and use it to drive the graph computation on the co-processor.

To expand on the BYOC case, note that for BYOC that involves a sub-graph, the specification for the BYOC “target” is in nature a “CompilationConfig”-level structure, because we would need to specify the leaf-level target (cuda) as well as the graph runtime (TensorRT or cuda-graph). This brings another need: to be able to embed a “CompilationConfig”-level structure inside a “CompilationConfig”-level target.

Back to the compilation path: I agree that it is important to build a standard pipeline. I would also like to note that we need to design to be compatible with emerging needs. Allowing the target specification to be recursive, while validating it, would help the ecosystem develop these capabilities. Additionally, some of the needs can appear now; for example, we could see a need to have a more flexible VM runtime that drives GPU computation, while offloading subgraphs to cuda-graph (more efficient and less flexible). While it may not be possible to consolidate every compilation path in the beginning depending on the use case we talk about (just like initially we did not have unified single-device and multi-device exec), having a common config API (target) would bring a solid step toward unification as the community works on these cases. It also provides a standard way for the community to do extensions in a composable way, without inventing other things that are not compatible with each other.

In reality, different target kinds may have (slightly) different compilation paths, although they can share a lot in common. In the case of a compositional target like multi-device execution, the compilation pipeline of the multi-device exec needs to divide and then offload to the compilation pipelines of the specific target kinds, then link them together (in our case PackedFunc is our ABI).

Finally, to build on @Mousius's point: allowing target to be recursive does not preclude structure or naming. Targets have kinds and schemas attached to each kind. Further validation can also be done throughout the process. So instead of

(CompilationConfig)
-> (Target-CUDA), (Target-X86)
-> (Executor)
-> (Runtime)

We would get

(Target-Kind=Hetro-Exec)
-> (Target-Kind=CUDA), (Target-Kind=X86)
-> (Executor)
-> (Runtime)

From the UX pov, we do not need to force users to pass in such compositional ones (that is complicated) if they only care about single-device execution (and canonicalize internally).

As a matter of fact, the majority of the use cases we face right now are still single-device scenarios and we want to make these cases simple for the user. CompilationConfig as it is right now is a union class of two kinds of targets:

  • Single-device target where only a host and a target are involved
  • Multi-device target where multiple devices are involved.

Being able to clearly differentiate the two and allow simpler UX for common single device scenario can be a plus for the users.

Regardless of the use case, you will be able to leverage the tagging features at different levels, so users can just pass in

build(mod, target="my-hetro-exec-platform0")

Hi @tqchen, I can understand that a recursive Target could be the solution to a multitude of problems but it also introduces an over-arching ambiguity for both users of TVM and developers. It also creates a maintenance overhead of trying to manage an increasingly diverse definition of Target rather than a set of simple component definitions for use in the TVM compiler.

Coming back to this, the LLVM Target provides a set of constructs specific to a single output, which constrains it and makes it easy to interpret. TVM as a heterogeneous compiler encapsulates many Targets, of which we can have a multitude. TVM Targets can be defined at the same conceptual level as those of other compilers. By taking similar concepts and mapping them appropriately we create not only a good user experience but also a good developer experience where terms are mapped to a single role in the compiler. In this case Configuration represents the entire TVM configuration, and Targets map to the same layer of the hierarchy as the actual backends themselves.

This is a great example of where the Target means something different as you recurse through different levels of Target. To motivate this further we can extend the example (using the Deployment conceptual example from Kubernetes):

M0

(CompilationConfig)
-> (Deployment)
    -> (Target LLVM), (Target OpenCL)
    -> (Executor VM)
    -> (Runtime CPP)
-> (Deployment)
    -> (Target LLVM)
    -> (Executor Graph)
    -> (Runtime RPC)

M1

(Target)
-> (Target)
    -> (Target LLVM), (Target OpenCL)
    -> (Executor VM)
    -> (Runtime CPP)
-> (Target)
    -> (Target LLVM)
    -> (Executor Graph)
    -> (Runtime CPP)

M1 introduces increasing ambiguity whereas M0 provides clear terminology and statically available information. We may choose to introduce the additional level of Deployment or similar in future given the use-cases @tqchen describes around cloud platforms (or not, as necessary, as the compiler evolves). Notably, the concept of Target in M0 is still the same concept as in the other compilers we use as generators.

The Executor and Runtime represented in the CompilationConfig are the TVM Executor and Runtime; Target-specific implementations are kept within the Target itself. This maintains the connection between the Target and the backend in use, whereas the Configuration encapsulates the TVM collective view of the world.

Thus for the above case it’d simply be:

(CompilationConfig)
-> (Target LLVM), (Target CUDA)
-> (Executor)
-> (Runtime)

Taking CUDA as a BYOC Target with a graph partitioner, this would be pre-marked as part of the overall IRModule for the given nodes. This is exactly how BYOC operates today and this RFC does not aim to change this behaviour.

Agree that consolidating all of the paths is going to take time and effort, and dealing with emerging requirements is a long standing need for any software project. As this RFC aims to supersede a previous RFC, future RFCs should aim to further iterate on this concept.

The distinction proposed in this RFC is that Target can continue to prevail for simple use cases where you target a single backend and be wrapped by TVM configuration (however that is defined) internally. The Configuration is the container for the actual complete internal representation for TVM. This can be achieved by checking the incoming type and creating the default wrappers where appropriate, but they’re at different conceptual levels from each other.

Being able to quickly and easily articulate the usage of both Configuration and Target creates a simpler and more approachable project for both developers and users. A further general motivation is the engineering practice to model and define core constructs within the architecture and provide separation of concerns, single responsibility and a clear hierarchy of components.

Thanks @Mousius. Some clarifications: in the case of BYOC, there needs to be a nested level (Target BYOC):

(Target CompilationConfig)
-> (Target LLVM), (Target CUDA)
-> (Target BYOC CompilationConfig)
    -> Runtime = cuda-graph
    -> Target = cuda
-> (Executor)
-> (Runtime)

To build on what you said: I think we all agree that structure is useful. In the case of target, the structure is represented as a specific kind of target on that layer; e.g. we can have a target kind that follows the same terminology you came up with. For example, we can have a target kind that is called Deployment and another target kind that is called CompilationConfig (or a better name), with the additional benefit of being able to use the tagging mechanism.

Hi @tqchen, could you explain why this is necessary? As we integrate Target registered compiler flow customisation, doesn't this just become a Target("cuda-graph") which has the relevant BYOC infrastructure registered to it and Target attributes for configuration?

Given one of the motivations here is to simplify the compiler flow and user experience by creating pre-defined structures rather than introducing more dynamic behaviour, I'd suggest it'd be better to keep Executor and Runtime separated as agreed in Migrating Target Attributes to IRModule, which leaves Targets represented at the correct level of the hierarchy and does not create further confusion as to the definition of a Target. Though it'd be good to hear if others have strong opinions one way or the other.

Hi @tqchen, could you explain why this is necessary

In this case, cuda-graph corresponds to the implementation of the graph executor (a correction: cuda-graph in this case corresponds to the executor) in that BYOC module, and does not correspond to the leaf-level target (CUDA). The BYOC still needs information to specify how to generate the kernels that are fed into the cuda-graph based executor, which is cuda. Additionally, there can be other fields such as libraries (e.g. TensorRT or cuDNN).

In short, some of the BYOC happens at graph level, which means they can benefit from CompilationConfig style compositional configurations.

It’d be better to keep Executor and Runtime separated

The particular RFC proposes to move executor and runtime away from the leaf-level targets (LLVM, C) that generate operator kernels. I agree with that logic – a “c” target does not have to contain an “executor” and “runtime” field because they really specify the components of the “graph driver” side of the program.

To translate this to the target structure, it would mean that the schema of target-kind=Deployment (or another name) would include “executor” and “runtime” fields, but validation would reject a “c” or “llvm” target that comes with such a field.

The remaining difference is whether a structured class is necessary. I wonder if it can be addressed by having some of the structured objects subclass Target and provide helper functions to access these fields, if that is really necessary.

The last thing that I would like to bring up again is the ability of tagging, which can be quite useful to our users. Specifically, using a tag (e.g. “my-hetro-exec-platform0”) to refer to the entire compilation configuration or some of its sub-components, which includes runtime/executor and host/device target specifications. One of the motivations for making things one kind of target is to have that capability in some form.

Trying to summarize and dissect:

  • A0: We all agree that the “executor” and “runtime” fields should be moved to something that is not a leaf-level target
  • A1: There is a discussion about whether or not to make the second-level compositional config a (subclass of) Target
    • A1a: Create a separate struct, with the benefit of explicit fields
    • A1b: Create a separate struct that subclasses Target
    • A1c: Create a specific target kind whose schema corresponds to the second-level class.
  • A2: There is a discussion about the need for recursive support in some of the second-level configs in the case of BYOC, cross-device, and multi-node
  • A3: The capability of tagging a composed (target) config can be useful from a UX point of view.

Thanks for the breakdown @tqchen, there’s some confusion here as to what the current behaviour is which I’ll try to clarify based on your points.

Migrating Target Attributes to IRModule agrees on removing the TVM Executor and Runtime from the Target altogether and attaching them to the IRModule as attributes, due to being more related to the compilation than the Target. This separates the concept of a Target executor/runtime from a TVM executor/runtime.

BYOC modules currently pass configuration via PassContext::Global() and should be able to use the same Target attributes as Targets when Target registered compiler flow customisation has been fully implemented. Currently, BYOC is registered at the same level as Target:

(CompilationConfig)
-> (Target LLVM), (Target CUDA), (BYOC MyCUDA)
-> (Executor)
-> (Runtime)

In future, BYOC should be a Target proper. In both cases of BYOC or Target, there is no need to add hierarchy to Target here as the Graph will be partitioned for MyCUDA before the CUDA target.

In the Command Line Configuration Files RFC, configurations themselves can be referenced by name with flexibility as to how they are defined in tvmc. Thus you can either give a Target configuration a tag for a single Target or name a complete Configuration, both with defined terminology and placement in the overall TVM hierarchy.

With the above series of interdependent works, I believe that A1a is the simplest and most natural for both users (who can reference a complete configuration or a Target tag if desired) and developers (who can easily ascertain the attributes of the TVM compilation from the structure). Both users and developers will benefit from the consistent and straight-forward definitions of terms within the hierarchy which we can include in documentation to explain how they are composed.

To further clarify on the BYOC part, there can be a need for a BYOC module to contain a graph-level property (of the executor that drives the MyCUDA graph) and a kernel-level property (of the code generator that outputs the kernels), so it is indeed hierarchical and, from a functionality pov, close to the CompilationOption.

Another thing that is related is the issue of serializing the configurations into logs. While this is not a requirement right now (most of the tuning serializes only the per-device part of the target), imagine we start to do a global search over a graph; in that case we need a way to serialize the config itself into the log (in this case there is a JSON format).

Of course, both the capability of serialization and tagging can be duplicated in each level of the structure, but can also benefit from some form of uniformity.

@Mousius thank you for raising this RFC and thanks for great discussions everyone.

For the most part I support the originally-proposed RFC.

I fully support A1a here. While it is tempting to try to define Target as a structure which models an arbitrary runtime environment, in practice, the range of runtime environments supported by TVM will change as TVM’s tuning capabilities grow. Additionally, Target currently plays a foundational part in the present AutoTVM design: it describes all of the compiler configuration which could affect a given autotuning measurement, and is therefore used as a key to describe the workload in autotuning logs.

Further, at present, there are things inside Target which do not impact autotuning:

  • --link-params
  • --executor
  • --runtime

Because of this, right now users can get into the undesirable experience of tuning a schedule without one of these parameters, then compiling for deployment with the parameters included, and seeing untuned implementations. Now, I bear some of the blame for this because I started this pattern in Target. However, it’s something we need to get rid of now that we have more tunable schedules landing in microTVM.

The fix for this is to remove these parameters from whatever we use to key the tuning logs. Currently, that’s Target.

So in my book, that’s also the definition of Target right now:

  • the set of options which could influence autotuning on one tvm::runtime::Device.

While I do support the effort to gradually improve TVM's ability to model an arbitrary heterogeneous system (e.g. even those with multiple executors spread across a set of independent machines), modeling this inside Target means that we need to simultaneously confront two questions whenever we want to broaden Target with additional configuration:

  1. does this configuration affect autotuning?
  2. who is consuming this configuration?

Adopting A1a allows us to just answer the second question up front by grouping compiler configuration into data structures according to the compiler component which consumes them. Broadly, we have these areas which may need to consume compiler config:

  • Op-level code-generators (currently, this is the lowest common denominator describing what the Target options cover)
  • Graph-level code-generators (e.g. AOT, Graph, VM)
  • AutoTVM (e.g. parameters which may control scheduling)
  • AutoScheduler (e.g. parameters which may affect TensorIR lowering)
  • flow-level parameters (e.g. parameters which may be in PassConfig but which should potentially be captured into tuning logs such as tir.disable_vectorize)

Organizationally, my position is that it’s better to keep parameters grouped alongside others which are consumed by the same logical component of the compiler. This recognizes that the questions of scoping autotuning and modeling an execution environment are larger than any one RFC and are questions which TVM as a community will continue to refine as new improvements such as AutoScheduler, AutoTIR, etc are introduced. Adopting a composite structure provides a framework to keep things organized as we incrementally improve the compiler rather than defining a single open-ended struct.

This approach then argues for the following:

  • We adopt A1a, a composite top-level configuration structure which consists of pieces mapped to each compiler component
  • We tighten the definition of Target to mean “configuration parameters for a single codegen which affect autotuning.”
  • To accommodate the previous bullet, target_host is hoisted out of Target and becomes its own Target. See commentary in [RFC] Unified device/target/memory scope planning with regards to plans to add human-readable labels to Targets (e.g. dsp-cpu, low-power-cpu).
  • Autotuning keys continue for the moment to be confined to the contents of the Targets.

My position on this discussion is that we should still keep the configuration pieces organized according to the consuming compiler sub-component and express any relations in a sibling top-level structure. Here is an example of that in a futuristic world where we support splitting a model across multiple top-level executors:

{
    "targets": {
        "dsp-cpu": {
            "kind": "llvm",
            "mcpu": "cortex-a72",
        },
        "gpu": {
            "kind": "mali",
        },
        "low-power-cpu": {
            "kind": "llvm",
            "mcpu": "cortex-m0",
        },
    },
    "executors": {
        "dsp": {
            "targets": ["dsp-cpu", "gpu"],
            "target_host": ["dsp-cpu"],
            "executor": "vm",
            "runtime": "c++",
        },
        "low-power": {
            "targets": ["low-power-cpu"],
            "target_host": ["low-power-cpu"],
            "executor": "aot",
            "runtime": "c",
            "flow-config": {
                 "link-params": true,
                 "enable-byoc": ["cmsis-nn"],
            },
        },
    },
}

This is quite a forward-looking example. In practice, the effects of adopting A1a look to me at present like:

  1. Defining target_host as merely one of the sub-Targets included in CompilationConfig
  2. Splitting out the executor, runtime, and link-params keys from Target
  3. Avoiding introducing any recursion, which means I think that we should not adopt that aspect of the Composite Target RFC.

Great discussions so far. I think we have a good picture of what the choices are in terms of the data structures (the As), and we have different preferences in terms of choices.

Before we jump into the particular preferences, it is helpful to look at the different scenarios in which we are using the data structure and objectively analyze them from the following angles:

  • The UX interface
  • The feasibility of each kind of solutions under the needs
  • Possible pros and cons

Notably, the final preferences usually are not disagreements on the objective analysis. For example, I think that we all agree that a recursive structure is more expressive, and that having an explicitly typed config is slightly more convenient than a specific target kind with the same schema for the particular use-cases that involve a two-level structure.

Usually our preference is a result of how we weigh the different needs and pros and cons. Additionally, we may have a specific need (use case) in mind. To make a good choice, we would need to look at a broad class of needs. The bottom line is hopefully we can agree on the objective needs and analysis, then use them as a basis to talk about the choice (which involves preference).

It is also very helpful for us to review the previous RFCs that led to the current suggested design of Target and Composite Target.

N0: Common use case, single device with host

While a lot of the motivation for config comes from heterogeneous devices, which is important, the most common use case we have right now is still the single-device scenario. Of course, as with CUDA, single device usually means there is a need for a host driver. So one of the key needs is how to make this type of usage as streamlined as possible.

From the user's point of view, the program itself is as plain as “CUDA”. However, there are two different states of functions during the phase of transformation:

  • E0: A mixed host-device program
fn () {
   // cuda part
   b = alloc("global", size)
   launch cuda kernel 1 {
   }
   launch  cuda kernel 2 { 
   }
}
  • E1: A device program
   launch cuda kernel 1 {
   }

Both E0 and E1 can appear in different phases of transformation. From the users' point of view, it is extremely helpful for them to be able to have attributes that specify the constraints on both kinds.

In the convention right now, E0 is achieved by the host field in a Target, while E1 is simply a device program. Under the two-level config view, the host of E0 would be obtained from the context Config (per the target_host field).
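
A minimal sketch of the two conventions using the current Target API (the variable names are only for illustration):

import tvm

# E0: a mixed host-device function constraint, expressed via the host field.
mixed_program_target = tvm.target.Target("cuda", host="llvm")

# E1: a pure device function constraint, no host attached.
device_program_target = tvm.target.Target("cuda")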

  • From a UX pov, directly passing in a Target with an optional host field presents a simple API for this particular use case.
  • Having host under Target makes the constraint more explicit at the function level and differentiates E0 and E1.
  • For the more complicated heterogeneous cases, having host under target would cause duplication, in which case a consistency checker and updater is needed.
  • Having an explicit host in the target can help the case where there are multiple host envs, although this is a rare case.

I will skip the personal preference comments for now.

N1: Embed into other systems

In a lot of cases we are thinking about generating a program for which TVM takes full control of the allocator, device management and so on. So there can be a temptation to enforce precise heterogeneous device info everywhere. On the other hand, at the PrimFunc level, we also need to be able to embed into other systems, and take decisions from the calling env. For example, in most of the CUDA op-level cases, we generate functions that work on any GPU and switch the context based on the device_id and type from the arguments.

For this particular need, we need to keep the target specification simple at the boundary level, involving only host and device information, while leaving some of the device-planning information to the driving part.

N2: Tagging and quick reference

The ability to tag and reference a configuration as a whole is one of the key designs of the Target system. From the user's point of view, they do not necessarily care about the codegen-level concepts. Instead, it is important to present the target environment as a whole. See the following example tags:

  • aws/c5: cloud instance name
  • arm/rasp4b: soc board name
  • nvidia/jetson-nano:cuda: soc board name

From the users' pov, what they ultimately care about is what they want to deploy to. Being able to refer to the setting (or part of the setting) through tagging is important for that experience.

N3: Represent a complicated heterogenous environments

One of the main motivations of the second-level Config is to represent a more complicated heterogeneous environment, different from N0. Under such cases, there is a desire to propagate some of the (virtual) device and memory scope information across functions.

For this particular use case, an explicit config offers a clear structure. A specific target kind with a schema that follows the config can also implement the same feature.

One possible choice is to model everything in this way, as complicated cases cover simpler setups through another layer of wrapping. Fitting simpler common scenarios into a two-level setting may bring additional complications in UX, especially if there is an ask for explicit construction.

N4: Ability to decompose

Throughout the compilation and transformations, in a lot of cases we are decomposing problems into smaller problems. A function in an IRModule can represent different granularities as we decompose:

  • A multi-machine program into single-machine ones
  • A multi-device program into driving calls to single-device, host-driving functions, still invoked through PackedFunc (which contains a host part)
  • A single-device, host-driving program into device and host functions.

In the BYOC flow

  • A mixed-BYOC strategy program into multiple functions, each with its own BYOC target
  • There can be a need for downstream BYOC to further decompose that into a graph-level executor config and a single-kernel codegen setting.

Throughout the transformations we decompose, and likely also tag the functions with possible constraints (that this particular function must satisfy). Having a common base for the constraints (for functions at different granularities) is helpful, given that the nature of the framework is to be able to support and be future-compatible with these decompositions.

N5: Automation needs

This ties back to N4. We need a common base config to indicate the constraints that the auto-tuning environment presents. Our most common case right now is the single-device-with-host setting. In such cases, the target itself is only needed as part of the log.

If we see the automation need as the need to be able to search over transformations of a program, subject to certain “target constraints”, then naturally we will extend the scope to handle functions at different levels (related to N4). Graph-level tuning would be one such example.

Considering the need to unify the automation infrastructure, it is certainly very helpful to have a common data structure to represent “target constraints” at different levels (which can include executor configurations), so that there will be one serialization format and relatively streamlined mechanisms to handle all transformation cases (of a single-device program, and the executor-device mixing case).
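
As a rough sketch of what that could build on: the per-device Target already round-trips through its string form today, which is how tuning logs key their records; extending the same idea to a composed config is the aspiration here, not an existing feature.

import tvm

# What exists today: a per-device target serializes to a string and parses back.
target = tvm.target.Target("cuda")
serialized = str(target)              # e.g. "cuda -keys=cuda,gpu ..."
restored = tvm.target.Target(serialized)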

Hi @tqchen, I like your point that we need to be able to a) handle a lot of different setups and b) be adroit at changing focus as we transition from the overall systems view (eg during device planning), to target/host view, to specific device view, and so on. (Oh and I’ve probably broken things in the CompilationConfig stopgap I implemented since it assumes every Target needed for lowering must have a host, which breaks the E1 case.) So I see why folks are keen on the general recursive representation. And I could see that we’d want to replace the ‘config’ accessible from the IRModule as we change focus, especially as we transition into per-Target compilation.

One counterpoint to that approach is the resulting fragility of the passes that depend on it. E.g. I could imagine we end up with a lot of ICHECKS and accessors scattered inside pass impls which may not be apparent from the outside. (It reminds me a bit of the Windows Registry – a wonderfully universal and centralized data structure with opaque dependencies – but that’s unfair!).

Perhaps we could take an intermediate step: explicitly enumerate the family of 'compilation configs' we already have as distinct classes. I think so far that's:

  • just-a-Target, for e.g. lowering without worrying about the host shim
  • HostAndTarget, for your E0 case
  • MultiTarget, which is what I got myself tangled up with in device planning and needed the CompilationConfig to help centralize some logic.

There's going to be a runtime & executor in each of those. We'll also see some semi-generic way to go from cmd-line settings and configs into those classes. But perhaps we just don't worry about that duplication just yet, in return for clarifying what we support today (and saving me from breaking anything else).

Then we could revisit with a more universal & recursive representation, particularly if we want to tackle the x-runtime/x-executor cases.