[pre-RFC] Compilation Configuration Representation

manupa-arm · January 21, 2022, 10:38am

Let me try to summarize the conversation as I understand it. Please feel free to correct me if it’s wrong.

It mainly boils down to the following point:

What should be attached to an IRModule and what shouldn’t? According to @tqchen’s description above, it should be C1-style “constraints” and not C0-style “how”. The argument is that C0-style information is configuration for individual passes and is not broadly applicable to all transform passes, so attaching it leaves pass implementations unsure what to do with it.

Mousius:

config = CompilationConfig(
    target_host=target,
    targets=[target],
    executor=Executor("aot", {"interface-api": "c", "unpacked-api": True}),
    runtime=Runtime("crt", {"system-lib": True})
)

According to the definition of C0 and C1, the above information should be classified as C1. Therefore, do we all agree that it is reasonable to attach it to the IRModule?

As a first step, if we all agree, it would be great to unblock ourselves by using a C1-styled CompilationConfig attached to the IRModule in the short/medium term. @tqchen @Mousius @areusch: an explicit response to this question would be highly appreciated.

Now coming back, in today’s state of TVM, C0-style broadly refers to PassContext; I’m not aware of anything else. Therefore, the current point presented argues against putting C0-styled PassContext either as a) an IRModule attribute or b) part of the C1-styled CompilationConfig that is already agreed to be attached to the IRModule.

Then, for future work, we should think about the “necessity” of keeping the C0-styled PassContext as a side channel (or in fact a singleton). IMHO, this slightly contradicts @jroesch’s proposal of integrating the whole compilation pipeline as IRModule → IRModule transformations, because we would be committing ourselves to maintaining a side channel driven by the need to separate C0-styled from C1-styled information.

Therefore, it would be great to explore options for attaching/packaging all the information that “might” be required by passes (C0+C1), not just the “minimum” (C1). We thought this could be done by attaching it to the IRModule, so that we could export it without requiring any other side channels. However, we are open to hearing alternatives.

tqchen · January 21, 2022, 3:22pm

Thanks @manupa-arm, these clarification points are helpful.

First of all, the discussion of C0-style and C1-style is at the design-principle level and does not tie into the actual choice of data structure or implementation. In some cases it could imply certain preferences, e.g. calling a C0+C1 combo a CompilationConfig certainly makes sense, and if it is a C1-only thing, something in the target namespace is more natural. But let us first separate these concerns and not talk about the choice of data structure.

The main design principle is as follows:

At the IRModule and individual pass (IRModule->IRModule) configuration level, C1-type config should be attached to the IRModule, while C0 should be handled separately (and not attached to the IRModule).
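
As an illustration of this principle, here is a minimal sketch in plain Python rather than the real TVM API. `ToyModule`, `tile_loops`, and the `"max_threads"` attribute key are hypothetical names chosen for the example. The C1 constraint travels with the module, while the C0 option is handed to the pass by the caller and is never stored on the module.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class ToyModule:
    """Stand-in for an IRModule: per-function tile sizes plus C1-style attrs."""
    functions: Dict[str, int]
    attrs: Dict[str, int] = field(default_factory=dict)


def tile_loops(mod: ToyModule, tile_size: int) -> ToyModule:
    """Hypothetical pass. `tile_size` is C0-style: it says how this pass
    should work and is supplied by the caller, not stored on the module."""
    # C1-style constraint, read from the module itself.
    max_threads = mod.attrs["max_threads"]
    # The requested tile size is clamped so the constraint still holds.
    effective = min(tile_size, max_threads)
    tiled = {name: effective for name in mod.functions}
    # Attributes are carried forward unchanged; the C0 value is not recorded.
    return replace(mod, functions=tiled)


mod = ToyModule(functions={"main": 0}, attrs={"max_threads": 1024})
out = tile_loops(mod, tile_size=2048)
print(out.functions, out.attrs)
```

A downstream pass that receives `out` sees only the constraint in `attrs`. It cannot come to depend on the tile size an earlier pass was asked to use.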

As a first step, if we all agree, it would be great to unblock ourselves by using a C1-styled CompilationConfig attached to the IRModule in the short/medium term. @tqchen @Mousius @areusch: an explicit response to this question would be highly appreciated.

We agreed that the above information (executor, runtime) is part of the C1-style configuration, and it is helpful to introduce a data structure to store that information.

We were originally discussing whether to use target.packaged (as target was the namespace used for constraint-style configurations, consistent with the previously accepted composite target RFC) or to introduce a separate data structure, CompilationConfig (this RFC).

Whichever data structure is chosen, it will unblock the features that follow, since the two proposed data structures are isomorphic.

The main intention of the last few posts, however, is to clarify the design principle, since this has a bigger impact than the choice of data structure.

IMHO, this slightly contradicts @jroesch’s proposal of integrating the whole compilation pipeline as IRModule → IRModule transformations, because we would be committing ourselves to maintaining a side channel driven by the need to separate C0-styled from C1-styled information.

The discussion of separation does not advocate for the use of PassContext or any particular data structure. It mainly focuses on the importance of separating the two types of information and keeping only C1-style information in the IRModule.

This does not contradict treating the whole pipeline as IRModule->IRModule transformations; it is actually a faithful realization of it. The main motivation for IRModule->IRModule transformation comes from the need for compositionality. We have already had extensive discussions on this point in some of the previous posts.

The IRModule->IRModule principle originated from the Tensor->Tensor principle in deep learning system design, where the key data structure (Tensor or IRModule) contains all the information that needs to be carried along the sequence of actions. This does not imply, however, that all the configurations of previous actions (which are irrelevant, and sometimes impossible to list comprehensively if we are in a loop) should be recorded in the data structure, as we can see in the example of deep learning framework design.

Finally, I want to say that there can be a need for a C0+C1 style config (let us call it C2) in a high-level application (one realization of a pipeline) that composes the passes together. For example, there can be a train_resnet application that comes with an argparse.opt containing the learning rate and layer configurations as well as the device we want to run on. The C2 object then separately configures the C0 and C1 configs using lower-level mechanisms (where they are kept separate) and drives the end-to-end compilation.

My read of the RFC is that there is a desire to have something like that, following the precedent of deep learning framework modularization. What that implies is keeping the C0 and C1 mechanisms, and perhaps introducing a C1 target.packed object as the data structure that is attached to the IRModule. Then we would also introduce a C2 CompilationConfig that is tied only to one compilation pipeline (perhaps the default one used by tvmc) at a different abstraction level. The C2 config will populate the C0 and C1 style configurations.

This would indeed bring a bit more duplication. But as in the case of deep learning frameworks, such duplication is necessary for modularity at different abstraction levels, mainly because centralizing everything in C2 is neither sufficient for all possible composable pipelines (see the previous examples on search loops and alternative paths) nor minimal for pass writers to increase composability.

This is a case where the precedent designs (deep learning frameworks) are really mature and can serve as good reference points. It is also a case where the lessons of deep learning frameworks show that such a choice is critical to the success of the framework as a whole, so it would be good for us to consider that.
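
The C2 idea described above can be sketched roughly as follows. This is plain Python, not TVM code. `PipelineConfig`, its field names, and the `split` method are hypothetical; the only point is that one application-level object populates two separate lower-level configs.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class PipelineConfig:
    """Hypothetical C2 object for one pipeline; field names are illustrative."""
    target_kind: str                       # C1: where the code will run
    executor: str                          # C1: constraint on the generated program
    opt_level: int                         # C0: how the passes should behave
    disabled_passes: Tuple[str, ...] = ()  # C0: which passes to skip

    def split(self) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        """Populate the lower-level C0 and C1 configs separately."""
        c0 = {"opt_level": self.opt_level,
              "disabled_passes": list(self.disabled_passes)}
        c1 = {"kind": self.target_kind, "executor": self.executor}
        return c0, c1


c0, c1 = PipelineConfig("cuda", "aot", opt_level=3).split()
print(c0)
print(c1)
```

Only `c1` would be a candidate for attaching to the module. `c0` would stay with whatever mechanism drives the passes.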

Mousius · January 21, 2022, 6:21pm

Hi @tqchen,

I appreciate your reply with further clarifications, though I’m struggling to reconcile them with the original RFC presented here.

Although they’re isomorphic in structure, the packaged Target has no proven advantage and serves to increase the overall complexity of any additional work in TVM due to the considerations of a potentially recursive Target. I would need a strong motivation for implementing such a complex design, given this RFC aims to reduce complexity by creating explicit structures.

tqchen:

Finally, I want to say that there can be a need for a C0+C1 style config (let us call it C2) in a high-level application (one realization of a pipeline) that composes the passes together. For example, there can be a train_resnet application that comes with an argparse.opt containing the learning rate and layer configurations as well as the device we want to run on. The C2 object then separately configures the C0 and C1 configs using lower-level mechanisms (where they are kept separate) and drives the end-to-end compilation.

My read of the RFC is that there is a desire to have something like that, following the precedent of deep learning framework modularization. What that implies is keeping the C0 and C1 mechanisms, and perhaps introducing a C1 target.packed object as the data structure that is attached to the IRModule. Then we would also introduce a C2 CompilationConfig that is tied only to one compilation pipeline (perhaps the default one used by tvmc) at a different abstraction level. The C2 config will populate the C0 and C1 style configurations.

As in my previous post, I didn’t mean to encourage the notion that only the tvmc flow was considered when presenting this RFC. C2 is where tvmc is right now, working around the limitations of the TVM API: with graph partitioning, tvmc creates its own pipeline on top of TVM to make the feature simple to use. The RFC is therefore aiming to bring some of the learning from tvmc back into the main compilation flow, with some of the advantages listed in the original post pointing towards the construction of a configuration, whether C0 or C1, for use in any compilation flow.

Taking a step back, if we consider this RFC to be adding the C1 type of configuration, is it a requirement for moving this forward that we must also define a mechanism for C0 configuration? Or can we leave dealing with the global state of PassContext to a future RFC wherein we can discuss how to better manage C0 configuration?

Furthermore, if we accept that C1 configuration can be attached to an IRModule, what prevents us proceeding with the CompilationConfig initially suggested given we still have yet to see a clear motivating example as to why we need a recursive definition of Target?

junrushao · January 22, 2022, 12:02am

The advantages of the packaged Target have been extensively discussed in our previous posts.

To clarify, in production there exist non-trivial use cases for Target. For example, there might be a CPU + DLA + GPU case, where Target does need to be expressive enough to represent them. As the simplest example, the config of Jetson is:

TVM_REGISTER_TARGET_TAG("nvidia/jetson-agx-xavier")
    .set_config({{"kind", String("cuda")},
                 {"arch", String("sm_72")},
                 ...,
                 {"dla", SomeOtherTarget},
                 {"host", SomeOtherTarget}});

In our general principle, we do need C1-style configuration for mixed-device compilation. Notably, this configuration could differ from the IRModule-level annotation, if we intend to restrict the possible constraints during optimization.

Second, in BYOC, there is an actual need to pass in additional recursive configuration. For example, some BYOC targets need additional configuration; as TQ mentioned previously, the composition below is a possible need:

- host: x86
- vdevice0: byoc-myawesome-cuda
    - host: x86
    - runtime: cuda-graph
    - vdevice0: cuda
    - library: tensor-rt
- vdevice1: cuda
- runtime: vm

Overall, the design argument here is a subjective matter. As we can see in the following discussion, introducing a separate class for single- and multi-device constraints also brings additional design and engineering complexity for logging/tagging and the overall compilation pipeline, so it’s really a trade-off.

Notably, packaged Target doesn’t mean it is unstructured or encourages arbitrary recursion; we can of course enforce schema validation to make it structured and ensure correctness here.

Additionally, we do see benefits in having a common base class for the C1-type data structure. From the automation point of view, we need to record the constraint in different cases, and as we have a common base class (Target), it would help with tuning log serialization for both single- and multi-device functions. Furthermore, it also brings additional benefit in terms of consistency of representation. As a real-world use case, if the constraint of a function is annotated as DLA + GPU, it’s relatively easy to narrow it down to a GPU-only function if we use the common Target class here. In this case, it’s helpful to represent the DLA + GPU and GPU-only constraints as a common data structure for consistency; the same idea applies to the host/device split pass in TIR.

Finally, we would love to reiterate the advantage of packaged Target: from our PoV, it helps with more generic use cases and maintains the clarity of TVM’s compilation pipeline.
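
To make the schema-validation point concrete, here is a rough sketch in plain Python. It does not use TVM's actual target schema machinery. The `SCHEMA` table, the `validate` function, and the depth limit are all assumptions made for illustration.

```python
from typing import Any, Dict, Set

# Hypothetical schema: for each target kind, the keys it may carry and
# which of those keys may hold a nested target.
SCHEMA: Dict[str, Dict[str, Set[str]]] = {
    "cuda": {"keys": {"kind", "arch", "host"}, "nested": {"host"}},
    "llvm": {"keys": {"kind", "mcpu"}, "nested": set()},
}


def validate(target: Dict[str, Any], depth: int = 0, max_depth: int = 2) -> None:
    """Raise ValueError if `target` violates the schema or nests too deeply."""
    if depth > max_depth:
        raise ValueError("target nesting exceeds the allowed depth")
    kind = target.get("kind")
    if kind not in SCHEMA:
        raise ValueError(f"unknown target kind: {kind!r}")
    unknown = set(target) - SCHEMA[kind]["keys"]
    if unknown:
        raise ValueError(f"{kind}: unexpected keys {sorted(unknown)}")
    # Recurse only into keys that the schema allows to hold a target.
    for key in SCHEMA[kind]["nested"] & set(target):
        validate(target[key], depth + 1, max_depth)


validate({"kind": "cuda", "arch": "sm_72", "host": {"kind": "llvm"}})
```

With a table like this, recursion is possible only where the schema names a nested key, so an `llvm` target carrying its own `host` would be rejected.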

tqchen · January 22, 2022, 12:08am

Thanks @Mousius. I don’t think we need to settle on mechanisms for C0 configuration. The original intention of the RFC appeared to be C1 style, but then the discussion drove it towards C0 + C1 style.

So I agree that we should figure out the data structure choice for C1 style configuration in this post.

We all understand and agree on the possible advantages brought by a single-point setup. It is just that there are also other considerations in terms of compositionality, consistency, and extensibility, as some of the posts brought up.

The suggestion of a C2-style CompilationConfig that translates into C0 and C1 style at the lower level is actually meant to serve as a reconciliation here that learns from the precedent in deep learning frameworks.

Mousius · January 24, 2022, 12:15pm

There is some need to name a specific configuration of Targets, which is present in the CompilationConfig and matches how the TVM partitioning pipeline currently behaves without the configuration (it is already planned to use named configurations in tvmc and we’re not planning to remove Target tagging). I can’t see the lines you’re referring to as motivating though:

https://github.com/apache/tvm/blob/813136401a11a49d6c15e6013c34dd822a5c4ff6/src/target/tag.cc#L73-L81

#define TVM_REGISTER_CUDA_TAG(Name, Arch, SharedMem, RegPerBlock) \
  TVM_REGISTER_TARGET_TAG(Name).set_config({                      \
      {"kind", String("cuda")},                                   \
      {"arch", String(Arch)},                                     \
      {"shared_memory_per_block", Integer(SharedMem)},            \
      {"registers_per_block", Integer(RegPerBlock)},              \
      {"max_threads_per_block", Integer(1024)},                   \
      {"thread_warp_size", Integer(32)},                          \
  });

There is precedent for using the list of Targets successfully in tvmc’s Target parsing infrastructure and as part of a multi-Target workflow (see: microNPU Demo), which is being actively used and extended to incorporate multiple devices and Targets.

This is also supported as part of the existing BYOC flow currently in use for multiple Targets, as evidenced by the aforementioned demo where the microNPU is configured separately. Extending this further is definitely something to explore, but given that the functionality exists today it is unnecessary to do it as part of this iteration.

The evidence presented in this post demonstrates the current behaviour of the TVM codebase and shows that CompilationConfig is a step towards further supporting the features which already exist, without the need for an increase in complexity for Target. Introducing a different mechanism for this is unnecessary given the existing functionality, and this RFC is in fact a small iteration on the existing approach to codify the behaviour.

Is Target within this implementation similar to a dict with a schema? If so, why is this favoured over an explicitly typed class, which has the same benefits but additionally has compile-time checks and a clearly documented form in code? As for serialization, encapsulating the auto-tuning serialization logic in a configuration object would give auto-tuning a clear boundary to work from, and that object can still invoke the various Targets’ serialization functions. I don’t see this as any additional effort over implementing such a function for a packaged Target.
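
A rough sketch of the typed-class alternative, in plain Python with stand-in classes rather than TVM's own: `ToyTarget`, `ToyCompilationConfig`, `export`, and `to_tuning_record` are hypothetical names. In Python the checks would come from a static type checker rather than a compiler, but the shape of the argument is the same: the fields are declared in code, and one method forms the serialization boundary while still delegating to each Target.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToyTarget:
    """Stand-in for a leaf Target with its own serialization function."""
    kind: str
    attrs: Dict[str, str]

    def export(self) -> Dict[str, str]:
        return {"kind": self.kind, **self.attrs}


@dataclass(frozen=True)
class ToyCompilationConfig:
    """Hypothetical typed config; fields mirror the example earlier in the thread."""
    target_host: ToyTarget
    targets: List[ToyTarget]
    executor: str
    runtime: str

    def to_tuning_record(self) -> str:
        """One boundary for tuning logs that still calls each Target's export."""
        return json.dumps({
            "host": self.target_host.export(),
            "targets": [t.export() for t in self.targets],
            "executor": self.executor,
            "runtime": self.runtime,
        }, sort_keys=True)


host = ToyTarget("llvm", {"mcpu": "cortex-m55"})
cfg = ToyCompilationConfig(host, [host], executor="aot", runtime="crt")
print(cfg.to_tuning_record())
```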

In the post What is ‘target’ in TVM? it is clearly demonstrated that this overloaded use of the term Target does not create clarity but introduces further confusion to both users and developers. This has also been articulated over the previous posts.

areusch · January 25, 2022, 8:09am

Discussed further to understand @junrushao and @tqchen’s perspective today. Here are some notes:

Summarizing TQ/Junru’s proposed config philosophy: Compiler configuration can be categorized into two rough categories:
- C0: Configuration which describes the method by which a part of the compiler should work. Most often, this config affects only one pass and can be thought of as configuration specific to the Pass. Currently we store this in PassContext. Most (all?) C0-style configuration should not be used from one pass to configure a downstream pass.
- C1: Configuration which describes the environment where the generated code will run. Here, based on precedent from other deep learning frameworks (@tqchen can cite the specifics), we prefer to express this in terms of a set of constraints. Based on the precedent, constraints are a good representation here because they allow the compiler internals to remain as flexible as possible, so that the compiler can continue to evolve and compose as new libraries, targets, and deployment strategies are added.
  - Because these are constraints, we should choose a representation that specifies the range of each axis. Passes should be viewed as a set of processing steps which may modify (typically, narrowing) the range of each axis. When compilation is done, the compiler generates code to match the remaining constraints.
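
The "range of each axis" idea can be sketched as follows. This is a toy model, not an existing TVM structure. The `Constraint` alias, the axis names, and the `narrow` helper are assumptions. Each axis holds the set of values still allowed, and a pass may only shrink that set.

```python
from typing import Dict, FrozenSet

# A constraint maps each axis to the set of values still allowed.
Constraint = Dict[str, FrozenSet[str]]


def narrow(constraint: Constraint, axis: str, allowed: FrozenSet[str]) -> Constraint:
    """Return a copy with `axis` restricted to values also in `allowed`."""
    remaining = constraint[axis] & allowed
    if not remaining:
        raise ValueError(f"no value left on axis {axis!r}")
    return {**constraint, axis: remaining}


start: Constraint = {
    "device": frozenset({"dla", "gpu"}),
    "executor": frozenset({"aot", "graph"}),
}
# A hypothetical placement pass decides this function must run on the GPU.
after = narrow(start, "device", frozenset({"gpu"}))
print(sorted(after["device"]), sorted(after["executor"]))
```

Because `narrow` intersects rather than assigns, a pass written this way cannot widen a range that an earlier pass already restricted.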
One of the goals of CompilationConfig is to consolidate compiler configuration so it’s easy to manipulate on disk and simple to understand the supported configuration.
- CompilationConfig shouldn’t be seen as replacing C0 (e.g. PassContext) or C1 (e.g. Target) style config. Andrew: it should be seen as a composite data structure containing a representation of both. This allows parallel compilation flows to be built and maintained independently of CompilationConfig, which certainly could be used elsewhere but is primarily motivated by tvmc.
- As a starting point, it would be great to consider CompilationConfig as the method to specify configuration for the tvmc-based flow rather than as the singular way to configure all tvm.relay.build.
- Andrew: A desirable property of CompilationConfig is that the pieces of the composite struct which correspond to the compiler internal configuration are trivial representations of the actual structures used internally.
- For PassContext: this is essentially restricting the data types of the values and defining a serialization into an e.g. yaml or json map.
- For Target: this gets more complex. We can split this problem into a couple parts:
  - Let us define a Leaf Target as that subset of a Target configuration specific to a single Target subclass. In other words, exclude any relation between targets and the Executor and Runtime configuration. This part is essentially a schema’d version of the PassContext problem.
  - More complex are the Executor, Runtime, and “packaged” Target proposals discussed earlier. Complicating the problem is that these items form program-level constraints, but some parts of these could be function-level constraints. For now, the compiler builds only one such type of program (e.g. a single program per module, if memory serves). This may change in the future. Additionally complicating the problem is that there are some examples of sub-programs which may be driven by an executor, thus needing similar information. And finally, we have currently already baked much of this into Target via the executor parameters (which were recently split out but also the subject of this continuing RFC) and via target_host.
  - This RFC doesn’t need to resolve a proper way to model all possible program constraints, but if we are attempting to choose a way to model this constraint such that it can be reflected trivially into CompilationConfig, we should choose a system that can be easily extended to describe a flexible set of constraints, so that people adding new types of executor relations (e.g. creating sub-programs with Executor constraints, similar to the TVM Unity effort) aren’t hampered by this config system.
  - So long as we are able to build an extensible system, we could probably start with a Target equivalent which lacks a recurrence relation. It’s an open question how this should be reflected in disk configuration.
  - The struct which defines the whole-program constraint should probably continue to be called Target to avoid confusion. As we explore sub-program constraints, we may want to extract pieces of Target into a common base class (at least the parts that handle the schema). It may also be wise to extract Leaf Target information into a separate class with a better name.
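
For the PassContext part of the notes above (restricting value types and defining a serialization into a JSON map), a minimal sketch might look like this. The option keys are made up for the example and the scalar-only restriction is an assumption, not a statement about what PassContext accepts today.

```python
import json
from typing import Dict, Union

# Assumed restriction: option values are limited to JSON scalar types.
Scalar = Union[bool, int, float, str]


def dump_pass_options(options: Dict[str, Scalar]) -> str:
    """Serialize C0-style options to a JSON map, rejecting other value types."""
    for key, value in options.items():
        if not isinstance(value, (bool, int, float, str)):
            raise TypeError(f"{key}: unsupported type {type(value).__name__}")
    return json.dumps(options, sort_keys=True)


def load_pass_options(text: str) -> Dict[str, Scalar]:
    """Read the JSON map back into a plain dict."""
    return json.loads(text)


# Hypothetical option names, used only for illustration.
opts = {"example.max_depth": 50, "example.disable_vectorize": True}
text = dump_pass_options(opts)
print(text)
```

Restricting values to scalars is what keeps the on-disk form trivial: the map round-trips through `load_pass_options` without any custom decoding.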

cc @mbs-octoml

manupa-arm · January 25, 2022, 12:37pm

To be pragmatic about progress here, I would propose that we take the minimum step: what we are after is a better representation of C1-typed information in the compilation flow.

areusch:

CompilationConfig shouldn’t be seen as replacing C0 (e.g. PassContext) or C1 (e.g. Target) style config. Andrew: it should be seen as a composite data structure containing a representation of both. This allows parallel compilation flows to be built and maintained independently of CompilationConfig, which certainly could be used elsewhere but is primarily motivated by tvmc.

As a starting point, it would be great to consider CompilationConfig as the method to specify configuration for the tvmc-based flow rather than as the singular way to configure all tvm.relay.build.

Andrew: A desirable property of CompilationConfig is that the pieces of the composite struct which correspond to the compiler internal configuration are trivial representations of the actual structures used internally.

For PassContext: this is essentially restricting the data types of the values and defining a serialization into an e.g. yaml or json map.

Could we leave bringing C0-styled information into it to a separate RFC? It is proving complex to solve all of this together.

I personally see this as the first of many steps that we want to solve, so let’s get it sorted.

We are fine as long as we don’t use/overload the same data structure for both (leaf and non-leaf). We can discuss what a good name for this is.

I agree with @areusch here that the current state of TVM builds only a single program, and I would think this RFC does not block any future RFCs that wish to support multi-program execution / partitioning.

I don’t think we are mandating a “freeze” on the non-leaf target data structure in this RFC.

Therefore, it would be wise for us to propose the extensions when and where they are needed. As a community, we should try to discuss the levels of API and partitioning strategy, which will nicely motivate additions to the non-leaf Target to support multiple programs.

@Mousius and I spent a few cycles thinking about this. We reached the conclusion that what we are after is the separation of the non-leaf target and the leaf target. We have proposed here to call the former CompilationConfig and to keep the latter as target. However, after the discussion, it seems it also makes sense to keep the non-leaf target as “Target”, if that is meaningful and reduces confusion, while we look to rename the leaf target to something else (e.g. Backend).

@Mousius any more thoughts?

areusch · April 6, 2022, 2:57pm

I discussed this with @tqchen, @junrushao, and @mbs-octoml. tl;dr we are broadly in agreement with this RFC and we think it can proceed.

This post will start by re-summarizing our understanding of the motivations for such an invasive IR change. Then, it will cover the controversial parts and explain the various approaches. Finally, it will summarize our opinions and conclude with our opinion of the best way forward.

This thread was incredibly long. Now that the format of the TVM Community Meeting has changed, I’d suggest we bring further discussion of large design changes like this one to those meetings for higher-bandwidth discussions.

Motivations for this Change

This RFC proposes to overhaul the way the TVM compiler is configured. The motivation behind this is to export the compiler configuration into a human-readable format (e.g. YAML) that can be consumed by a command-line tool (e.g. tvmc).

Additionally, there is a desire to place the full target configuration in the IRModule somewhere as an attribute so that it can be used in various passes (@Mousius and @manupa-arm, would be great to re-clarify this).

Classes of Configuration Affected by this Proposal

A discussion point that arose midway through this RFC is around the classification of configuration involved with this proposal. @tqchen proposed two classes:

C0. Configuration that directly specifies how some process in the compiler is carried out. It’s important to consider this in the abstract when understanding the motivations for the decisions here. In practice, it’s good to note here that in the codebase today, this roughly is PassContext.

C1. Configuration that specifies constraints on the compiler without giving a specific way to accommodate them. This configuration typically specifies properties of the deployment environment. The sentence in C0 about considering this in the abstract also applies here. In practice, it’s good to note here that in the codebase today, this roughly means Target.

Point of Clarification: this RFC is confined to C1-style config. A follow-on RFC may consider C0-style config.

What can be attached to an IRModule?

This RFC proposes that we attach the full CompilationConfig to an IRModule. Before the previous point was clarified, this was contentious. We discussed at length the question of what style of configuration should be permitted to be attached to IRModules. The resolution was that there is consensus that C0-style config should not be attached to IRModules because it may create behavioral coupling between Passes which could be difficult to unit test. There is a strong desire to avoid coupling between Passes to keep them composable and retain flexibility in the compiler.

The result of this discussion was a decision that CompilationConfig itself should not be attached to an IRModule; rather, that C1-style config it contains (namely, the Target information) should be attached instead.

Why attach C1-style CompilationConfig to an IRModule?

There is one question unanswered in the previous section: what is the motivation for attaching C1-style CompilationConfig to IRModule? There are two points to make here:

There was a need by ARM folks to reference the Target from some passes [@mousius @manupa-arm it has now been so long since we discussed this I have forgotten which one required this—feel free to add it in]. Target is an object currently passed around the compiler on the stack as necessary. Last year, @jroesch began an effort to attach all of this “extra” (e.g. stack-passed information, or information tracked in flow-level compiler classes) to the IRModule during compilation. Target is yet another instance of this, so attaching it to the IRModule is the medium-term correct way to expose it to the pass ARM is trying to write.
The ultimate goal of this RFC is to expose the compiler’s configuration to tvmc users in a form that could be edited, serialized, and deserialized without needing to write Python or have a copy of the TVM source code. Since tvmc users have little visibility into the compiler source, it’s beneficial to eliminate any translations between the configuration they edit and the configuration accepted by the compiler. Attaching C1-style CompilationConfig (e.g. Target) directly to IRModule and referencing that as the authority on C1-style config accomplishes that goal.

Representation of Target

We now turn to the most contentious piece of debate: how should Target be represented? There are two types of Targets considered here:

Leaf targets. Identifies a single TVM backend (mapping to a single DLDevice at runtime) which, when used with the broader CompilationConfig, will generate functions which depend only on that device for execution.
Composite targets. Identifies a collection of Leaf Targets, one of which is considered the “host” (and therefore, which will host the Executor infrastructure).

Target is typically thought of as a parameter to tvm.relay.build. Currently, when a Leaf Target is passed to tvm.relay.build, it is promoted to a Composite Target with the “host” considered to be the same Leaf Target.
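
The promotion rule described above can be sketched with stand-in classes. `Leaf`, `Composite`, and `promote` are hypothetical names and this is not the code path `tvm.relay.build` actually runs. It only illustrates "a lone leaf becomes a composite whose host is that same leaf".

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Leaf:
    """Stand-in for a Leaf Target."""
    kind: str


@dataclass(frozen=True)
class Composite:
    """Stand-in for a Composite Target with a designated host."""
    host: Leaf
    targets: List[Leaf]


def promote(target: Union[Leaf, Composite]) -> Composite:
    """Wrap a lone leaf target in a composite whose host is that same leaf."""
    if isinstance(target, Composite):
        return target
    return Composite(host=target, targets=[target])


print(promote(Leaf("llvm")))
```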

The contentious piece here was how to represent composite targets. Several options were proposed:

D0. Introduce “packaged” Target

This proposal suggests we introduce a new Target type:

{
  "kind": "packaged",
  "runtime": "crt",
  "executor": "...",
  "target": {
    "kind": "cuda",   # the target that TIR generates to
    "host": {
      "kind": "llvm", # the codegen target for the host-side driver code
       ...
    }
  },
}

# Proposed signature, to be exposed as tvm.target.packaged
def packaged(
  target="cuda",
  executor="aot",
  runtime="crt",
): ...

The advantages to this option were:

It allows reuse of the Target schema infrastructure specified in src/target/target_kind.cc and friends.
It requires minimal effort to implement.
It is polymorphic—any attribute in an IRModule where a Target was required could be either a Leaf Target or a Composite Target. This means that where some flexibility was desired, the compiler could begin with a Composite Target and, via Passes, arrive at a Leaf Target. The example given here was in deciding where a Relay function should run.
Common needs such as in-memory repr for efforts such as Collage are already implemented.
No modification to tvm.relay.build needed aside from adjustments to [Target.check_and_update_host_consist](https://github.com/apache/tvm/blob/main/python/tvm/target/target.py#L222)

The disadvantages to this option were:

Polymorphism can lead to confusion. When an attribute exists on a part of an IRModule which could be either Leaf or Composite Target, passes need to add extra logic to determine which kind of target is present. Asserting that an IRModule is well-formed is more difficult and could be a more difficult process for the programmer to understand.
It is presumed that tvmc-level configuration could be specified by more than one user. For example, a part of that configuration could be specified by the hardware vendor, and another part could be specified by the tvmc user. While it would be illegal for packaged Target to contain another packaged Target, such rules would need to be enforced by runtime logic rather than the type system. In a situation such as the one just posed, where multiple partial configurations exist and are combined to form a whole, it is vital that the user be able to understand the rules for combining partial configurations. Given the potential for infinite recursion allowed by the type system, those rules become difficult to specify.
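
To show what "enforced by runtime logic rather than the type system" might involve, here is a small sketch of combining a vendor-supplied and a user-supplied partial configuration. The two rules encoded in `merge_partial` (values set on both sides must agree, and the inner target may not be packaged) are assumptions picked for illustration, not rules taken from TVM.

```python
from typing import Any, Dict


def merge_partial(vendor: Dict[str, Any], user: Dict[str, Any]) -> Dict[str, Any]:
    """Combine two partial packaged-target configs under two assumed rules:
    a key may be set by both only if the values agree, and the inner
    "target" may not itself be of kind "packaged"."""
    merged = dict(vendor)
    for key, value in user.items():
        if key in merged and merged[key] != value:
            raise ValueError(f"conflicting values for {key!r}")
        merged[key] = value
    inner = merged.get("target", {})
    if isinstance(inner, dict) and inner.get("kind") == "packaged":
        raise ValueError("a packaged target may not contain a packaged target")
    return merged


vendor_part = {"kind": "packaged", "target": {"kind": "cuda"}}
user_part = {"kind": "packaged", "executor": "aot", "runtime": "crt"}
print(merge_partial(vendor_part, user_part))
```

Every such rule has to be written down and checked by hand, which is the cost this disadvantage is pointing at.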

D1. Adopt explicit LeafTarget and PackagedTarget classes

In this option, LeafTarget and PackagedTarget are represented by distinct classes which inherit from a common base class e.g. TargetBase. TargetBase is presumed to contain only infrastructure such as schema representation and in-memory repr functionality. It would not be considered to be a valid attribute type in the TVM compilation pass, with one exception: it would be valid for a single component to store TargetBase when:

It is not attached as TargetBase to an IRModule seen from another Pass.
It is convenient for that component to represent a flexible Leaf or Composite Target.

The proposal is sketched below:

class TargetBase:
    kind: str

class LeafTarget(TargetBase):
    kind: str
    host: Optional[LeafTarget]
    …

class VirtualDevice:
    target: Optional[LeafTarget]
    device_id: int

class PackagedTarget(TargetBase):
    target: LeafTarget
    host: LeafTarget
    executor: Executor
    runtime: Runtime
    devices: List[VirtualDevice]

The advantages to this option are:

It allows reuse of the Target schema infrastructure specified in src/target/target_kind.cc and friends.
It requires minimal effort to implement.
It is explicit—there is no confusion between PackagedTarget and LeafTarget where attached to an IRModule.
Common needs such as in-memory repr for efforts such as Collage are already implemented.
No modification to tvm.relay.build is needed aside from adjustments to [Target.check_and_update_host_consist](https://github.com/apache/tvm/blob/main/python/tvm/target/target.py#L222). However, we could modify tvm.relay.build to take only PackagedTarget in a future update.

The disadvantages to this option are:

The kind field is present on the base class and could suggest polymorphic use in the code.
Polymorphic use needs to be disallowed in code review.
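
To make the "explicit" property of D1 concrete, here is a minimal runnable sketch using Python dataclasses. It is an illustration under assumptions, not TVM code: Executor and Runtime are reduced to plain strings, the defaults exist only to satisfy dataclass field ordering, and the helper attach_to_module is hypothetical. The point it shows is that a consumer can reject anything that is not a PackagedTarget instead of branching on kind.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TargetBase:
    # Shared infrastructure only (schema, repr); not meant to be attached directly.
    kind: str


@dataclass
class LeafTarget(TargetBase):
    host: Optional["LeafTarget"] = None


@dataclass
class VirtualDevice:
    target: Optional[LeafTarget] = None
    device_id: int = 0


@dataclass
class PackagedTarget(TargetBase):
    # Defaults are placeholders (assumption) so that field ordering stays valid.
    target: Optional[LeafTarget] = None
    host: Optional[LeafTarget] = None
    executor: str = "graph"
    runtime: str = "cpp"
    devices: List[VirtualDevice] = field(default_factory=list)


def attach_to_module(attrs: dict, packaged: TargetBase) -> dict:
    """Hypothetical helper: only a PackagedTarget may be attached."""
    if not isinstance(packaged, PackagedTarget):
        raise TypeError(f"expected PackagedTarget, got {type(packaged).__name__}")
    return {**attrs, "packaged_target": packaged}


host = LeafTarget(kind="llvm")
gpu = LeafTarget(kind="cuda", host=host)
packaged = PackagedTarget(
    kind="packaged",
    target=gpu,
    host=host,
    executor="aot",
    runtime="crt",
    devices=[VirtualDevice(target=gpu, device_id=0)],
)
attrs = attach_to_module({}, packaged)
```

Passing a LeafTarget to attach_to_module raises TypeError, which is the kind of check that D1 would otherwise leave to code review.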

D2. Adopt separate PackagedTarget and LeafTargets without any common base class

This option fully separates the PackagedTarget and LeafTarget classes:

class LeafTarget:
    host: Optional[LeafTarget]

Target = LeafTarget

class VirtualDevice:
    target: Optional[LeafTarget]
    device_id: int

class PackageConfig:
    host: LeafTarget
    executor: Executor
    runtime: Runtime
    devices: List[VirtualDevice]

The advantages to this option are:

It is explicit—there is no confusion between PackagedTarget and LeafTarget where attached to an IRModule.
The API to tvm.relay.build could be made the most specific of all of the options.

The disadvantages to this option are:

Target schema and repr infrastructure needs to be re-implemented.
It requires a big lift that may be difficult/impossible to do in an incremental way.

Decision on Target Representation

We conclude that D1 is the best approach. It has the benefits of explicit typing on IRModule and in flow-level compiler classes while retaining flexibility which could prove useful in implementing future projects which may experiment with composite targets, such as Collage. Collage will discuss these efforts shortly at the TVM Community Meeting and in an RFC.

Example of Partial Configuration

Finally, an example of partial configuration, as it had bearing on the discussion:

my-soc.yaml:
tag: my-soc-base
target:
  kind: ethos
  memory-size: 128
host:
  kind: llvm
  mcpu: cortex-m33
runtime:
  kind: c

app.yaml:
executor:
  kind: aot
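
One possible combining rule for these two files can be sketched as follows. This is an assumption for illustration, not a rule the RFC specifies: disjoint keys are combined, nested sections are merged recursively, and two different values for the same key are an error. The files are represented as already-parsed dicts with lowercase keys, and merge_partial is a hypothetical name.

```python
def merge_partial(base: dict, overlay: dict, path: str = "") -> dict:
    """Merge two partial configurations, refusing conflicting values."""
    merged = dict(base)
    for key, value in overlay.items():
        where = f"{path}.{key}" if path else key
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_partial(merged[key], value, where)
        elif merged[key] != value:
            raise ValueError(
                f"conflicting values for '{where}': {merged[key]!r} vs {value!r}"
            )
    return merged


# Parsed contents of my-soc.yaml (vendor-provided part).
my_soc = {
    "tag": "my-soc-base",
    "target": {"kind": "ethos", "memory-size": 128},
    "host": {"kind": "llvm", "mcpu": "cortex-m33"},
    "runtime": {"kind": "c"},
}

# Parsed contents of app.yaml (user-provided part).
app = {"executor": {"kind": "aot"}}

config = merge_partial(my_soc, app)
```

A stricter or looser rule (for example, letting the application override the vendor file) is equally plausible; the sketch only shows that the rule has to be stated somewhere.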

Our Conclusion

The RFC as proposed should not be in conflict with the consensus we reached. We prefer the implementation of the RFC to re-use the schema and in-memory repr infrastructure developed for Target by adopting a common base class. Only the PackagedTarget from CompilationConfig should be attached to the IRModule, leaving room to add PassContext to CompilationConfig in a future RFC.

Mousius · April 14, 2022, 2:50pm

@areusch thanks for coming back to this and working to get this resolved. Unfortunately, I think we’ve reached an impasse, which I’ll attempt to articulate further.

This is one of the initial motivations around this change, to support moving the BYOC infrastructure further into the core compiler as well as create a less dynamic approach to gathering Executor/Runtime/Targets from an IRModule given they should be non-optional. BYOC Targets are only known before relay.build, and have functions partitioned with Targets or kCompilers that can only be found on the graph nodes, due to much of it being implemented in tvmc. The RelayToTIR hook walks the graph looking for such annotations to reverse engineer this information. We can also see the need for multiple Targets in the Collage RFC.

If we could use one object for both context and constraints, that would be ideal. If the two types of configuration must be separated, then it would be better for tvmc to combine the PassContext and CompilationConfig using a higher level of abstraction (visible only in tvmc) rather than trying to provide both levels of abstraction in one object. As such, I believe the tvmc configuration object can call CompilationConfig::FromJSONNode(node) or similar to process that portion of the object. This would be an improvement over the currently proposed variant of --config, which is being added without CompilationConfig.

By using a common base class of Target, this change introduces further confusion in the Target system, which I evidenced as already being problematic above. By introducing PackagedTarget and LeafTarget we introduce even more new terminology and start using Target not only as a TVM target but also as a JSON format for other objects in the codebase. Given that the Target serialisation is straightforward JSON, we should be able to encapsulate that in a JSON serialiser that enumerates the fields of the configuration, rather than using Target purely for the side effect that it can export JSON in the correct format.
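
As a rough illustration of a serialiser that enumerates configuration fields without going through Target, the sketch below uses Python dataclasses and the standard json module. The class names ExecutorConfig, RuntimeConfig and CompilationConfigSketch are hypothetical stand-ins, not TVM classes; the option values are taken from the CompilationConfig example earlier in the thread.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ExecutorConfig:
    kind: str
    options: Dict[str, object] = field(default_factory=dict)


@dataclass
class RuntimeConfig:
    kind: str
    options: Dict[str, object] = field(default_factory=dict)


@dataclass
class CompilationConfigSketch:
    targets: List[str]
    executor: ExecutorConfig
    runtime: RuntimeConfig


def to_json(config: CompilationConfigSketch) -> str:
    # asdict walks the declared fields, so the schema is the class definition.
    return json.dumps(asdict(config), sort_keys=True)


def from_json(text: str) -> CompilationConfigSketch:
    raw = json.loads(text)
    return CompilationConfigSketch(
        targets=list(raw["targets"]),
        executor=ExecutorConfig(**raw["executor"]),
        runtime=RuntimeConfig(**raw["runtime"]),
    )


original = CompilationConfigSketch(
    targets=["c"],
    executor=ExecutorConfig("aot", {"interface-api": "c", "unpacked-api": True}),
    runtime=RuntimeConfig("crt", {"system-lib": True}),
)
restored = from_json(to_json(original))
```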

Summarily, it’s counter to this proposal to create further Target confusion both internally for compiler engineers and externally for users. Given the length of this thread, I don’t believe this will be a short-term solution, and it is likely to promote further confusion as to the definition of Target. As I’m under the impression this currently blocks the TVM release, in the spirit of moving forwards, I would suggest we consider this RFC rejected and continue with the current relay.build API.

kparzysz · April 25, 2022, 9:20pm

Let me get back to the thread “What is ‘target’ in TVM?” for a moment. First of all, the fact that such a thread was started shows that there is a lack of clarity about what “target” really means, and the thread we’re in does little to address it. Andrew acknowledges this lack of clarity in his reply, and states that “target” is essentially the “deployment environment”. The problem is that this concept is far too rich to express via a single definition.

I think we should reconsider the meaning of “target”.

I don’t think that anyone here opposes the idea of coalescing the “compilation configuration” into a single data object. Rather, the objections stem from the concept of “target” itself.

Target structure

There is hardware on which we want to execute some model. This hardware can consist of multiple components, each of which may have different instruction set, different operating system (or none at all), etc. When compiling for this hardware, the first thing we need to know is which part of the code will run on which component. This assignment can happen early on (in relay, as is the case with BYOC), or later on in TIR. The Collage work is (IMO) quite an elegant solution, which could be transferred (conceptually) to TIR.

The key information here is the structure of the hardware in terms of which component is connected to which other component, and what the capacities are of each component. This will decide what kinds of code can be compiled for, and executed on that component. The other question is whether given code should actually be executed there. So, what we need to know about each component is (1) capabilities, and (2) performance characteristics of each component. This is obviously in addition to (3) instruction set and operating system.

Components as units

The characteristics of each component are mostly self-contained, and independent from the presence or absence of other components, which suggests that components should be described independently from their placement in the hardware. Usually there will be a component that is more capable than others, and is the one that users interact with directly, although there is no reason to assume that this will always be the case: we can consider two computers connected via a network as the piece of hardware we want to run the model on.

Architecture

I propose that we separate the concept of the architecture (i.e. the arrangement of components) from the components themselves. What we previously called “packaged target” would map to the architecture description together with the description of each component in it.

We could then apply the term “target” to describe an individual component in an architecture. We wouldn’t need composite targets, or even “host target”.

For each component we could then describe either an external compiler (for use in BYOC), or internal code generator (for TIR compilation).
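
A minimal sketch of this separation, assuming nothing about TVM's actual classes: Component describes one unit on its own terms, and Architecture only records which components exist and how they are connected. All names and values here (Component, Architecture, "vendor-byoc", and so on) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Component:
    name: str
    isa: str
    os: Optional[str] = None
    external_compiler: Optional[str] = None  # used for BYOC-style compilation
    codegen: Optional[str] = None  # used for internal TIR compilation


@dataclass
class Architecture:
    components: Dict[str, Component] = field(default_factory=dict)
    links: Set[Tuple[str, str]] = field(default_factory=set)

    def add(self, component: Component) -> None:
        self.components[component.name] = component

    def connect(self, a: str, b: str) -> None:
        for name in (a, b):
            if name not in self.components:
                raise KeyError(f"unknown component: {name}")
        # Store each undirected link once, in sorted order.
        self.links.add(tuple(sorted((a, b))))

    def neighbours(self, name: str) -> Set[str]:
        return {b if a == name else a for a, b in self.links if name in (a, b)}


arch = Architecture()
arch.add(Component("cpu", isa="x86_64", os="linux", codegen="llvm"))
arch.add(Component("gpu", isa="ptx", codegen="cuda"))
arch.add(Component("npu", isa="custom", external_compiler="vendor-byoc"))
arch.connect("cpu", "gpu")
arch.connect("cpu", "npu")
```

Note that no component is marked as "host" here; that role would follow from the connectivity rather than from a dedicated field.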

tqchen · April 26, 2022, 2:31am

Thanks @kparzysz . I think the main contention here is whether we acknowledge the need to effectively specify a group of “sub-components”.

When we say target as the fundamental “component”, an implicit assumption is that the particular component roughly comes with a grouped set of compilation pipelines, and they are not necessarily further divisible.

Logically, this view leads to first-class configuration of two things:

V0: The most top-level thing which is the PackagedTarget that specifies the entire package architecture.
V1: The most bottom-level thing, which is LeafTarget. In some sense, people might want to get rid of host to make it truly a leaf.

The two-level view makes most things easy for both V0 and V1.

The other view emphasizes that during compilation it is important to have configuration constraints at the function level, which goes beyond V1.

V2: A compositional constraint that contains the “components” for a particular function.

Perhaps the above figure can illustrate the subtleness here.

Imagine we have a system whose host driver is x86 and that contains three “components”: CUDA (which runs NVIDIA devices), cublas (for BYOC), and a remote XPU, which is a driver for a remote device that, from the remote’s point of view, is in turn driven by a host (risc) and an accelerator (accel).

The V0/V1 point of view means that we only need to clarify the modular components: each rectangle is a component (V1), and the global architectural setting is the top-level package configuration (V0).

A V2-level configuration corresponds to the dashed boxes here, each covering the components that the config intersects. For example:

A corresponds to a function configuration in the most common setting: a target with host. This is effectively any CUDA function before the host/device split.
B corresponds to a BYOC case which, depending on the setting, can also imply CUDA.
C corresponds to a case where the host/device split is available structurally, and a further split on the remote function is also needed.

V2 effectively acknowledges the need to structurally represent a collection of components and the constraints needed to compile some functions. Of course, different V2 configs can have overlapping information; for example, all functions need to have the same host to be able to call into each other.

Customization under the V2 view is also not hard, as each sub-component grouping can have its own compilation rules (pipelines) that leverage the compilation pipelines of its own components (e.g. CUDA with host will leverage the host compiler pipeline and the CUDA pipeline accordingly). In the case of C, it will call into the host pipeline and the remote-xpu pipeline, which in turn decomposes and calls into the risc pipeline and the accel pipeline.
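
The delegation described in this paragraph can be sketched as a small recursive lookup. This is only an illustration: the SUBCOMPONENTS table and the pipelines_for function are hypothetical, component names are plain strings, and a "pipeline" is represented just by the name of the component it belongs to.

```python
from typing import Dict, List

# Composite components and the sub-components their pipeline delegates to
# (assumed from the description of case C).
SUBCOMPONENTS: Dict[str, List[str]] = {"remote-xpu": ["risc", "accel"]}


def pipelines_for(components: List[str]) -> List[str]:
    """Flatten a V2 grouping into the ordered leaf pipelines it ends up invoking."""
    order: List[str] = []
    for name in components:
        children = SUBCOMPONENTS.get(name)
        if children:
            # A composite component decomposes into its own sub-pipelines.
            order.extend(pipelines_for(children))
        else:
            order.append(name)
    return order


config_a = ["x86", "cuda"]  # case A: target with host
config_c = ["x86", "remote-xpu"]  # case C: host plus a further-split remote

pipelines_a = pipelines_for(config_a)
pipelines_c = pipelines_for(config_c)
```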

So one of the main points of contention is how we divide and conquer:

A V0/V1-only view means dividing and conquering in a two-level way: effectively decompose V0 into V1s and solve each V1 separately.
A V2 view would acknowledge that during divide and conquer we have sub-steps (for certain functions) that look into a collection of components (with TargetWithHost being the most common example), and that it is important to acknowledge that fact and cover these dashed cases (even though they can overlap and require consistency checks when annotated on different functions).

junrushao · April 26, 2022, 5:25am

Might be off the topic, but I think @kparzysz has a valid point here:

If we don’t act to clarify the meaning of Target, I believe questions will continuously pop up.

Mousius · April 26, 2022, 12:18pm

I believe the architecture description you’ve described is essentially what CompilationConfig is at present (see: compilation_config.h), which contains List[Target], and VirtualDevices mapping those Target to Devices. In this way, multiple components can reside on a single device which allows Collage to select the best available combination, correct me if I’m wrong @mbs-octoml

I agree, we can remove most of the confusion around Target by adopting your concept of individual components rather than describing the architecture through them. Considering the Target Hooks RFC, I believe we can achieve this rationalisation and removal of BYOC, in favour of each component being described as a Target. You can see this taking form with our implementation of CMSIS-NN whereby it is actually a Target internally:

https://github.com/apache/tvm/blob/4dc47df369f3116f7674e474ea655b4c9e2e25ab/src/relay/backend/contrib/cmsisnn/target.cc#L33-L35

TVM_REGISTER_TARGET_KIND("cmsis-nn", kDLCPU)
    .set_attr<FTVMRelayToTIR>("RelayToTIR", RelayToTIR())
    .set_attr<FTVMTIRToRuntime>("TIRToRuntime", TIRToRuntime);

The information required to fully utilise it as a Target is currently lost in tvmc, which further motivates the need for the architecture description. Fully implementing the RelayToRuntime Target Hook would then mean that a Target can produce either TIR, TIR + Runtime module or Runtime modules directly from Relay - replacing the BYOC kCompiler flow over time.

kparzysz · April 26, 2022, 1:36pm

In my view, the “architecture” would be the horizontal boxes (i.e. “components”), plus edges indicating connectivity. The graph could be partitioned into connected[1] groups of components, and each such group could be described by the union of the properties of its components[2]. This partitioning wouldn’t need to be done manually, it could also be done dynamically by algorithms trying to match/partition the code to the underlying architecture. I think this would cover all cases V0, V1, and V2. I think it would also allow multiple approaches to code partitioning, whether it’s a two-level view, or function-based divide-and-conquer.

This may be nearly identical to the “LeafTarget” and “PackagedTarget”, but it makes it explicit that the partition (i.e. the “PackagedTarget”) is a derivative concept built from the architecture description (i.e. components and connections).

[1] Connected topologically, i.e. not having isolated sub-groups.

[2] Union may not be applicable to every type of properties, but the idea here is that it would be something that can be algorithmically determined.
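
To illustrate the two footnotes, the sketch below checks whether a proposed group of components is connected and, if so, describes it by the union of its components' properties. Everything here is an assumption made for the example: properties are modelled as sets of strings, and the component names, edges and property values are invented.

```python
from typing import Dict, Iterable, List, Set, Tuple


def is_connected(group: Iterable[str], edges: List[Tuple[str, str]]) -> bool:
    """True if the group has no isolated sub-groups, using only edges inside it."""
    members = set(group)
    if not members:
        return False
    adjacency: Dict[str, Set[str]] = {name: set() for name in members}
    for a, b in edges:
        if a in members and b in members:
            adjacency[a].add(b)
            adjacency[b].add(a)
    start = next(iter(members))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == members


def describe(group: Iterable[str], properties: Dict[str, Set[str]]) -> Set[str]:
    """Union of the properties of every component in the group."""
    merged: Set[str] = set()
    for name in group:
        merged |= properties[name]
    return merged


edges = [("x86", "cuda"), ("x86", "accel")]
properties = {
    "x86": {"isa:x86_64", "os:linux"},
    "cuda": {"isa:ptx"},
    "accel": {"isa:custom"},
}

host_and_gpu = {"x86", "cuda"}
gpu_and_accel = {"cuda", "accel"}  # only linked through x86, so not a valid group
```

An algorithm that partitions code onto the hardware could enumerate candidate groups and keep only those for which is_connected holds.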

tqchen · April 26, 2022, 1:57pm

Thanks @kparzysz What you said makes sense.

Effectively, one point of view holds that a unified structure (base) is needed in order to configure the divide-and-conquer transition through the V0 => V2 => V1 phases of function optimization. In your terminology, that structure is the “Architecture”. I agree with that point.

The V2 view mainly calls for an “Architecture” which contains the components and connectivity and can represent:

V0: the global set of configurations.
V2: some configs that contain a host with a target.
V1: the final leaf, where there is really only a single “target” in the traditional compiler sense.

Given that “Architecture” describes how things group with each other in a hierarchical fashion, one possible option would be to adopt the current Target data structure (perhaps with a different name to differentiate it from the leaf component), since the relation groupings are usually sub-trees.

Note that the naming itself is a separate issue that can be addressed independently (Personally I think architecture should be avoided mainly because it is already used in Arch field of LLVM’s target triple, which makes it a sub component of target (triple)), but it is a minor issue.

kparzysz · April 26, 2022, 1:57pm

Yes, definitely. I was trying to present an independent point of view, and so I was trying to avoid using terminology that was already in use in this thread.

areusch · April 27, 2022, 12:13am

Thanks all for these discussions. I agree with @kparzysz’s point that the architecture should be separated from the concept of a “component.” I had a similar thought in discussion with @Mousius last week that perhaps we should formally name and define these concepts because they are complex and easy to confuse. We’ve had quite a few problems communicating about the overall desired outcome here because it’s difficult to know whether someone means “the conceptual idea of Target” or “the current realization of LeafTarget in the codebase” or “some partially-abstract base class for both architecture and component views.”

I think one thing that’s confusing about the current Target data structure is that the name of the structure is both:

a base class which provides schema and serialization
an abstract concept that vaguely describes the deployment environment

It might be useful to depart from the name Target here, since that seems to just be overloaded and vague at this point. I did have this thought:

LeafTarget → VirtualDevice::codegen (that is, actually require a VirtualDevice in place of LeafTarget, and include a field Codegen codegen which could describe a constraint on how the compiler may generate code for this device). Codegen is really what LeafTarget::kind indicates, and we’ve sanctioned that word via the Bring Your Own Codegen name. Sure, there are other things that are implied by including a codegen into the description of the deploy environment constraints, but ultimately the main thing described within the bounds of the Codegen data structure are properties of the codegen itself. You could construct a VirtualDevice with only a Codegen specified, and then this would lend itself better to the refactor asked for by Artifact where we allow users to name VirtualDevices.

I don’t have great thoughts on the others yet. Half-baked ideas…

PackagedTarget → ? Thought for a while here and still not sure. CompositeDeployment or Deployment or DeployEnvironment.
Target/TargetBase → DeployConstraint or TargetSchema or something.

However, the general thing I’m going for here is to tighten the scopes/definitions so that we can make progress. We can always add new concepts as we build out support for them.

I agree we might be able to reuse the conceptual data structure. In reusing the current Target data structures, the opportunity could arise to introduce ambiguity in the tree:

class HeterogenousDeployEnvironment : public TargetBase {
  // What does "target" in "target_host" mean?
  // What kind of TargetBase should be filled in here?
  TargetBase target_host;
};

Here we’ve repeated the name “target” a few times and made it unclear how to fill in the data structure. If we are to reuse such an ambiguous structure, I believe that we should avoid ambiguity so it’s clear how we intend for people to use it.

tqchen · April 27, 2022, 2:50pm

Thanks @areusch , to further build on your comment.

The main property that we want to preserve (from the current target system) is a common base class of possible configurations that represent V2. Depending on how the dashed box is drawn, it can range from a singleton (e.g. device-only CUDA), to a part of the composite (with the most common case being TargetWithHost), to the entirety of V1.

To build on the recommendation that leaf components be separated, here is an example under @kparzysz’s terminology (Architecture being the layout and Target being the component, leaving the naming itself aside for now):

// No virtual device is needed as compilation for TIR function
// is generally applicable to any virtual device
class DeviceOnlyArch : public Architecture {
  public:
   Target device;
};

class DeviceWithHostArch : public Architecture {
  public:
   Target device;
   Optional<Target> host;
};

// Virtual device needed for graph level runtime information validation
class PackagedArch : public Architecture {
  public:
   Array<VirtualDevice> devices;
   Target host;
   Runtime runtime;
   Executor executor;
};

Note that different architectures will certainly result in different compilation pipelines, which can be decomposed into those of the sub-architectures. As a result, dispatching on the kind or on a structured view is helpful here.

Depending on the phase of compilation and its state, a function can sit at different levels of constraints (Architectures), which specify the deployment constraints (and hints) for that function, ranging from PackagedArch to DeviceWithHostArch and finally DeviceOnlyArch.
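
The progression from PackagedArch through DeviceWithHostArch to DeviceOnlyArch can be sketched in Python as two decomposition steps. This is a simplified illustration, not the proposed implementation: targets, runtime and executor are plain strings, VirtualDevice is reduced to its target name, and split_package and split_host are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DeviceOnlyArch:
    device: str


@dataclass(frozen=True)
class DeviceWithHostArch:
    device: str
    host: Optional[str] = None


@dataclass(frozen=True)
class PackagedArch:
    devices: Tuple[str, ...]
    host: str
    runtime: str
    executor: str


def split_package(package: PackagedArch) -> List[DeviceWithHostArch]:
    """Package level to function level: one device-with-host view per device."""
    return [DeviceWithHostArch(device=d, host=package.host) for d in package.devices]


def split_host(arch: DeviceWithHostArch) -> List[DeviceOnlyArch]:
    """Function level to leaf level: the host/device split."""
    parts = [DeviceOnlyArch(device=arch.device)]
    if arch.host is not None:
        parts.append(DeviceOnlyArch(device=arch.host))
    return parts


package = PackagedArch(devices=("cuda",), host="llvm", runtime="crt", executor="aot")
per_function = split_package(package)
leaves = [leaf for arch in per_function for leaf in split_host(arch)]
```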

In the original view, an Architecture can be any meaningfully grouped subtree in the global settings; as a result, the leaf itself can also be viewed as a subtree. That was the original rationale of the Target system, and personally I do not find a strong difference between the two. But I also acknowledge the advantage of being able to separate out leaves as being special. The main thing to preserve is the ability to specify the architecture (of a subtree) throughout our divide-and-conquer process of compilation.

areusch · April 27, 2022, 2:56pm

Just to be clear about each of these cases–could we explicitly state their uses in the thread so everyone is on the same page? I think there might be questions about why you’d ever pass DeviceOnlyArch to tvm.relay.build().