[pre-RFC] Compilation Configuration Representation

Thanks @Mousius. I don't think we need to settle on mechanisms for C0 configuration. The original intention of the RFC appeared to be C1 style, but then the discussion drove it towards a C0 + C1 style.

So I agree that we should figure out the data structure choice for C1 style configuration in this post.

We all understand and agree on the possible advantages brought by a single-point setup. There are, however, other considerations in terms of compositionality, consistency, and extensibility, as some of the posts brought up.

The suggestion of a C2-style CompilationConfig that translates into C0 and C1 style at a lower level is actually meant to serve as a reconciliation here, learning from prior precedent in deep learning frameworks.

There is some need to name a specific Targets configuration, which is present in the CompilationConfig and matches how the TVM partitioning pipeline currently behaves without the configuration (it is already planned to use named configurations in tvmc, and we’re not planning to remove Target tagging). I can’t see the lines you’re referring to as motivating, though:

There is precedent for using the list of Targets successfully in tvmc's Target parsing infrastructure and as part of a multi-Target workflow (see: microNPU Demo), which is being actively used and extended to incorporate multiple devices and Targets.

This is also supported as part of the existing BYOC flow currently in use for multiple Targets, as evidenced by the aforementioned demo where the microNPU is configured separately. Extending this further is definitely something to explore, but given the functionality exists today it is unnecessary to do as part of this iteration.

The evidence presented in this post demonstrates the current behaviour of the TVM codebase, showing that CompilationConfig is a step towards further supporting features which already exist without increasing the complexity of Target. Introducing a different mechanism for this is unnecessary given the existing functionality; this RFC is in fact a small iteration on the existing approach to codify the behaviour.

Target within this implementation is similar to a dict with a schema? Why is this favoured over an explicitly typed class which has the same benefits but additionally has compile-time checks and a clearly documented form in code? As for serialization, encapsulating the auto-tuning serialization logic in a configuration object would give auto tuning a clear boundary to work from, which can still invoke the various Targets' serialization functions. I don’t see this as any additional effort over implementing such a function for a packaged Target.
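
To make the contrast concrete, here is a minimal sketch (plain Python, not the actual TVM classes, and the field names are only illustrative) of the two styles being compared: a dict validated against a runtime schema versus an explicitly typed configuration object.

from dataclasses import dataclass
from typing import Optional

# Style 1: a dict validated against a schema at runtime; mistakes only
# surface when the dict is actually checked.
TARGET_SCHEMA = {"kind": str, "mcpu": str}

def validate(config: dict) -> dict:
    for key, value in config.items():
        expected = TARGET_SCHEMA.get(key)
        if expected is None or not isinstance(value, expected):
            raise ValueError(f"bad field {key!r}")
    return config

# Style 2: an explicitly typed class; the fields and their types are
# documented in code and checkable before runtime.
@dataclass
class LeafTargetConfig:
    kind: str
    mcpu: Optional[str] = None

validate({"kind": "llvm", "mcpu": "cortex-m33"})
LeafTargetConfig(kind="llvm", mcpu="cortex-m33")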

In the post What is ‘target’ in TVM? it is clearly demonstrated that this overloaded use of the term Target does not create clarity but introduces further confusion to both users and developers. This has also been articulated over the previous posts.

Discussed further to understand @junrushao and @tqchen’s perspective today. Here are some notes:

  • Summarizing TQ/Junru’s proposed config philosophy: Compiler configuration can be categorized into two rough categories:
    • C0: Configuration which describes the method by which a part of the compiler should work. Most often, this config affects only one pass and can be thought of as configuration specific to the Pass. Currently we store this in PassContext. Most (all?) C0-style configuration should not be used from one pass to configure a downstream pass.
    • C1: Configuration which describes the environment where the generated code will run. Here, based on precedent from other deep learning frameworks (@tqchen can cite the specifics), we prefer to express this in terms of a set of constraints. Based on that precedent, constraints are a good representation because they allow the compiler internals to remain as flexible as possible, so the compiler can continue to evolve and compose as new libraries, targets, and deployment strategies are added.
      • Because these are constraints, we should choose a representation that specifies the range of each axis. Passes should be viewed as a set of processing steps which may modify (typically, narrowing) the range of each axis. When compilation is done, the compiler generates code to match the remaining constraints.
  • One of the goals of CompilationConfig is to consolidate compiler configuration so it’s easy to manipulate on disk and simple to understand the supported configuration.
    • CompilationConfig shouldn’t be seen as replacing C0 (e.g. PassContext) or C1 (e.g. Target) style config. Andrew: it should be seen as a composite data structure containing a representation of both. This allows parallel compilation flows to be built and maintained independently of CompilationConfig, which certainly could be used elsewhere but is primarily motivated by tvmc.
    • As a starting point, it would be great to consider CompilationConfig as the method to specify configuration for the tvmc-based flow rather than as the singular way to configure all tvm.relay.build.
    • Andrew: A desirable property of CompilationConfig is that the pieces of the composite struct which correspond to the compiler internal configuration are trivial representations of the actual structures used internally.
    • For PassContext: this is essentially restricting the data types of the values and defining a serialization into, e.g., a YAML or JSON map (see the sketch after this list).
    • For Target: this gets more complex. We can split this problem into a couple parts:
      • Let us define a Leaf Target as that subset of a Target configuration specific to a single Target subclass. In other words, exclude any relation between targets and the Executor and Runtime configuration. This part is essentially a schema’d version of the PassContext problem.
      • More complex are the Executor, Runtime, and “packaged” Target proposals discussed earlier. Complicating the problem is that these items form program-level constraints, but some parts of these could be function-level constraints. For now, the compiler builds only one such type of program (e.g. a single program per module, if memory serves). This may change in the future. Additionally complicating the problem is that there are some examples of sub-programs which may be driven by an executor, thus needing similar information. And finally, we have currently already baked much of this into Target via the executor parameters (which were recently split out but also the subject of this continuing RFC) and via target_host.
      • This RFC doesn’t need to resolve a proper way to model all possible program constraints, but if we are attempting to choose a way to model this constraint such that it can be reflected trivially into CompilationConfig, we should choose a system that can be easily extended to describe a flexible set of constraints, so that people adding new types of executor relations (e.g. creating sub-programs with Executor constraints, similar to the TVM Unity effort) aren’t hampered by this config system.
      • So long as we are able to build an extensible system, we could probably start with a Target equivalent which lacks a recurrence relation. It’s an open question how this should be reflected in disk configuration.
      • The struct which defines the whole-program constraint should probably continue to be called Target to avoid confusion. As we explore sub-program constraints, we may want to either extract pieces of Target into a common base class (at least the parts that handle the schema). It may be wise to extract Leaf Target information into a separate class with a better name.
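
As an illustration of the PassContext point above, here is a minimal sketch assuming plain, JSON-friendly option values (the option names are just examples; the real serialization hook may look different):

import json

# Hypothetical C0-style options, restricted to JSON-friendly value types.
pass_options = {
    "opt_level": 3,
    "tir.disable_vectorize": True,
}

# Serialize to disk and back; the restriction on value types is what
# makes the round trip lossless.
encoded = json.dumps(pass_options, indent=2)
decoded = json.loads(encoded)
assert decoded == pass_options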

cc @mbs-octoml

To be a bit pragmatic about progress here, I would propose we do the minimum step: what we are after is a better representation of C1-style information in the compilation flow.

Could we leave bringing C0-styled information into it to a separate RFC? It is proving complex to solve all of this together.

I personally identify this as the first of many steps we want to solve, therefore let's get this sorted :slight_smile:.

We are fine as long as we don't use/overload the same data structure for both (leaf and non-leaf). We can discuss what a good name for this would be.

I agree with @areusch here that in its current state TVM builds only a single program, and I would think this RFC does not block any future RFCs that wish to support multi-program execution / partitioning.

I don't think we are mandating a “freeze” on the non-leaf target data structure in this RFC.

Therefore, it would be wise for us to propose the extensions when and where they are needed. As a community, we should try to discuss the levels of API and the partitioning strategy, which will nicely motivate additions to the non-leaf Target to support multiple programs.

@Mousius and I spent a few cycles thinking about this. We reached the conclusion that what we are after is the separation of non-leaf targets and leaf targets. We have proposed here to call the former CompilationConfig and leave the latter as Target. However, after the discussion, it seems it would also make sense to keep the non-leaf target as “Target” – if that is meaningful and reduces confusion – while we look to rename the leaf target to something else (e.g. Backend).

@Mousius any more thoughts ?

I discussed this with @tqchen, @junrushao, and @mbs-octoml. tl;dr we are broadly in agreement with this RFC and we think it can proceed.

This post will start by re-summarizing our understanding of the motivations for such an invasive IR change. Then, it will cover the controversial parts and explain the various approaches. Finally, it will summarize our opinions and conclude with what we see as the best way forward.

This thread was incredibly long. Now that the format of the TVM Community Meeting has changed, I’d suggest we bring further discussion of large design changes like this one to those meetings for higher-bandwidth discussions.

Motivations for this Change

This RFC proposes to overhaul the way the TVM compiler is configured. The motivation behind this is to export the compiler configuration into a human-readable format (e.g. YAML) that can be consumed by a command-line tool (e.g. tvmc).

Additionally, there is a desire to place the full target configuration in the IRModule somewhere as an attribute so that it can be used in various passes (@Mousius and @manupa-arm, would be great to re-clarify this).

Classes of Configuration Affected by this Proposal

A discussion point that arose midway through this RFC is around the classification of configuration involved with this proposal. @tqchen proposed two classes:

C0. Configuration that directly specifies how some process in the compiler is carried out. It’s important to consider this in the abstract when understanding the motivations for the decisions here. In practice, it’s good to note here that in the codebase today, this roughly is PassContext.

C1. Configuration that specifies constraints on the compiler without giving a specific way to accommodate them. This configuration typically specifies properties of the deployment environment. The sentence in C0 about considering this in the abstract also applies here. In practice, it’s good to note here that in the codebase today, this roughly means Target.

Point of Clarification: this RFC is confined to C1-style config. A follow-on RFC may consider C0-style config.

What can be attached to an IRModule?

This RFC proposes that we attach the full CompilationConfig to an IRModule. Before the previous point was clarified, this was contentious. We discussed at length the question of what style of configuration should be permitted to be attached to IRModules. The resolution was that there is consensus that C0-style config should not be attached to IRModules because it may create behavioral coupling between Passes which could be difficult to unit test. There is a strong desire to avoid coupling between Passes to keep them composable and retain flexibility in the compiler.

The result of this discussion was a decision that CompilationConfig itself should not be attached to an IRModule; rather, that C1-style config it contains (namely, the Target information) should be attached instead.
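
As a hedged sketch of what that attachment could look like, assuming an IRModule attribute helper along the lines of with_attr (both the helper name and the attribute key here are illustrative assumptions, not something settled by this discussion):

import tvm
from tvm import relay

# A trivial Relay module and a piece of C1-style configuration.
x = relay.var("x", shape=(1, 4))
mod = tvm.IRModule.from_expr(relay.nn.relu(x))
target = tvm.target.Target("llvm")

# Attach only the C1-style information to the IRModule; "target" as the
# attribute key and with_attr as the helper are assumptions for this sketch.
mod = mod.with_attr("target", target)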

Why attach C1-style CompilationConfig to an IRModule?

There is one question unanswered in the previous section: what is the motivation for attaching C1-style CompilationConfig to IRModule? There are two points to make here:

  1. There was a need by ARM folks to reference the Target from some passes [@mousius @manupa-arm it has now been so long since we discussed this I have forgotten which one required this—feel free to add it in]. Target is an object currently passed around the compiler on the stack as necessary. Last year, @jroesch began an effort to attach all of this “extra” (e.g. stack-passed information, or information tracked in flow-level compiler classes) to the IRModule during compilation. Target is yet another instance of this, so attaching it to the IRModule is the medium-term correct way to expose it to the pass ARM is trying to write.
  2. The ultimate goal of this RFC is to expose the compiler’s configuration to tvmc users in a form that could be edited, serialized, and deserialized without needing to write Python or have a copy of the TVM source code. Since tvmc users have little visibility into the compiler source, it’s beneficial to eliminate any translations between the configuration they edit and the configuration accepted by the compiler. Attaching C1-style CompilationConfig (e.g. Target) directly to IRModule and referencing that as the authority on C1-style config accomplishes that goal.

Representation of Target

We now turn to the most contentious piece of debate: how should Target be represented? There are two types of Targets considered here:

  1. Leaf targets. Identifies a single TVM backend (mapping to a single DLDevice at runtime) which, when used with the broader CompilationConfig, will generate functions which depend only on that device for execution.
  2. Composite targets. Identifies a collection of Leaf Targets, one of which is considered the “host” (and therefore, which will host the Executor infrastructure).

Target is typically thought of as a parameter to tvm.relay.build. Currently, when a Leaf Target is passed to tvm.relay.build, it is promoted to a Composite Target with the “host” considered to be the same Leaf Target.
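
For reference, a small sketch of how that looks from the Python side, using the Target constructor's host argument (the promotion of a bare Leaf Target happens inside the build flow, e.g. via the check_and_update_host_consist adjustment mentioned below):

import tvm

# A Leaf Target on its own; no host is attached yet.
leaf = tvm.target.Target("cuda")

# The composite form the compiler ultimately works with: the same kind of
# Target, but with an explicit host.
composite = tvm.target.Target("cuda", host="llvm")
print(composite.host)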

The contentious piece here was how to represent composite targets. Several options were proposed:

D0. Introduce “packaged” Target

This proposal suggests we introduce a new Target type:

{
  "kind": "packaged",
  "runtime": "crt",
  "executor": "...",
  "target": {
    "kind": "cuda",   # the target that TIR generates to
    "host": {
      "kind": "llvm", # the codegen target for the host-side driver code
      ...
    }
  },
}

def tvm.target.packaged(
  target="cuda",
  executor="aot",
  runtime="crt",
): ...

The advantages to this option were:

  1. It allows reuse of the Target schema infrastructure specified in src/target/target_kind.cc and friends.
  2. It requires minimal effort to implement.
  3. It is polymorphic—any attribute in an IRModule where a Target was required could be either a Leaf Target or a Composite Target. This means that where some flexibility was desired, the compiler could begin with a Composite Target and, via Passes, arrive at a Leaf Target. The example given here was in deciding where a Relay function should run.
  4. Common needs such as in-memory repr for efforts such as Collage are already implemented.
  5. No modification to tvm.relay.build is needed aside from adjustments to [Target.check_and_update_host_consist](https://github.com/apache/tvm/blob/main/python/tvm/target/target.py#L222).

The disadvantages to this option were:

  1. Polymorphism can lead to confusion. When an attribute exists on a part of an IRModule which could be either Leaf or Composite Target, passes need to add extra logic to determine which kind of target is present. Asserting that an IRModule is well-formed is more difficult and could be a more difficult process for the programmer to understand.
  2. It is presumed that tvmc-level configuration could be specified by more than one user. For example, a part of that configuration could be specified by the hardware vendor, and another part could be specified by the tvmc user. While it would be illegal for packaged Target to contain another packaged Target, such rules would need to be enforced by runtime logic rather than the type system. In a situation such as the one just posed, where multiple partial configurations exist and are combined to form a whole, it is vital that the user be able to understand the rules for combining partial configurations. Given the potential for infinite recursion allowed by the type system, those rules become difficult to specify.

D1. Adopt explicit LeafTarget and PackagedTarget classes

In this option, LeafTarget and PackagedTarget are represented by distinct classes which inherit from a common base class, e.g. TargetBase. TargetBase is presumed to contain only infrastructure such as schema representation and in-memory repr functionality. It would not be considered a valid attribute type in TVM compilation passes, with one exception: it would be valid for a single component to store a TargetBase when:

  1. It is not attached as TargetBase to an IRModule seen from another Pass.
  2. It is convenient for that component to represent a flexible Leaf or Composite Target.

The proposal is sketched below:

class TargetBase:
    kind: str

class LeafTarget(TargetBase):
    kind: str
    host: Optional[LeafTarget]
    …

class VirtualDevice:
    target: Optional[LeafTarget]
    device_id: int

class PackagedTarget(TargetBase):
    target: LeafTarget
    host: LeafTarget
    executor: Executor
    runtime: Runtime
    devices: List[VirtualDevice]
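
A self-contained mock-up of how construction under this option might look (these dataclasses only mirror the sketch above; none of them are existing TVM classes, and Executor/Runtime are reduced to strings for brevity):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LeafTarget:
    kind: str
    host: Optional["LeafTarget"] = None

@dataclass
class VirtualDevice:
    target: Optional[LeafTarget] = None
    device_id: int = 0

@dataclass
class PackagedTarget:
    target: LeafTarget
    host: LeafTarget
    executor: str
    runtime: str
    devices: List[VirtualDevice] = field(default_factory=list)

host = LeafTarget(kind="llvm")
device = LeafTarget(kind="cuda", host=host)
packaged = PackagedTarget(
    target=device,
    host=host,
    executor="aot",
    runtime="crt",
    devices=[VirtualDevice(target=device, device_id=0)],
)
# Where an attribute is declared as PackagedTarget or LeafTarget, no runtime
# check is needed to tell which one is present.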

The advantages to this option are:

  1. It allows reuse of the Target schema infrastructure specified in src/target/target_kind.cc and friends.
  2. It requires minimal effort to implement.
  3. It is explicit—there is no confusion between PackagedTarget and LeafTarget where attached to an IRModule.
  4. Common needs such as in-memory repr for efforts such as Collage are already implemented.
  5. No modification to tvm.relay.build is needed aside from adjustments to [Target.check_and_update_host_consist](https://github.com/apache/tvm/blob/main/python/tvm/target/target.py#L222). However, we could modify tvm.relay.build to take PackagedTarget only in a future update.

The disadvantages to this option are:

  1. The kind field is present on the base class and could suggest polymorphic use in the code.
  2. Polymorphic use needs to be disallowed in code review.

D2. Adopt separate PackagedTarget and LeafTargets without any common base class

This option fully separates the PackagedTarget and LeafTarget classes:

class LeafTarget:
    host: Optional[LeafTarget]

Target = LeafTarget

class VirtualDevice:
    target: Optional[LeafTarget]
    device_id: int

class PackageConfig:
    host: LeafTarget
    executor: Executor
    runtime: Runtime
    devices: List[VirtualDevice]

The advantages to this option are:

  1. It is explicit—there is no confusion between PackagedTarget and LeafTarget where attached to an IRModule.
  2. The API to tvm.relay.build could be made the most specific of all of the options.

The disadvantages to this option are:

  1. Target schema and repr infrastructure needs to be re-implemented.
  2. It requires a big lift that may be difficult/impossible to do in an incremental way.

Decision on Target Representation

We conclude that D1 is the best approach. It has the benefits of explicit typing on IRModule and in flow-level compiler classes while retaining flexibility which could prove useful in implementing future projects that may experiment with composite targets, such as Collage. The Collage effort will be discussed shortly at the TVM Community Meeting and in an RFC.

Example of Partial Configuration

Finally, an example of partial configuration, as it had bearing on the discussion:

my-soc.yaml:
tag: my-soc-base
target:
  kind: ethos
  memory-size: 128
host:
  kind: llvm
  mcpu: cortex-m33
runtime:
  kind: c

app.yaml:
executor:
  kind: aot
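
As a sketch of how two such partial files might combine once parsed, assuming a deliberately simple merge rule (later files only fill in keys the earlier ones left out; nothing here is an agreed semantics):

# The dicts below mirror my-soc.yaml and app.yaml after YAML parsing.
soc = {
    "tag": "my-soc-base",
    "target": {"kind": "ethos", "memory-size": 128},
    "host": {"kind": "llvm", "mcpu": "cortex-m33"},
    "runtime": {"kind": "c"},
}
app = {
    "executor": {"kind": "aot"},
}

def merge_partial_configs(*parts: dict) -> dict:
    # Illustrative merge: earlier parts win on conflicts, later parts may
    # only add missing top-level keys.
    merged: dict = {}
    for part in parts:
        for key, value in part.items():
            merged.setdefault(key, value)
    return merged

config = merge_partial_configs(soc, app)
assert config["executor"] == {"kind": "aot"}
assert config["runtime"] == {"kind": "c"}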

Our Conclusion

The RFC as proposed should not be in conflict with the consensus we reached. We prefer the implementation of the RFC to re-use the schema and in-memory repr infrastructure developed for Target by adopting a common base class. Only the PackagedTarget from CompilationConfig should be attached to the IRModule, leaving room to add PassContext to CompilationConfig in a future RFC.

@areusch thanks for coming back to this and working to get this resolved, unfortunately I think we’ve reached an impasse, which I’ll attempt to articulate further.

This is one of the initial motivations for this change: to support moving the BYOC infrastructure further into the core compiler, as well as to create a less dynamic approach to gathering Executor/Runtime/Targets from an IRModule, given they should be non-optional. BYOC Targets are only known before relay.build, and functions are partitioned with Targets or kCompiler annotations that can only be found on the graph nodes, due to much of it being implemented in tvmc. The RelayToTIR hook walks the graph looking for such annotations to reverse-engineer this information. We can also see the need for multiple Targets in the Collage RFC.

If we could use one object for both context and constraints, that would be ideal; if we require the two types of configuration to be separated, then it’d be better for tvmc to combine the PassContext and CompilationConfig using a higher level of abstraction (visible only in tvmc) rather than try to provide both levels of abstraction in one object. As such, I believe the tvmc configuration object can call CompilationConfig::FromJSONNode(node) or similar to process that portion of the object; this would be an improvement over the currently proposed variant of --config, which is being added without CompilationConfig.

By using a common base class of Target, this change introduces further confusion in the Target system, which I evidenced as already being problematic above; by introducing PackagedTarget and LeafTarget we introduce yet more new terminology and start using Target not only as a TVM target but also as a JSON format for other objects in the codebase. Given that the Target serialisation is straightforward JSON, we should be able to encapsulate that in a JSON serialiser that enumerates the fields of configuration rather than using the Target purely for the side effect that it can export JSON in the correct format.

Summarily, it’s counter to this proposal to create further Target confusion both internally for compiler engineers and externally for users; given the length of this thread I don’t believe this will be a short-term solution and it is likely to promote further confusion as to the definition of Target. As I’m under the impression this currently blocks the TVM release, in the spirit of moving forwards, I would suggest we consider this RFC rejected and continue with the current relay.build API.

Let me get back to this thread What is ‘target’ in TVM? for a moment. First of all, the fact that such a thread was started shows that there is a lack of clarity about what “target” really means, and the thread we’re in does little to address it. Andrew acknowledges this lack of clarity in his reply, and states that “target” is essentially the “deployment environment”. The problem is that this is a concept far too rich to express via a single definition.

I think we should reconsider the meaning of “target”.

I don’t think that anyone here opposes the idea of coalescing the “compilation configuration” into a single data object. Rather, the objections stem from the concept of “target” itself.

Target structure

There is hardware on which we want to execute some model. This hardware can consist of multiple components, each of which may have a different instruction set, a different operating system (or none at all), etc. When compiling for this hardware, the first thing we need to know is which part of the code will run on which component. This assignment can happen early on (in relay, as is the case with BYOC), or later on in TIR. The Collage work is (IMO) quite an elegant solution, which could be transferred (conceptually) to TIR.

The key information here is the structure of the hardware in terms of which component is connected to which other component, and what the capabilities of each component are. This will decide what kinds of code can be compiled for, and executed on, that component. The other question is whether given code should actually be executed there. So, what we need to know about each component is (1) its capabilities, and (2) its performance characteristics. This is obviously in addition to (3) its instruction set and operating system.
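
A small sketch of that per-component description in code, with purely illustrative field names and values:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Component:
    # (3) instruction set and operating system (or none at all)
    isa: str
    os: str
    # (1) capabilities: what kinds of code *can* be compiled for and run here
    capabilities: List[str] = field(default_factory=list)
    # (2) performance characteristics: whether code *should* run here
    performance: Dict[str, float] = field(default_factory=dict)

cpu = Component(
    isa="x86_64",
    os="linux",
    capabilities=["scalar", "avx2"],
    performance={"peak_gflops": 500.0},
)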

Components as units

The characteristics of each component are mostly self-contained, and independent from the presence or absence of other components, which suggests that components should be described independently from their placement in the hardware. Usually there will be a component that is more capable than others, and is the one that users interact with directly, although there is no reason to assume that this will always be the case: we can consider two computers connected via a network as the piece of hardware we want to run the model on.

Architecture

I propose that we separate the concept of the architecture (i.e. the arrangement of components) from the components themselves. What we previously called “packaged target” would map to the architecture description together with the description of each component in it.

We could then apply the term “target” to describe an individual component in an architecture. We wouldn’t need composite targets, or even “host target”.

For each component we could then describe either an external compiler (for use in BYOC), or internal code generator (for TIR compilation).

Thanks @kparzysz . I think the main contention here is whether we acknowledge the need to effectively specify a group of “sub-components”.

When we treat target as the fundamental “component”, an implicit assumption is that the particular component roughly comes with a grouped set of compilation pipelines and is not necessarily further divisible.

Logically, this view leads to first-class configuration of two things:

  • V0: The top-level thing, which is the PackagedTarget that specifies the entire package architecture.
  • V1: The bottom-level thing, which is the LeafTarget; in some sense people might want to get rid of host to make it truly a leaf.

The two-level view makes most things easy for both V0 and V1.

The other view emphasizes that during compilation it is important to have function-level configuration constraints that go beyond V1:

  • V2: A compositional constraint that contains “components” for a particular function.

Perhaps the above figure can illustrate the subtlety here.

Imagine we have a system whose host driver is x86 and that contains three “components”: CUDA (which runs NVIDIA devices), cublas (for BYOC), and a remote XPU, which is a driver for a remote device that, from the remote’s point of view, is again driven by a host (risc) and an accelerator (accel).

The V0/V1 point of view means that we only need to clarify the modular components – each rectangle is a component (V1) – and the global architectural setting is the top-level package configuration V0.

A V2-level configuration corresponds to the dashed boxes here, each covering the components that the config intersects. For example (see the sketch after this list):

  • A corresponds to a function configuration with the most common setting, a target with host. This is effectively any CUDA function before the host/device split.
  • B corresponds to a BYOC case, which depending on the setting can also imply CUDA.
  • C corresponds to a case where the host-device split is available structurally, and a further split on the remote function is also needed.
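
Since the figure from the original post is not reproduced here, a rough sketch of the three dashed-box cases as function-level constraint records may help (plain dicts with purely illustrative keys):

# Function-level V2 constraints for the three dashed boxes described above.
case_a = {"device": "cuda", "host": "llvm"}    # target with host
case_b = {"byoc": "cublas", "device": "cuda"}  # BYOC, may also imply CUDA
case_c = {                                     # remote XPU with its own split
    "host": "llvm",
    "device": {
        "kind": "remote-xpu",
        "host": "risc",
        "accel": "accel",
    },
}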

V2 effectively acknowledges the need to structurally represent a collection of components and the constraints needed to compile some functions – of course different V2 configs can have overlapping information, as all functions need to have the same host, for example, to be able to call into each other.

Customization under the V2 view is also not hard, as each sub-component grouping can have its own compilation rules (pipelines) that leverage the compilation pipelines of its own components (e.g. CUDA with host will leverage the host compiler pipeline and the CUDA pipeline accordingly). In the case of C, it will call into the host pipeline and the remote-xpu pipeline, which in turn decomposes and calls into the risc pipeline and the accel pipeline.

So one of the main contention points is how we do divide and conquer:

  • A V0/V1-only view means divide and conquer in a two-level way: effectively decompose V0 into V1s and solve each V1 separately.
  • A V2 view would acknowledge that during divide and conquer we have sub-steps (for certain functions) that would look into a collection of components (with TargetWithHost being the most common example), and it is important to acknowledge that fact and cover these dashed cases (even though they can overlap and require consistency checks when annotated on different functions).

Might be off the topic, but I think @kparzysz has a valid point here:

If we don’t act to clarify the meaning of Target, I believe questions will continuously pop up.

I believe the architecture description you’ve described is essentially what CompilationConfig is at present (see: compilation_config.h), which contains a List[Target], and VirtualDevices mapping those Targets to Devices. In this way, multiple components can reside on a single device, which allows Collage to select the best available combination; correct me if I’m wrong @mbs-octoml :smile_cat:

I agree, we can remove most of the confusion around Target by adopting your concept of individual components rather than describing the architecture through them. Considering the Target Hooks RFC, I believe we can achieve this rationalisation and removal of BYOC, in favour of each component being described as a Target. You can see this taking form with our implementation of CMSIS-NN whereby it is actually a Target internally:

The information required to fully utilise it as a Target is currently lost in tvmc, which further motivates the need for the architecture description. Fully implementing the RelayToRuntime Target Hook would then mean that a Target can produce either TIR, TIR + Runtime module or Runtime modules directly from Relay - replacing the BYOC kCompiler flow over time.

In my view, the “architecture” would be the horizontal boxes (i.e. “components”), plus edges indicating connectivity. The graph could be partitioned into connected[1] groups of components, and each such group could be described by the union of the properties of its components[2]. This partitioning wouldn’t need to be done manually, it could also be done dynamically by algorithms trying to match/partition the code to the underlying architecture. I think this would cover all cases V0, V1, and V2. I think it would also allow multiple approaches to code partitioning, whether it’s a two-level view, or function-based divide-and-conquer.

This may be nearly identical to the “LeafTarget” and “PackagedTarget”, but it makes it explicit that the partition (i.e. the “PackagedTarget”) is a derivative concept built from the architecture description (i.e. components and connections).

[1] Connected topologically, i.e. not having isolated sub-groups.

[2] Union may not be applicable to every type of properties, but the idea here is that it would be something that can be algorithmically determined.

Thanks @kparzysz What you said makes sense.

Effectively, one point of view calls for a unified structure (base) that would be needed to configure things through the divide-and-conquer transition across the V0 => V2 => V1 phases of function optimization – which in your terminology means “Architecture”. I agree with that point.

The V2 view mainly calls for an “Architecture”, which contains the components and connectivity and can represent:

  • V0: the global set of configurations
  • V2: some configs that contain a host with a target
  • V1: the final leaf, where there is really only a single “target” in the traditional compiler sense.

“Architecture” describes the relations of how things group with each other in a hierarchical fashion. One possible option would be to adopt the current Target data structure (perhaps with a different name to differentiate it from the leaf component), given the relation groupings are usually sub-trees.

Note that the naming itself is a separate issue that can be addressed independently (personally I think “architecture” should be avoided, mainly because it is already used in the Arch field of LLVM’s target triple, which makes it a sub-component of the target (triple)), but it is a minor issue.

Yes, definitely. I was trying to present an independent point of view, and so I was trying to avoid using terminology that was already in use in this thread.

Thanks all for these discussions. I agree with @kparzysz’s point that the architecture should be separated from the concept of a “component.” I had a similar thought in discussion with @Mousius last week that perhaps we should formally name and define these concepts, because they are complex and easy to confuse. We’ve had quite a few problems communicating about the overall desired outcome here because it’s difficult to know whether someone means “the conceptual idea of Target” or “the current realization of LeafTarget in the codebase” or “some partially-abstract base class for both architecture and component views.”

I think one thing that’s confusing about the current Target data structure is that the name of the structure is both:

  1. a base class which provides schema and serialization
  2. an abstract concept that vaguely describes the deployment environment

It might be useful to depart from the name Target here, since that seems to just be overloaded and vague at this point. I did have this thought:

  • LeafTarget → VirtualDevice::codegen (that is, actually require a VirtualDevice in place of LeafTarget, and include a field Codegen codegen which could describe a constraint on how the compiler may generate code for this device). Codegen is really what LeafTarget::kind indicates, and we’ve sanctioned that word via the Bring Your Own Codegen name. Sure, there are other things implied by including a codegen in the description of the deploy-environment constraints, but ultimately the main things described within the bounds of the Codegen data structure are properties of the codegen itself. You could construct a VirtualDevice with only a Codegen specified, and then this would lend itself better to the refactor asked for by Artifact where we allow users to name VirtualDevices.
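
A quick sketch of that renaming idea, with hypothetical classes just to make the field layout concrete:

from dataclasses import dataclass
from typing import Optional

# Hypothetical shapes for the idea above; these are not existing TVM classes.
@dataclass
class Codegen:
    kind: str        # what LeafTarget::kind indicates today, e.g. "llvm"
    mattr: str = ""

@dataclass
class VirtualDevice:
    device_id: int = 0
    # A constraint on how the compiler may generate code for this device.
    codegen: Optional[Codegen] = None

# A VirtualDevice constructed with only a Codegen specified.
vd = VirtualDevice(codegen=Codegen(kind="llvm", mattr="+mve"))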

I don’t have great thoughts on the others yet. Half-baked ideas…

  • PackagedTarget → ? Thought for a while here and still not sure. CompositeDeployment or Deployment or DeployEnvironment.
  • Target/TargetBase → DeployConstraint or TargetSchema or something.

However, the general thing I’m going for here is to tighten the scopes/definitions so that we can make progress. We can always add new concepts as we build out support for them.

I agree we might be able to reuse the conceptual data structure. In reusing the current Target data structures, the opportunity could arise to introduce ambiguity in the tree:

class HeterogeneousDeployEnvironment : public TargetBase {
  // What does "target" in "target_host" mean?
  // What kind of TargetBase should be filled in here?
  TargetBase target_host;
};

Here we’ve repeated the name “target” a few times and made it unclear how to fill in the data structure. If we are to reuse such an ambiguous structure, I believe that we should avoid ambiguity so it’s clear how we intend for people to use it.

Thanks @areusch, to further build on your comment.

The main property that we want to preserve (from the current target system) is a common base class of possible configurations that represent V2; depending on how the dashed box is drawn it can range from a singleton (e.g. device-only CUDA), to a part of the composite (with the most common case being TargetWithHost), to the entirety of V1.

To build on the recommendation that leaf components be separated, and to give an example under @kparzysz's terminology (Architecture being the layout and Target being the component – leaving out the naming itself for now):

// No virtual device is needed as compilation for TIR function
// is generally applicable to any virtual device
class DeviceOnlyArch : public Architecture {
  public:
   Target device;
};

class DeviceWithHostArch : public Architecture {
  public:
   Target device;
   Optional<Target> host;
};

// Virtual device needed for graph level runtime information validation 
class PackagedArch : public Architecture {
  public:
   List<VirtualDevice> devices;
   Target host;
   Runtime runtime;
   Executor executor;
};

Note that a different architecture will certainly result in a different compilation pipeline, which can be decomposed into pipelines for some of the sub-architectures – as a result, dispatching on the kind or structured view is helpful here.

Depending on the phase of compilation and its state, a function can sit at different levels of constraints (Architectures), specifying the deployment constraints (and informational hints) for that function, ranging from PackagedArch to DeviceWithHostArch, then finally DeviceOnlyArch.

In the original view, an Architecture itself can be any meaningfully grouped subtree in the global settings; as a result, the leaf itself can also be viewed as a subtree. That was the original rationale of the Target system, and personally I do not find a strong difference between the two. But I also acknowledge the advantage of being able to separate out leaves as being special. The main thing to preserve is the ability to specify the architecture (of a subtree) throughout our divide-and-conquer process of compilation.

Just to be clear about each of these cases – could we explicitly state their uses in the thread so everyone is on the same page? I think there might be questions about why you’d ever pass DeviceOnlyArch to tvm.relay.build().

Just to build on the current use case in the UX.

The most common setting we pass to build is DeviceWithHostArch (right now it is tvm.target("cuda", host="llvm")), which hopefully gets canonicalized internally to a PackagedArch with good defaults.
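
A minimal sketch of that canonicalization, reusing hypothetical Arch shapes mirroring the earlier example (the defaults filled in here are illustrative only):

from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical Python mirrors of the Arch classes sketched earlier.
@dataclass
class DeviceWithHostArch:
    device: str
    host: Optional[str] = None

@dataclass
class PackagedArch:
    devices: List[str]
    host: str
    runtime: str = "cpp"      # illustrative default
    executor: str = "graph"   # illustrative default

def canonicalize(arch: DeviceWithHostArch) -> PackagedArch:
    # Fill in good defaults when only device/host were given.
    host = arch.host or arch.device
    return PackagedArch(devices=[arch.device], host=host)

packaged = canonicalize(DeviceWithHostArch(device="cuda", host="llvm"))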

In a world where build is modularized and can take any IRModule during an intermediate stage of compilation, we could expect an IRModule that comes with a collection of functions already constrained in some way (due to previous passes), each function carrying some constraint arch attribute (as DeviceWithHostArch, DeviceOnlyArch, or some other variant).

A build function takes this information into account to build the final module. Such an IRModule could still contain a PackagedArch attr at the IRModule level, assuming that the constraint for the global module is consistent with the specific choices derived at the function level.

Again, the need for V2 comes from the need to specify such constraints through the divide-and-conquer phases and to be able to represent that intermediate state and the constraints for future passes.

I think there is a difference between what is being proposed here (an argument to tvm.relay.build) and what is annotated onto an IRModule function. This proposal discusses adding an attr to the top-level IRModule with what’s called Architecture here. I do not believe we have tackled the question of what should get annotated onto a particular Function.

Your example is of someone providing an IRModule with such annotations – in the context of this proposal, we’re just talking about the top-level annotation. Given we are also discussing canonicalization, I think there was an expectation on my side that anything less than PackagedArch passed to tvm.relay.build would be canonicalized before being attached to the IRModule, and therefore consumers of the IRModule should expect only PackagedArch on the IRModule.

Does that agree with your understanding, and is there any such use case you know of that could not annotate PackagedArch? The one I am thinking about is tvm::build, which is used in automation. I think we do need to accommodate that use case here, but it’s not so interesting as a counterpoint or example at this level of detail, since it’s filled by the automation infrastructure and we could simply adapt that to follow what makes sense based on more pressing design requirements.

To answer the specific topic of canonicalization, I think we agreed on canonicalization itself: narrowing down to the context of the relay.build convention, I think it is helpful, i.e. relay.build simply canonicalizes and attaches a PackagedArch to the IRModule. That is also what we previously agreed to, I believe, in the PackagedTarget proposal. Note that in a broader build context (e.g. an IRModule might only contain TIR functions) PackagedArch may or may not make the best sense; however, that can be left out for now, as the PackagedArch attr requirement in the context of relay.build is quite reasonable.

Now on the broader discussion, it might be good to come back to the goals:

  • G0: Having a struct attached to IRModule
  • G1: Having a struct attached to Function specifying the build constraints of the function
  • G2: Ability to refer to such a struct through simple tagging (e.g. "aws/c4.xlarge") and recording
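
For G2, a hedged sketch of how a tag could resolve into such a struct; the tag string is the one used above, but the table contents are entirely illustrative and not a claim about what is registered in the codebase:

# Illustrative tag table mapping short names to full configurations.
TAG_TABLE = {
    "aws/c4.xlarge": {
        "host": {"kind": "llvm"},
        "devices": [],
        "runtime": "cpp",
        "executor": "graph",
    },
}

def from_tag(tag: str) -> dict:
    # G2: resolve a recorded tag into the struct used for G0/G1.
    return dict(TAG_TABLE[tag])

config = from_tag("aws/c4.xlarge")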

One of the key things that we would like to preserve is the ability to have a struct that covers our V2 needs throughout the phases of divide and conquer, such that the base struct can be used to directly serve G0, G1, and G2.

Of course it is tempting to simply focus on G0, which I believe leads to some of the reasoning that what gets annotated onto functions is less relevant. However, from the overall architecture point of view they are relevant in terms of design redundancy and simplicity. This also considers the fact that we already have a design currently being used (the target in some recursive form, although not favored by all) that covers the three goals (G0, G1, G2) and V2 overall. Introducing two structures effectively means increased complexity, and there likely needs to be a separate mechanism to handle G2.

There are some disagreements on the particular choice of data structure (decoupling components), which is being addressed in the latest discussions. The latest set of discussions comes back to the need to use Architecture to represent a spectrum of sub-trees per V2 (instead of simply V0), which aligns with G0, G1, and G2 – a positive direction that aligns with the goal.

My idea for this seems a bit different, but maybe the difference is only superficial. Let me present what you stated here, but in the form I imagined, so we can see if our views match.

First, we have some set of components. These represent hardware blocks, and we can think of this set as a database of known processors, accelerators, etc. Let’s say we have

  // I don't know specific names, but "NVIDIA_GPU_type1" could stand for "RTX3080"
  // or something like that.
  Component NVIDIA_GPU_type1;
  Component NVIDIA_GPU_type2;
  Component AMDGPU_type1;
  Component X86_64_type1;

Then, for describing a specific system, we’d create an “architecture”:

  Architecture = {
    Components = [X86_64_type1, NVIDIA_GPU_type1, NVIDIA_GPU_type1];
    // Abbreviate C[x] = Components[x]
    Connections = [(C[0], C[1], "uni-directional"),
                   (C[0], C[2], "uni-directional")]
  }

This would represent an X86 with two GPUs, where the X86 can actively communicate with each GPU, but GPUs cannot actively communicate with anything.

We could then say

class DeviceWithHostArch : public Architecture {
 public:
  int host;  // Components[host] is the host.
};

Making architecture a member of DeviceWithHostArch would probably be better, but the idea is the same.

Seems like the main difference is that you put Target in the derived classes, whereas in my idea, targets (components) would be listed in Architecture. The components list in the architecture could have additional properties, like OS:

Components = [
  (X86_CPU_type1, Linux),
  (NVIDIA_GPU_type1, baremetal),
  (NVIDIA_GPU_type1, baremetal),
]

The idea is to have a set of building blocks, and a way to represent structures that we can build from them, such that we can add more blocks without having to modify anything else (to enable their use).