[pre-RFC] Compilation Configuration Representation

Thanks everyone for the discussion so far. We have already gathered a lot of information about the goals and possible intentions of the design. One thing is pretty clear: the particular choice of data structure has a decent impact in a few areas.

Before suggesting a concrete choice, I would like us to pop up a level and think about the hidden question behind this discussion – how should the TVM compilation pipeline, or compilation pipelines (assuming there are many kinds of them), be “configured”?

To help clarify the question, I drew the following diagram. A generic flow in TVM can be roughly summarized as follows:

  • We start with an IRModule (modA), possibly already optimized by the user or some previous passes
  • We run a set of transformation passes on modA to get modB
  • We then generate an rt.Module from modB, in order to get it running on a specific platform of interest (e.g. a specific board).

We can find that there are roughly three kinds of “config-like” options appearing in this flow that can affect the final outcome.

  • A0: The specific options used in transformations (e.g. how aggressively we want to inline)
  • A1: The build “constraints” of the platform of interest; this can be the instruction set (x86 or ARM) or runtime constraints (crt, packed-api vs unpacked-api).
  • A2: Within the IRModule itself, there can be additional constraints on existing functions. Imagine that a previous pass/user decided to optimize my_func for an NVIDIA GPU and has already generated a call to my_func via the CUDA runtime API. Then follow-up optimizations will need to respect that “constraint”.

To some extent, these categories are inter-correlated with each other. For example, if we have a final platform constraint that does not support a vector unit, then it means that we will need to disable vectorization.

Nevertheless, there are still two very distinct types of configuration here:

  • C0: In the case of A0, we are mainly interested in “how”, i.e. procedurally what we do with the program. In many cases, regardless of the transformation choices (e.g. inlining), the final outcome can run on the platform of interest.
  • C1: In the case of A1 and A2, we are declaring “constraints” imposed by the final platforms of interest (e.g. must have a vector unit, must use the unpacked ABI). This constraint information does not dictate “how” we run the optimizations, but can provide additional information for certain specializations.

The distinction between the two types is really important here. Coming back to the general goal of TVM: we want to enable composable optimizations of programs. Sometimes this means that some previous stages of program transformation are done by another developer and then fed into follow-up stages.

C1-type config is something that we usually want to preserve as part of the IR or a log. For example, if BYOC pre-decided that a certain function should be transformed to run on CUDA, then its caller must know that constraint and call it using the CUDA runtime API. Such constraints need to be reflected in the IR (i.e. the IRModule) itself, so that follow-up passes can respect and make use of the information.

C0-type config, however, does not need to appear in the IR (or the intermediate data structure of interest). Imagine that we choose to separately inline my_func before handing the module over to the current pipeline. Because the transformation is already done, follow-up transformations do not need to know this information; the IRModule after transformation is already self-contained.
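To make the distinction concrete, here is a minimal sketch using today's TVM Python API. The PassContext options are C0-style, while the function attribute stands in for a C1-style constraint (the attribute name “Target” here is purely illustrative – how to attach such constraints is exactly what this thread is discussing):

import tvm
from tvm import relay

# A trivial module for illustration.
x = relay.var("x", shape=(1, 4), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

# C0 ("how"): pass-level options, scoped to one pipeline invocation and
# not recorded in the IRModule afterwards.
with tvm.transform.PassContext(opt_level=3,
                               config={"tir.disable_vectorize": True}):
    lib = relay.build(mod, target="llvm")

# C1 ("constraint"): information attached to the IR itself so that
# follow-up passes must respect it (attribute name illustrative).
func = mod["main"].with_attr("Target", tvm.target.Target("cuda"))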

Some of the discussions here started from a single monolithic pipeline; under that scenario alone it can indeed be very tempting to consolidate everything into one configuration of interest. I would encourage us to look more broadly at the composability perspective, since composability is the key to encouraging collaboration without pinning down all the details of a pipeline. Some of the discussions also touch on this perspective. A very natural consequence of the reasoning here is that we need to distinguish C0- and C1-type configurations at the fundamental level (folding the pass-context config into the whole config might go against this principle). Of course this does not preclude creating a unified option interface (e.g. at the tvmc level); just at the level of compositional optimizations and the things we put into the IRModule, we need such a separation.

Another related topic is whether or not C1 type configurations can benefit from future dissection/clarification, or if there is enough common ground here to have a consistent branding.

@tqchen thanks for these contextual remarks.

I would like to point out that in targets where we emit something closer to machine code (e.g. edge targets, hexagon, etc), C1-type config can actually inform C0-type config. For example, we may want to run additional passes or modify pass configuration based on the platform chosen in order to apply platform-specific optimizations. So I am really not convinced they are fully separate.

Recording some discussion here between @mousius @manupa-arm @tqchen @junrushao and myself for posterity:

  • A concern with this proposal is that it may cause duplication of configuration on the IRModule. That is to say, if we add CompilationConfig as an IRModule attribute, there is still a need to identify for each function in an IRModule: what sub-Target shall it run under, and what other Targets may it invoke? These questions have bearing on the codegen (e.g. when considering how to implement tir.call_packed_lowered) and on the Executor (when considering the state that may need to be passed down into a function and any additional execution engines which may need to be configured in order to run a function).
  • Meanwhile we still have yet to see a clear motivating example as to why we need a recursive definition of Target. @junrushao and @tqchen could provide some follow-up to this point.
  • There has been some suggestion that autotuning log keys could be defined at a high-level as “C1-type config.” I disagree with this suggestion, as I think it’s likely that both the autotuning method (e.g. AutoTVM, MetaScheduler, AutoTIR) plus the specific runtime/executor config play into this. I think each tuning method is going to need to define a way to cull the Target or CompilationConfig in order to define what goes into a tuning log. If there is agreement on this point, I would like us to focus discussion on this RFC thread around ensuring that whatever data structure we choose here makes it easy to accomplish this culling process independently of considerations of where to place configuration.
  • Finally, this RFC started by proposing an improvement in the user-facing configuration; however, it seems that the part of it which is causing most controversy is that it affects the compiler’s internal configuration state. It may help to have a more focused RFC to collect community feedback around how we should configure the compiler at the IRModule level. Meanwhile, to address the concern of duplicating state above, it would help to see a sketch of how this proposal might suggest we replace Targets at the IRModule and function level. Originally this was left open, but I think it would help to clarify a bit further to understand the impact of this RFC.

You are right that C1-style config can inform the pipeline choices of C0-type config, but not necessarily the other way around (as covered in the discussion). This is indeed not a clear cut, but the separation is still useful to think about.

Just to clarify, one of the main motivations for this is the tvmc argument --config, which can be directly translated into the CompilationConfig; however, the structural improvements made using the configuration illustrate how this provides improvements throughout the TVM stack. I didn’t mean to encourage the notion that only the tvmc flow was considered when presenting this RFC.

In TVM today, Target annotations are attached to BaseFunc as part of the main compilation flow; as this RFC does not aim to replace this mechanism, it would result in a flow such as:

Bear in mind, the only change that has been made is that the source of truth for the information is now gathered into the CompilationConfig and provided as part of the IRModule; everything else exists in the TVM compiler today.

I would challenge the conclusion that the distinction is important; to a user, the differentiation and different placement of information generally leads to confusion. I’m also unsure where we’re removing composition by allowing users to take an IRModule complete with its configuration and transfer that? This seems like an overall improvement in composability to me rather than the current side-loaded configuration in PassContext, which has no real structure today. What this leads me to think is that we should introduce CompilationConfig and use it as a mechanism to force better design choices in the way we handle options that can be transferred as part of an IRModule and better aid composability in TVM.

Compositionality is a fundamental philosophy, so please allow me to elaborate a bit more here. One great example to show its importance is the design of deep learning frameworks.

In a deep learning framework, the concept of layers is composable. I can take a residual layer and compose it with a softmax loss function and an optimizer. These layers are abstracted under a common interface, nn.Module. Each Module transforms an object of interest – the Tensor.

The Tensor itself can be viewed as containing certain C1-type constraint information, such as the shape, the data content, and the device it resides on. Some of the layers (e.g. CuDNNLayer) may only work under the constraint of a GPU device.

Importantly, there are also C0-type configurations, for example the number of hidden neurons or the number of stages of residual connections. This information is not part of the Tensor; the Tensor itself contains the minimum but sufficient information for follow-up layers to apply further transformations. Attaching more information to the Tensor would create more constraints, and possibly confusion about how to handle those attributes.
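To make the analogy concrete, here is a small PyTorch-style sketch (purely illustrative):

import torch
import torch.nn as nn

# C1-like information travels with the Tensor itself: shape, dtype, device.
x = torch.randn(8, 32, device="cpu")

# C0-like information (e.g. the number of hidden neurons) lives in the layer
# configuration, not in the Tensor.
layer = nn.Linear(32, 64)

# A follow-up layer only needs the resulting Tensor; it does not need to know
# how many hidden neurons the previous layer used.
y = layer(x)
print(y.shape, y.device)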

Deep learning frameworks are maximally composable; we can compose a residual block with a classification loss(softmax) or detection loss to form different models. These layers can be developed by different developers.

In summary, composability is obtained by decoupling information and clarifying the minimum but necessary information in the key data structure of interest.

Coming back to the case of TVM: the IRModule is a bit like the Tensor, and we are talking about putting all the configurations of the layers, as well as the device information, into a centralized place. If we are only looking at one official model (say resnet-256), this can of course help clarify all the options available, but it would restrict the evolution to a single pipeline (and force all the options into that one place). A deep learning framework approach would be to keep only the minimum information (C1-type) in the key data structure, handling C0-type config separately. For specific applications, there might be a centralized configuration (e.g. argparse) which informs the C0-type config, but that centralized config is not part of the Tensor.

In summary, putting all the configurations (C0 and C1 kinds) into a single place will certainly improve clarity if we only have a single pipeline in mind, but the C0-type configuration brings unnecessary and sometimes confusing information. Remember that pass writers generally need to take the constraints in the IRModule seriously; having C0-type information in the IRModule would make developers wonder whether or not it should be considered (an open question that grows with the set of passes).

In the spirit of the minimum-but-sufficient principle, we want to limit the information attached to the IRModule to C1-type information. Note that this does not preclude a high-level single pipeline from building a centralized config which then propagates to the lower-level mechanisms. I believe that was the original intention, and the main reason I brought up compositionality is that at the level of IRModule and passes we will need to consider the separation carefully.

Since the topic of compositionality is quite important, let us also study a few more examples:

Example 0: Imagine that we want to try out the following. stage0: try different unroll factors or vectorization factors in a loop, benchmark them, and compare the results; stage1: send the result to the follow-up lowering optimizations with another set of configs. In this case the C0-type config (e.g. the unrolling factor) in stage0 is not relevant to stage1. Given the collection of different choices tried by stage0 in a loop, it is also not entirely desirable, or even possible, to centralize the configs into a single data structure for this case.

Example 1: Imagine that people want to build alternative compilation pipelines with a different set of configs (e.g. running quantization and then building). In this case it may not be desirable to couple the two types of config together, since each pipeline may only care about one set.

We can find that most of the examples come from alternative optimization pipelines and choices that may differ from the current build pipeline. These are, however, important cases to support, so that they can either be incorporated into future pipelines or simply enable more preprocessing choices that compose with the build pipeline.

Compositionality does not imply that we need to ask everyone to use the same config for all possible pipelines. Instead, the main implication is to clarify what is minimum but necessary (and needs to be considered by all passes/pipelines), while leaving out other parts, so that it leaves room for flexibility for others.

Coming back to the particular topic of this RFC: I think we acknowledge that it could be useful to have a centralized config for a single tvmc pipeline, which can help to bring clarity. We also agree that the discussion is not about changing the set of information, but mainly about how we organize the information. The main point of compositionality is to carefully dissect the two kinds of configurations when it comes to putting information in the IRModule, and how the two kinds of configurations interact with passes.


Let me try to summarize the conversation as I understand it – please feel free to correct me if it’s wrong.

It mainly boils down to the following point:

What should be attached to an IRModule and what shouldn’t? According to @tqchen’s description above, it should be C1-style “constraints” and not C0-style “how”. The argument is that C0-style information is configuration for passes and not broadly applicable to all transform passes, and thus confuses pass implementations with the choice of what to do with it.

According to the definitions of C0 and C1, the above information should be classified as C1. Therefore, are we all agreeing that it is reasonable to attach it to the IRModule?

As a first step, if we all agree, it would be great to unblock ourselves and use a C1-styled CompilationConfig attached to the IRModule to proceed in the short/medium term. @tqchen @Mousius @areusch – an explicit response to this question is highly appreciated :slight_smile:

Now coming back to today’s state of TVM, C0-style broadly refers to PassContext – I’m not aware of anything else. Therefore, the current point presented argues against putting the C0-styled PassContext either a) as an IRModule attribute or b) as part of the C1-styled CompilationConfig that is already agreed to be attached to the IRModule.


Then, for future work, we should think about the “necessity” of keeping the C0-styled PassContext as a side channel (or in fact a singleton). IMHO, this contradicts slightly @jroesch’s proposal of integrating the whole compilation pipeline as IRModule → IRModule transformations, by committing ourselves to maintaining a side channel driven by the need to separate C0-styled and C1-styled information.

Therefore, it would be great to explore options for how to attach/package all the information that “might” be required by passes (C0+C1) – not just the “minimum” (C1). We thought this could be done by attaching it to the IRModule, so that we could export it without requiring any other side channels. However, we are open to hearing alternatives.

Thanks @manupa-arm, these clarification points are helpful.

First of all, the discussion of C0-style and C1-style is more at the design-principle level and does not tie into the actual choice of data structure or implementation. In some cases the principles could imply certain preferences, e.g. calling a C0+C1 combo a CompilationConfig certainly makes sense, while if it is a C1-only thing, something in the target namespace is more natural. But let us first separate these concerns and not talk about the choice of data structure.

The main design principle is as follows

At the IRModule and individual-pass (IRModule → IRModule) configuration level, C1-type config should be attached to the IRModule, while C0-type config should be handled separately (and not attached to the IRModule).

As a first step, if we all agree it would be great to unblock ourselves to use a C1-styled CompilationConfig attached to IRModule to proceed in short/medium term. @tqchen @Mousius @areusch – An explicit reponse to this question is highly appreciated

We agreed that the above information (executor, runtime) is part of the C1-style configuration and that it is helpful to introduce a data structure to store it.

We were originally discussing whether to use target.packaged (as target was the namespace used for constraint-style configurations, consistent with the previously accepted composite target RFC) or to introduce a separate data structure, CompilationConfig (this RFC).

Regardless of the choice of data structure, it is going to unblock the features in question, since both proposed data structures are isomorphic.

The main intention of the last few posts, however, is to clarify the design principle, since this has a bigger impact than the choice of data structure.

. IMHO, this contradicts slightly with what @jroesch proposal of integrating the whole compilation pipeline as IRModule → IRModule transformations by committing ourselves to maintain a side-channel driven from the need of the separation of C0-styled vs C1-styled information.

The discussion of separation does not advocate for the use of PassContext or any particular data structure. It mainly focuses on the importance of separating the two types of information and keeping only the C1 style in the IRModule.

This does not contradict IRModule → IRModule transformation as the whole pipeline; it is actually a faithful realization of it. The main motivation for IRModule → IRModule transformation comes from the need for compositionality. We have already had extensive discussions on this point in some of the previous posts.

The IRModule → IRModule principle originated from the Tensor → Tensor principle in deep learning system design, where the key data structure (Tensor or IRModule) contains all the necessary information to carry through the sequence of actions. It does not imply, however, that all the configurations of previous actions (which are irrelevant, and sometimes impossible to list comprehensively if we are in a loop) should be recorded in the data structure, as we can see in the example of deep learning framework design.

Finally, I want to say that there can be a need for a C0+C1-style config (let us call it C2) in a high-level application (one realization of a pipeline) that composes the passes together. For example, there can be a train_resnet application that comes with an argparse opt that contains the learning rate and layer configuration as well as the device we want to run on. The C2 object then separately configures the C0 and C1 configs using the lower-level mechanisms (where they live separately) and drives the end-to-end compilation.

My read of the RFC is that there is a desire to have something like that, following the precedent of deep learning framework modularization. What that implies is to keep the C0 and C1 mechanisms, and perhaps introduce a C1 target.packaged object as the data structure attached to the IRModule, then also introduce a C2 CompilationConfig that ties only to one compilation pipeline (perhaps the default one used by tvmc) at a different abstraction level. The C2 config will populate the C0- and C1-style configurations.
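As a rough sketch of what such a C2-level driver could look like (the option names and flow are hypothetical; the point is only that the C2 config populates C0 and C1 through the existing lower-level mechanisms):

import argparse
import tvm
from tvm import relay

parser = argparse.ArgumentParser()
parser.add_argument("--target", default="llvm")           # informs C1
parser.add_argument("--opt-level", type=int, default=3)   # informs C0
args = parser.parse_args()

def compile_module(mod):
    # C1: constraint-style information, expressed as a Target.
    target = tvm.target.Target(args.target)
    # C0: procedural options, scoped to this one pipeline invocation.
    with tvm.transform.PassContext(opt_level=args.opt_level):
        return relay.build(mod, target=target)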

This would indeed bring a bit more duplication. But, as in the case of deep learning frameworks, such duplication is necessary for modularity at different abstraction levels – mainly because centralizing everything in C2 is neither sufficient for all possible composable pipelines (see the previous examples on search loops and alternative paths) nor minimal for pass writers aiming to increase composability.

This is a case where precedent designs (deep learning frameworks) are really mature and can serve as good reference points. It is also a case where the lesson of deep learning frameworks shows that this choice is critical to the success of the framework as a whole, so it would be good for us to consider it.

Hi @tqchen,

I appreciate your reply with further clarifications, though I’m struggling to reconcile them with the original RFC presented here.

Although they’re isomorphic in structure, the packaged Target has no proven advantage and serves to increase the overall complexity of any additional work in TVM due to the considerations of a potentially recursive Target. I would need a strong motivation for implementing such a complex design, given this RFC aims to reduce complexity by creating explicit structures.

As in my previous post, I didn’t mean to encourage the notion that only the tvmc flow was considered when presenting this RFC. C2 is where tvmc is right now, working around the limitations of the TVM API; for graph partitioning, tvmc creates its own pipeline on top of TVM to make the feature simple to use. The RFC is therefore aiming to bring some of the learning from tvmc back into the main compilation flow, with some of the advantages listed in the original post pointing towards the construction of a configuration, whether C0 or C1, for use in any compilation flow.

Taking a step back, if we consider this RFC to be adding the C1 type of configuration, is the requirement for moving this forwards that we must also define a mechanism for C0 configuration? Or can we leave dealing with the global state of PassContext to a future RFC, wherein we can discuss how to better manage C0 configuration?

Furthermore, if we accept that C1 configuration can be attached to an IRModule, what prevents us proceeding with the CompilationConfig initially suggested given we still have yet to see a clear motivating example as to why we need a recursive definition of Target?

The advantage of the packaged Target has been extensively discussed in our previous posts.

To clarify, in production there exist non-trivial use cases for Target. For example, there might be a CPU + DLA + GPU case, where Target does need to be expressive enough to represent the combination. As the simplest example, the config of a Jetson is:

TVM_REGISTER_TARGET_TAG("nvidia/jetson-agx-xavier")
    .set_config({{"kind", String("cuda")},
                 {"arch", String("sm_72")},
                 ...,
                 {"dla", SomeOtherTarget},
                 {"host", SomeOtherTarget}}});

In our general principle, we do need C1-style configuration for mixed-device compilation. Notably, this configuration could differ from the IRModule-level annotation, if we intend to restrict the possible constraints during optimization.

Second, in BYOC, there is an actual need to pass in additional recursive configuration. For example, some BYOC targets need additional configuration; as TQ mentioned previously, the composition below is a possible need:

- host: x86
- vdevice0: byoc-myawesome-cuda
    - host: x86
    - runtime: cuda-graph
    - vdevice0: cuda
    - library: tensor-rt
- vdevice1: cuda
- runtime: vm

Overall, the design argument here is a subjective matter. As we can see, in the following discussion, introducing a separate class for single&multi-device constraint also brings additional design and engineering complexity for logging/tagging and the overall compilation pipeline, so it’s really a trade-off.

Notably, a packaged Target doesn’t mean it is unstructured or that it encourages arbitrary recursion; we can of course enforce schema validation to make it structured and ensure correctness here.

Additionally, we do see benefits to having a common base class for the C1-type data structure. From the automation point of view, we need to record the constraint in different cases, and having a common base class (Target) would help with tuning-log serialization for both single- and multi-device functions. Furthermore, it also brings additional benefit in terms of consistency of representation. As a real-world use case, if the constraint of a function is annotated as DLA + GPU, it’s relatively easy to narrow it down to a GPU-only function if we use the common Target class here – and in this case, it’s helpful to represent the DLA + GPU and GPU-only constraints with a common data structure for consistency; the same idea applies to the host/device split pass in TIR.

Finally, we would love to reiterate the advantage of the packaged Target: from our PoV, it helps with more generic use cases and maintains the clarity of TVM’s compilation pipeline.

Thanks @Mousius. I don’t think we need to settle on mechanisms for C0 configuration. The original intention of the RFC appeared to be C1 style, but then the discussion drove it towards C0 + C1 style.

So I agree that we should figure out the data structure choice for C1 style configuration in this post.

We all understand and agree on the possible advantages brought up by a single-point setup. It is just that there are also other considerations in terms of compositionality, consistency, and extensibility, as some of the posts brought up.

The suggestion of a C2-style CompilationConfig that translates into C0 and C1 style at the lower level is actually meant to serve as a reconciliation here, learning from the previous precedent in deep learning frameworks.

There is some need to name a specific configuration of Targets, which is present in the CompilationConfig and matches how the TVM partitioning pipeline currently behaves without the configuration (it is already planned to use named configurations in tvmc, and we’re not planning to remove Target tagging). I can’t see the lines you’re referring to as motivating, though:

There is precedent for using the list of Targets successfully in tvmc’s Target parsing infrastructure and as part of a multi-Target workflow (see: microNPU Demo), which is being actively used and extended to incorporate multiple devices and Targets.

This is also supported as part of the existing BYOC flow currently in use for multiple Targets, as evidenced by the aforementioned demo where the microNPU is configured separately. Extending this further is definitely something to explore, but given that the functionality exists today it is unnecessary to do it as part of this iteration.

Based on the evidence presented in this post, the current behaviour of the TVM codebase is demonstrated, showing that CompilationConfig is a step towards further supporting the features which already exist without the need for an increase in complexity for Target. Introducing a different mechanism for this is unnecessary given the existing functionality and this RFC is in fact a small iteration on the existing approach to codify the behaviour.

Target within this implementation is similar to a dict with a schema? Why is this favoured over an explicitly typed class which has the same benefits but additionally has compile-time checks and a clearly documented form in code? As for serialization, encapsulating the auto-tuning serialization logic in a configuration object would give auto tuning a clear boundary to work from, which can still invoke the various Targets’ serialization functions. I don’t see this as any additional effort over implementing such a function for a packaged Target.

In the post What is ‘target’ in TVM? it is clearly demonstrated that this overloaded use of the term Target does not create clarity but introduces further confusion to both users and developers. This has also been articulated over the previous posts.

Discussed further to understand @junrushao and @tqchen’s perspective today. Here are some notes:

  • Summarizing TQ/Junru’s proposed config philosophy: Compiler configuration can be categorized into two rough categories:
    • C0: Configuration which describes the method by which a part of the compiler should work. Most often, this config affects only one pass and can be thought of as configuration specific to the Pass. Currently we store this in PassContext. Most (all?) C0-style configuration should not be used from one pass to configure a downstream pass.
    • C1: Configuration which describes the environment where the generated code will run. Here, based on precedent from other deep learning frameworks (@tqchen can cite the specifics), we prefer to express this in terms of a set of constraints. Based on that precedent, constraints are a good representation because they allow the compiler internals to remain as flexible as possible, so that the compiler can continue to evolve and compose as new libraries, targets, and deployment strategies are added.
      • Because these are constraints, we should choose a representation that specifies the range of each axis. Passes should be viewed as a set of processing steps which may modify (typically, narrowing) the range of each axis. When compilation is done, the compiler generates code to match the remaining constraints (a small illustrative sketch follows after these notes).
  • One of the goals of CompilationConfig is to consolidate compiler configuration so it’s easy to manipulate on disk and simple to understand the supported configuration.
    • CompilationConfig shouldn’t be seen as replacing C0 (e.g. PassContext) or C1 (e.g. Target) style config. Andrew: it should be seen as a composite data structure containing a representation of both. This allows parallel compilation flows to be built and maintained independently of CompilationConfig, which certainly could be used elsewhere but is primarily motivated by tvmc.
    • As a starting point, it would be great to consider CompilationConfig as the method to specify configuration for the tvmc-based flow rather than as the singular way to configure all tvm.relay.build.
    • Andrew: A desirable property of CompilationConfig is that the pieces of the composite struct which correspond to the compiler internal configuration are trivial representations of the actual structures used internally.
    • For PassContext: this is essentially restricting the data types of the values and defining a serialization into an e.g. yaml or json map.
    • For Target: this gets more complex. We can split this problem into a couple parts:
      • Let us define a Leaf Target as that subset of a Target configuration specific to a single Target subclass. In other words, exclude any relation between targets and the Executor and Runtime configuration. This part is essentially a schema’d version of the PassContext problem.
      • More complex are the Executor, Runtime, and “packaged” Target proposals discussed earlier. Complicating the problem is that these items form program-level constraints, but some parts of these could be function-level constraints. For now, the compiler builds only one such type of program (e.g. a single program per module, if memory serves). This may change in the future. Additionally complicating the problem is that there are some examples of sub-programs which may be driven by an executor, thus needing similar information. And finally, we have currently already baked much of this into Target via the executor parameters (which were recently split out but also the subject of this continuing RFC) and via target_host.
      • This RFC doesn’t need to resolve a proper way to model all possible program constraints, but if we are attempting to choose a way to model this constraint such that it can be reflected trivially into CompilationConfig, we should choose a system that can be easily extended to describe a flexible set of constraints, so that people adding new types of executor relations (e.g. creating sub-programs with Executor constraints, similar to the TVM Unity effort) aren’t hampered by this config system.
      • So long as we are able to build an extensible system, we could probably start with a Target equivalent which lacks a recurrence relation. It’s an open question how this should be reflected in disk configuration.
      • The struct which defines the whole-program constraint should probably continue to be called Target to avoid confusion. As we explore sub-program constraints, we may want to either extract pieces of Target into a common base class (at least the parts that handle the schema). It may be wise to extract Leaf Target information into a separate class with a better name.
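To illustrate the “constraints as axes which passes narrow” idea in isolation, here is a hypothetical sketch – none of the class or method names below exist in TVM today:

from dataclasses import dataclass
from typing import Optional, Set

@dataclass(frozen=True)
class DeviceConstraint:
    # None means "unconstrained"; otherwise, the set of still-allowed backends.
    allowed: Optional[Set[str]] = None

    def narrow(self, candidates: Set[str]) -> "DeviceConstraint":
        if self.allowed is None:
            return DeviceConstraint(set(candidates))
        return DeviceConstraint(self.allowed & candidates)

c = DeviceConstraint()                   # fully unconstrained
c = c.narrow({"cuda", "rocm", "llvm"})   # the platform supports these backends
c = c.narrow({"cuda"})                   # a pass pins this function to CUDA
print(c.allowed)                         # {'cuda'}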

cc @mbs-octoml


To be a bit pragmatic about progress here, I would propose we do the minimum step: what we are after is a better representation of C1-typed information in the compilation flow.

Could we leave bringing C0-styled information into it to a separate RFC? It is proving complex to solve all of this together.

I personally see this as the first step of many that we want to solve, so let’s get this sorted :slight_smile:.

We are fine as long as we don’t use/overload the same data structure for both (leaf and non-leaf). We can discuss what a good name for this would be.

I agree with @areusch here that the current state of TVM builds only a single program, and I would think this RFC does not block any future RFCs that wish to support multi-program execution/partitioning.

I don’t think we are mandating a “freeze” on the non-leaf target data structure in this RFC.

Therefore, it would be wise for us to propose the extensions when and where such needs arise. As a community, we should try to discuss the levels of API and the partitioning strategy, which will nicely motivate the additions to the non-leaf Target to support multiple programs.

@Mousius and I spent a few cycles thinking about this… We reached the conclusion that what we are after is the separation of non-leaf targets and leaf targets. We have proposed here to call the former CompilationConfig and let the latter remain Target. However, after the discussion, it seems it also makes sense to keep the non-leaf target as “Target” – if that is meaningful and reduces confusion – while we look at renaming the leaf target to something else (e.g. Backend).

@Mousius any more thoughts ?

I discussed this with @tqchen, @junrushao, and @mbs-octoml. tl;dr we are broadly in agreement with this RFC and we think it can proceed.

This post will start by re-summarizing our understanding of the motivations for such an invasive IR change. Then, it will cover the controversial parts and explain the various approaches. Finally, it will summarize our positions and conclude with our opinion of the best way forward.

This thread was incredibly long. Now that the format of the TVM Community Meeting has changed, I’d suggest we bring further discussion of large design changes like this one to those meetings for higher-bandwidth discussions.

Motivations for this Change

This RFC proposes to overhaul the way the TVM compiler is configured. The motivation behind this is to export the compiler configuration into a human-readable format (e.g. YAML) that can be consumed by a command-line tool (e.g. tvmc).

Additionally, there is a desire to place the full target configuration in the IRModule somewhere as an attribute so that it can be used in various passes (@Mousius and @manupa-arm, would be great to re-clarify this).

Classes of Configuration Affected by this Proposal

A discussion point that arose midway through this RFC is around the classification of configuration involved with this proposal. @tqchen proposed two classes:

C0. Configuration that directly specifies how some process in the compiler is carried out. It’s important to consider this in the abstract when understanding the motivations for the decisions here. In practice, it’s good to note here that in the codebase today, this roughly is PassContext.

C1. Configuration that specifies constraints on the compiler without giving a specific way to accommodate them. This configuration typically specifies properties of the deployment environment. The sentence in C0 about considering this in the abstract also applies here. In practice, it’s good to note here that in the codebase today, this roughly means Target.

Point of Clarification: this RFC is confined to C1-style config. A follow-on RFC may consider C0-style config.

What can be attached to an IRModule?

This RFC proposes that we attach the full CompilationConfig to an IRModule. Before the previous point was clarified, this was contentious. We discussed at length the question of what style of configuration should be permitted to be attached to IRModules. The resolution was that there is consensus that C0-style config should not be attached to IRModules because it may create behavioral coupling between Passes which could be difficult to unit test. There is a strong desire to avoid coupling between Passes to keep them composable and retain flexibility in the compiler.

The result of this discussion was a decision that CompilationConfig itself should not be attached to an IRModule; rather, that C1-style config it contains (namely, the Target information) should be attached instead.

Why attach C1-style CompilationConfig to an IRModule?

There is one question unanswered in the previous section: what is the motivation for attaching C1-style CompilationConfig to IRModule? There are two points to make here:

  1. There was a need by ARM folks to reference the Target from some passes [@mousius @manupa-arm it has now been so long since we discussed this I have forgotten which one required this—feel free to add it in]. Target is an object currently passed around the compiler on the stack as necessary. Last year, @jroesch began an effort to attach all of this “extra” (e.g. stack-passed information, or information tracked in flow-level compiler classes) to the IRModule during compilation. Target is yet another instance of this, so attaching it to the IRModule is the medium-term correct way to expose it to the pass ARM is trying to write.
  2. The ultimate goal of this RFC is to expose the compiler’s configuration to tvmc users in a form that could be edited, serialized, and deserialized without needing to write Python or have a copy of the TVM source code. Since tvmc users have little visibility into the compiler source, it’s beneficial to eliminate any translations between the configuration they edit and the configuration accepted by the compiler. Attaching C1-style CompilationConfig (e.g. Target) directly to IRModule and referencing that as the authority on C1-style config accomplishes that goal.

Representation of Target

We now turn to the most contentious piece of debate: how should Target be represented? There are two types of Targets considered here:

  1. Leaf targets. Identifies a single TVM backend (mapping to a single DLDevice at runtime) which, when used with the broader CompilationConfig, will generate functions which depend only on that device for execution.
  2. Composite targets. Identifies a collection of Leaf Targets, one of which is considered the “host” (and therefore, which will host the Executor infrastructure).

Target is typically thought of as a parameter to tvm.relay.build. Currently, when a Leaf Target is passed to tvm.relay.build, it is promoted to a Composite Target with the “host” considered to be the same Leaf Target.
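A minimal sketch of this promotion using the existing tvm.target.Target API (equivalent in spirit to what tvm.relay.build does today):

import tvm

# A Leaf Target passed on its own...
leaf = tvm.target.Target("llvm")
# ...is promoted so that the host is the same Leaf Target, roughly equivalent
# to constructing it with an explicit host:
promoted = tvm.target.Target("llvm", host="llvm")
print(promoted.host.kind.name)  # "llvm"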

The contentious piece here was how to represent composite targets. Several options were proposed:

D0. Introduce “packaged” Target

This proposal suggests we introduce a new Target type:

{
  "kind": "packaged",
  "runtime": "crt",  
  "executor": “...”
  "target": {
    "kind": "cuda",   # the target that TIR generates to
    "host": {
      "kind": "llvm", # the codegen target for the host-side driver code
       ...
    }
  },
}

def tvm.target.packaged(
  target="cuda",
  executor="aot",
  runtime="crt",
): ...

The advantages to this option were:

  1. It allows reuse of the Target schema infrastructure specified in src/target/target_kind.cc and friends.
  2. It requires minimal effort to implement.
  3. It is polymorphic—any attribute in an IRModule where a Target was required could be either a Leaf Target or a Composite Target. This means that where some flexibility was desired, the compiler could begin with a Composite Target and, via Passes, arrive at a Leaf Target. The example given here was in deciding where a Relay function should run.
  4. Common needs such as in-memory repr for efforts such as Collage are already implemented.
  5. No modification to tvm.relay.build needed aside from adjustments to [Target.check_and_update_host_consist](https://github.com/apache/tvm/blob/main/python/tvm/target/target.py#L222)

The disadvantages to this option were:

  1. Polymorphism can lead to confusion. When an attribute exists on a part of an IRModule which could be either Leaf or Composite Target, passes need to add extra logic to determine which kind of target is present. Asserting that an IRModule is well-formed is more difficult and could be a more difficult process for the programmer to understand.
  2. It is presumed that tvmc-level configuration could be specified by more than one user. For example, a part of that configuration could be specified by the hardware vendor, and another part could be specified by the tvmc user. While it would be illegal for packaged Target to contain another packaged Target, such rules would need to be enforced by runtime logic rather than the type system. In a situation such as the one just posed, where multiple partial configurations exist and are combined to form a whole, it is vital that the user be able to understand the rules for combining partial configurations. Given the potential for infinite recursion allowed by the type system, those rules become difficult to specify.

D1. Adopt explicit LeafTarget and PackagedTarget classes

In this option, LeafTarget and PackagedTarget are represented by distinct classes which inherit from a common base class e.g. TargetBase. TargetBase is presumed to contain only infrastructure such as schema representation and in-memory repr functionality. It would not be considered to be a valid attribute type in the TVM compilation pass, with one exception: it would be valid for a single component to store TargetBase when:

  1. It is not attached as TargetBase to an IRModule seen from another Pass.
  2. It is convenient for that component to represent a flexible Leaf or Composite Target.

The proposal is sketched below:

class TargetBase:
    kind : str

class LeafTarget(TargetBase):
    kind: str
    host: Optional[LeafTarget]
    …

class VirtualDevice:
    target: Optional[LeafTarget]
    device_id: int

class PackagedTarget(TargetBase):
    target: LeafTarget
    host: LeafTarget
    executor: Executor
    runtime: Runtime
    devices: List[VirtualDevice]

The advantages to this option are:

  1. It allows reuse of the Target schema infrastructure specified in src/target/target_kind.cc and friends.
  2. It requires minimal effort to implement.
  3. It is explicit—there is no confusion between PackagedTarget and LeafTarget where attached to an IRModule.
  4. Common needs such as in-memory repr for efforts such as Collage are already implemented.
  5. No modification to tvm.relay.build needed aside from adjustments to [Target.check_and_update_host_consist](https://github.com/apache/tvm/blob/main/python/tvm/target/target.py#L222). However, we could modify tvm.relay.build to take PackagedTarget only in a future update.

The disadvantages to this option are:

  1. The kind field is present on the base class and could suggest polymorphic use in the code.
  2. Polymorphic use needs to be disallowed in code review.

D2. Adopt separate PackagedTarget and LeafTargets without any common base class

This option fully separates the PackagedTarget and LeafTarget classes:

class LeafTarget:
    host: Optional[LeafTarget]

Target = LeafTarget

class VirtualDevice:
    target: Optional[LeafTarget]
    device_id: int

class PackageConfig:
    host: LeafTarget
    executor: Executor
    runtime: Runtime
    devices: List[VirtualDevice]

The advantages to this option are:

  1. It is explicit—there is no confusion between PackagedTarget and LeafTarget where attached to an IRModule.
  2. The API to tvm.relay.build could be made the most specific of all of the options.

The disadvantages to this option are:

  1. Target schema and repr infrastructure needs to be re-implemented.
  2. It requires a big lift that may be difficult/impossible to do in an incremental way.

Decision on Target Representation

We conclude that D1 is the best approach. It has the benefits of explicit typing on IRModule and in flow-level compiler classes while retaining flexibility which could prove useful in implementing future projects which may experiment with composite targets, such as Collage. Collage will discuss these efforts shortly at the TVM Community Meeting and in an RFC.

Example of Partial Configuration

Finally, an example of partial configuration, as it had bearing on the discussion:

my-soc.yaml:
tag: my-soc-base
target:
  kind: ethos
  memory-size: 128
host:
  kind: llvm
  mcpu: cortex-m33
runtime:
  kind: c

app.yaml:
executor:
  kind: aot
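A sketch of how such partial configurations might be combined (file names taken from the example above; the merge rule shown – the application config overlaying the vendor defaults – is just one possible choice):

import yaml

# Vendor-provided base configuration.
with open("my-soc.yaml") as f:
    base = yaml.safe_load(f)

# Application-provided overlay (here: only the executor).
with open("app.yaml") as f:
    overlay = yaml.safe_load(f)

# Shallow merge: keys in app.yaml win over keys in my-soc.yaml.
config = {**base, **overlay}
print(config["executor"], config["target"]["kind"])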

Our Conclusion

The RFC as proposed should not be in conflict with the consensus we reached. We prefer the implementation of the RFC to re-use the schema and in-memory repr infrastructure developed for Target by adopting a common base class. Only the PackagedTarget from CompilationConfig should be attached to the IRModule, leaving room to add PassContext to CompilationConfig in a future RFC.

@areusch thanks for coming back to this and working to get this resolved, unfortunately I think we’ve reached an impasse, which I’ll attempt to articulate further.

This is one of the initial motivations for this change: to support moving the BYOC infrastructure further into the core compiler, as well as to create a less dynamic approach to gathering Executor/Runtime/Targets from an IRModule, given that they should be non-optional. BYOC Targets are only known before relay.build, and functions are partitioned with Target or kCompiler annotations that can only be found on the graph nodes, due to much of this being implemented in tvmc. The RelayToTIR hook walks the graph looking for such annotations to reverse engineer this information. We can also see the need for multiple Targets in the Collage RFC.

If we could use one object for both context and constraints, that would be ideal; if we require the two types of configuration to be separated then it’d be better for tvmc to combine the PassContext and CompilationConfig using a higher level of abstraction (visible only in tvmc) rather than try to provide both levels of abstraction in one object. As such, I believe the tvmc configuration object can call CompilationConfig::FromJSONNode(node) or similar to process that portion of the object, this would be an improvement over the currently proposed variant of --config which is being added without CompilationConfig.

By using a common base class of Target, this change introduces further confusion in the Target system, which I evidenced as already being problematic above; by introducing PackagedTarget and LeafTarget we introduce even further new terminology and start using Target not only as a TVM target but also as a JSON format for other objects in the codebase. Given that the Target serialisation is straightforward JSON, we should be able to encapsulate that in a JSON serialiser that enumerates the fields of configuration, rather than using Target purely for the side effect that it can export JSON in the correct format.

In summary, it’s counter to this proposal to create further Target confusion, both internally for compiler engineers and externally for users; given the length of this thread, I don’t believe this will be a short-term solution and it is likely to promote further confusion as to the definition of Target. As I’m under the impression this currently blocks the TVM release, in the spirit of moving forwards, I would suggest we consider this RFC rejected and continue with the current relay.build API.

Let me get back to the thread What is ‘target’ in TVM? for a moment. First of all, the fact that such a thread was started shows that there is a lack of clarity about what “target” really means, and the thread we’re in does little to address it. Andrew acknowledges this lack of clarity in his reply, and states that “target” is essentially the “deployment environment”. The problem is that this is a concept far too rich to express via a single definition.

I think we should reconsider the meaning of “target”.

I don’t think that anyone here opposes the idea of coalescing the “compilation configuration” into a single data object. Rather, the objections stem from the concept of “target” itself.

Target structure

There is hardware on which we want to execute some model. This hardware can consist of multiple components, each of which may have a different instruction set, a different operating system (or none at all), etc. When compiling for this hardware, the first thing we need to know is which part of the code will run on which component. This assignment can happen early on (in Relay, as is the case with BYOC), or later on in TIR. The Collage work is (IMO) quite an elegant solution, which could be transferred (conceptually) to TIR.

The key information here is the structure of the hardware in terms of which component is connected to which other component, and what the capabilities of each component are. This will decide what kinds of code can be compiled for, and executed on, each component. The other question is whether given code should actually be executed there. So, what we need to know about each component is (1) capabilities, and (2) performance characteristics. This is obviously in addition to (3) instruction set and operating system.

Components as units

The characteristics of each component are mostly self-contained and independent of the presence or absence of other components, which suggests that components should be described independently of their placement in the hardware. Usually there will be a component that is more capable than the others, and it is the one that users interact with directly, although there is no reason to assume that this will always be the case: we can consider two computers connected via a network as the piece of hardware we want to run the model on.

Architecture

I propose that we separate the concept of the architecture (i.e. the arrangement of components) from the components themselves. What we previously called “packaged target” would map to the architecture description together with the description of each component in it.

We could then apply the term “target” to describe an individual component in an architecture. We wouldn’t need composite targets, or even “host target”.

For each component we could then describe either an external compiler (for use in BYOC), or internal code generator (for TIR compilation).


Thanks @kparzysz. I think the main contention here is whether we acknowledge the need to effectively specify a group of “sub-components”.

When we treat a target as the fundamental “component”, an implicit assumption is that the particular component roughly comes with a grouped set of compilation pipelines and is not necessarily further divisible.

Logically, this view leads to first-class configuration of two things:

  • V0: The top-level thing, the PackagedTarget, which specifies the entire package architecture.
  • V1: The bottom-level thing, the LeafTarget; in some sense people might want to get rid of host to make it truly a leaf.

The two-level view makes most things easy for both V0 and V1.

The other view emphasizes that during compilation it is important to have configuration constraints at the function level that go beyond V1.

  • V2: A compositional constraint that contains “components” for a particular function.

Perhaps the following description of an example system (shown as a figure in the original post) can illustrate the subtlety here.

Imagine we have a system whose host driver is x86 and that contains three “components”: CUDA (which drives NVIDIA devices), cublas (for BYOC), and a remote XPU, which is a driver for a remote device that, from the remote’s point of view, is itself driven by a host (risc) and an accelerator (accel).

The V0/V1 point of view means that we only need to clarify the modular components – each rectangle is a component (V1) – and the global architectural setting is the top-level package configuration, V0.

A V2-level configuration corresponds to the dashed boxes in the figure, each covering the components that the config intersects. For example:

  • A corresponds to a function configuration in the most common setting, a target with a host. This is effectively any CUDA function before the host/device split.
  • B corresponds to a BYOC case, which depending on the setting can also imply CUDA.
  • C corresponds to a case where the host/device split is available structurally, and a further split of the remote function is also needed.

V2 effectively acknowledges the need to structurally represent a collection of components and the constraints needed to compile certain functions – of course, different V2 configs can have overlapping information, as all functions need, for example, the same host in order to be able to call into each other.

Customization under the V2 view is also not hard, as each sub-component grouping can have its own compilation rules (pipelines) that leverage the compilation pipelines of its own components (e.g. CUDA-with-host will leverage the host compiler pipeline and the CUDA pipeline accordingly). In the case of C, it will call into the host pipeline and the remote-xpu pipeline, which in turn decomposes and calls into the risc pipeline and the accel pipeline.

So one of the main contention points is how we do divide and conquer:

  • A V0/V1-only view means divide and conquer in a two-level way: effectively decompose V0 into V1s and solve each V1 separately.
  • A V2 view would acknowledge that during divide and conquer we have sub-steps (for certain functions) that look at a collection of components (with TargetWithHost being the most common example), and it is important to acknowledge that fact and cover these dashed cases (even though they can overlap and require consistency checks when annotated on different functions).

Might be off the topic, but I think @kparzysz has a valid point here:

If we don’t act to clarify the meaning of Target, I believe questions will continuously pop up.

I believe the architecture description you’ve described is essentially what CompilationConfig is at present (see: compilation_config.h), which contains a List[Target] and VirtualDevices mapping those Targets to Devices. In this way, multiple components can reside on a single device, which allows Collage to select the best available combination – correct me if I’m wrong @mbs-octoml :smile_cat:

I agree, we can remove most of the confusion around Target by adopting your concept of individual components rather than describing the architecture through them. Considering the Target Hooks RFC, I believe we can achieve this rationalisation and removal of BYOC, in favour of each component being described as a Target. You can see this taking form with our implementation of CMSIS-NN whereby it is actually a Target internally:

The information required to fully utilise it as a Target is currently lost in tvmc, which further motivates the need for the architecture description. Fully implementing the RelayToRuntime Target Hook would then mean that a Target can produce either TIR, TIR + Runtime module or Runtime modules directly from Relay - replacing the BYOC kCompiler flow over time.

In my view, the “architecture” would be the horizontal boxes (i.e. “components”), plus edges indicating connectivity. The graph could be partitioned into connected[1] groups of components, and each such group could be described by the union of the properties of its components[2]. This partitioning wouldn’t need to be done manually; it could also be done dynamically by algorithms trying to match/partition the code to the underlying architecture. I think this would cover all the cases V0, V1, and V2. It would also allow multiple approaches to code partitioning, whether a two-level view or function-based divide-and-conquer.

This may be nearly identical to the “LeafTarget” and “PackagedTarget”, but it makes it explicit that the partition (i.e. the “PackagedTarget”) is a derivative concept built from the architecture description (i.e. components and connections).

[1] Connected topologically, i.e. not having isolated sub-groups.

[2] Union may not be applicable to every type of properties, but the idea here is that it would be something that can be algorithmically determined.