[pre-RFC] Compilation Configuration Representation

Which leads me to believe we should default to a Config-level tag, which is the highest level available?

It would remain in the Config form on the IRModule, which means you could easily have either?

Whichever is appropriate for the use-case, having standardised access to that information means you can pick whichever is most useful to you. If you want to query the configuration for an appropriate Target and tag a function with it, that’s an implementation detail of another part of the compiler.

Serialising objects which don’t share a common base is pretty common in many projects, and since Configuration clearly encapsulates Target, it can call the Target serialisation internally? There’s no need to complicate this by making everything a sub-class of Target. And I believe what @areusch was saying is that we didn’t want anything but Target in the logs, as the rest has no effect? Therefore encapsulating that with some function for creating logs from the relevant pieces of the configuration may be useful?

@areusch and I had a long discussion yesterday offline, and he helped me understand the concern from the UX perspective: if we fold executor into target, then it’s more difficult to separate config coming from two parties, where one party implements the codegen and the other implements the executor.

On the other hand, my concern is the fragmentation of APIs. It has been a huge problem over the last 1-2 years, and we do have alternatives that avoid it.

Here is my proposal (a usage sketch in Python follows the four parts):

  • Part 1. Add Executor/Runtime fields to TargetNode:
class TargetNode {
  ...
  Executor executor;  // executor configuration (e.g. graph/vm/aot)
  Runtime runtime;    // runtime configuration (e.g. crt/c++)
};

class Executor {
  static Executor FromJSON(const String& json);
  String AsJSON() const;
};

class Runtime {
  static Runtime FromJSON(const String& json);
  String AsJSON() const;
};
  • Part 2. Add a helper API to merge Target, Executor and Runtime
Target MergeTarget(Target target_without_executor_runtime, Executor executor, Runtime runtime);
  • Part 3. Allow separate specification of target, target_host, executor, runtime in TVMC, and internally use the proposed API in Part 2 to merge, validate and normalize them into a single Target object
tvmc --target "llvm" --executor "..." --runtime "..."
  • Part 4. For the heterogeneous case, annotate the target onto each PrimFunc/RelayFunc to specify the target/runtime/executor:
@tvm.script.ir_module
class Module:

   @T.prim_func
   def tir_func():
     T.func_attr({"target": JSON-Repr-of-Target-Obj}) # with runtime & executor included
     ...

   @R.func
   def relay_func():
     R.func_attr({"target": JSON-Repr-of-Target-Obj}) # with runtime & executor included
     ...
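
To make the intended semantics of Parts 1-3 concrete, here is a minimal, self-contained Python sketch; the names mirror the C++ proposal above but are illustrative only, not existing TVM APIs.

from dataclasses import dataclass

@dataclass
class Executor:
    kind: str  # e.g. "aot" or "vm"

@dataclass
class Runtime:
    kind: str  # e.g. "crt" or "cpp"

def merge_target(target: dict, executor: Executor, runtime: Runtime) -> dict:
    """Part 2 semantics: fold executor/runtime into the target's attributes."""
    merged = dict(target)
    merged["executor"] = executor.kind
    merged["runtime"] = runtime.kind
    return merged

# Part 3: `tvmc --target "llvm" --executor "aot" --runtime "crt"` reduces to:
print(merge_target({"kind": "llvm"}, Executor("aot"), Runtime("crt")))
# {'kind': 'llvm', 'executor': 'aot', 'runtime': 'crt'}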

Could you elaborate on this? I believe this isn’t solely a UX issue but also a hygiene factor within the compiler and how we represent the data structures internally, so I would rather not overload Target with Executor and Runtime. This RFC is proposing a suitable home for information that’s relevant across the compilation, given we now have at least Executor and Runtime to include; a side effect is bringing the tvmc view back into alignment with the internals of the compiler.

It’s also worth noting that, with the current RFC for Migrating Target Attributes to IRModule, tvmc can glue this together with the relevant pieces, so from a user’s point of view they wouldn’t know how disparate the internals are; it would, however, be a headache to maintain.


Wow lots more discussion here! Thanks @junrushao for writing up our discussions. So one thing I’d like to point out is that the recursive Target approach is not more expressive than the approach proposed by this original RFC. Expressing a “contains” relation can be done equivalently well by

  • defining a recursion relationship inside the Target data structure
  • defining another structure which describes the contains relationship (akin to a join table in database theory)

The main reason I am interested in the join-table approach here is that it vastly simplifies MergeTarget as described by Junru above. And, I’d like to point out that it’s not sufficient here to merely define a function which hides the complexity under the covers. Users need to be able to understand what this function is doing because they are writing the inputs (though we are providing a tag, Command Line Configuration Files contemplates an expansion of the role of tagging to include tagging a partial configuration, as discussed earlier). I’m not sure it will be generally simple to explain how MergeTarget works as Target grows, if we adopt the general approach of trying to attach every piece of compiler config to some Target which “owns” it.
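
To illustrate the join-table idea, here is a small sketch (purely illustrative, not an API proposal): the containment relation lives outside Target, in a separate mapping.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CompilationConfig:              # illustrative only
    targets: List[str]                # flat list of codegen targets, e.g. ["cuda", "llvm"]
    executor: str = "graph"
    runtime: str = "cpp"
    # the "join table": function name -> index into `targets`
    function_targets: Dict[str, int] = field(default_factory=dict)

config = CompilationConfig(
    targets=["cuda", "llvm"],
    function_targets={"conv2d_kernel": 0, "host_driver": 1},
)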

The drawback of the flat configuration structure is that it could be more difficult to consume inside the compiler. We should discuss whether this is truly an issue and how to mitigate it.

Finally, while I do think it’s important to arrive at an expressive, understandable Target data structure as the compiler grows more complex, I think there is a tension between a Target structure which is clear to the user and a Target structure which naturally reflects the organization of the compiler (and therefore has the nice properties of clearly delineating where config should live and being easy to route in the compiler top-level). Hopefully, the organization of the compiler is also such that it’s logical to a power user interested in creating a complex config. However, here I think that UX sugar can help to compose the common target patterns such as “cuda” (which really means 1 CUDA device with an implied “llvm” host). We already do this today anyway, so I suspect it will continue to play a role in the future.

@Mousius I totally agree with making things hygienic, and believe folding things into Target is the correct and consistent approach.

First of all, the automation system relies solely on the target object to understand the code dispatching, hardware specs and runtime information. Without having the information in the Target object, the automation system won’t be aware of the full picture. For example, if we switch the executor from VM to TensorRT, the performance can be much different, so if the executor is not inside Target, the automation system will be confused and optimize for the wrong objective.

Second, as the direction we are moving towards, the Target object is guiding our IRModule-to-IRModule transformation in lowering, and IRModule-to-Module in compilation. Wrapping with an extra layer seems to architecturally change our compilation pipeline, while alternatives do exist and both seem to be equivalently expressive.

Third, the practice of folding all compilation-related information into Target has been adopted consistently in TVM. For example, we may specify the libraries to dispatch to via cuda --libs=cudnn. Similarly, LLVM’s target triple is designed in a consistent way, where we can specify libc and other environment details.
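
For reference, this precedent is visible in today’s Python API (note that Target strings spell the flag as -libs):

import tvm

target = tvm.target.Target("cuda -libs=cudnn")
print(target)  # canonical form, roughly: cuda -keys=cuda,gpu -libs=cudnn ...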

Historically, fragmentation has accumulated in TVM across layers. For example, we have different scheduling and auto-scheduling systems, slightly-different-but-not-identical and error-prone APIs for different executors, different compilation workflows between Relay, Relay BYOC and TVM, etc. Adding new top-level user-facing data structures, when an alternative exists with the same expressiveness and UX, would probably lead to more user confusion.

On the other hand, I totally agree and am aware that a graph-level compile involves the interaction of multiple parts, including device, host, runtime and executor. My main concern here is that we already have Target as a canonical spec, which is already able to express this structure without hurting UX.

What if we define a new target kind:

{
  "kind": "packaged",   # probably needs a better name, please propose new ones
  "runtime": "crt",     # the "runtime" in the proposal
  "executor": {         # the codegen target for relay functions,
                        # i.e. the "executor" in the proposal
    "kind": "vm/aot",
    ...
  },
  "target": {
    "kind": "cuda",     # the target that TIR generates code for
    "host": {
      "kind": "llvm",   # the codegen target for the host-side driver code
      ...
    }
  }
}

We can provide helpers to sugar the construction of this recursive target:

# in the tvm.target namespace
def packaged(
    target="cuda",
    executor="aot",
    runtime="crt",
): ...

In the common case, the user only needs to pass “cuda”, because we can provide good defaults. For advanced use cases, users can use the packaged API to specify their own specification for the package.
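
A usage sketch of the proposed helper (hypothetical API, assuming the stub above were implemented):

import tvm

# common case: defaults for executor/runtime are filled in
tgt = tvm.target.packaged("cuda")

# advanced case: fully explicit specification
tgt = tvm.target.packaged(
    target={"kind": "cuda", "host": {"kind": "llvm"}},
    executor="aot",
    runtime="crt",
)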


@Mousius Hello, where is this work at now?

@stoa this one stalled out last year in the midst of TVMCon preparation. We’d like to pick it back up now that we’re all back from vacation.

@junrushao based on your last comment, I’m still missing the justification as to why we should stick with a recursive Target. Some specific responses:

Can’t the automation look instead at CompilationConfig?

It would be great if you could provide some more illustration here. I think it’s hard to argue this position in the abstract. As a community, we need to make decisions based on the merits we can all observe. Is there a design document you’re intending to propose here that illustrates a situation that would be more difficult in keeping with Target?

I’m not sure I quite see how CompilationConfig changes this aspect. The set of configuration is still bundled together, just not inside something called Target.

I think that part of ensuring a clean design is making conscious decisions about code architecture and layout such that developers feel that paging in each new layer of abstraction is “natural.” That is to say, as the level of detail increases, the concepts build on previously-used concepts at higher levels of abstraction.

CompilationConfig essentially proposes that we organize the user-facing configuration by grouping it according to the logical compiler component which consumes it. This organization allows us to allude to the internal compiler workings using a user-facing configuration data structure, and allows us to potentially reduce the set of configuration required to unit test a component of the compiler. It also allows engineers to quickly make decisions about where a piece of configuration belongs according to where it’s consumed in the compiler. I would argue that each of these properties allows us to scale the compiler without triggering as many community-wide discussions about config layout.

I think we’ve motivated already that the present Target, while expressive, doesn’t compose well from a user perspective, and that it doesn’t decompose well from an autotvm log perspective. We’re arguing for an improvement in those properties here by illustrating that our alternative using the present Target structure is essentially to define a Target-specific merge() function to compose user-facing Target configs and a Target-specific filtering function to whitelist specific properties in the Target for inclusion in an autotvm log. Both of these tasks are going to significantly increase unit test complexity and load, and if we don’t get those tests right, will equivalently cause user confusion (in the form of “why can’t I specify e.g. this memory layout in a platform configuration file?”).

If my understanding is right, the CompilationConfig will collect all attributes of a module build in a single data structure - this makes sense. It also makes sense to regroup compiler options from PassContext together with the CompilationConfig as well. There may be more:

  • Specific options. For example, the schedule can be chosen differently on the same target depending on whether data are available in cache or tightly-coupled memory vs external memory with low bandwidth or relatively long latency. Same target, different config.
  • User preferences. For example, the user disables data cache for whatever reason, or prefers smaller code/data footprint even if reducing performance, which may require different schedules.

Do you also plan for this kind of “options” to be specified via the CompilationConfig?

@stoa I agree it probably makes sense to move attributes from PassContext if we do this. The tricky bit is that right now, Target (which predates CompilationConfig) is used to key autotvm tuning logs. Given this precedent, it’s reasonable to presume it would continue to key MetaScheduler and AutoTIR logs as well. However, not everything in CompilationConfig probably makes sense to use as an input key there; depending on the extent of the optimization strategy (e.g. autotvm is operator-specific), it probably makes sense to exclude some options (e.g. here we argue for excluding the executor and runtime from autotvm logs).
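
As a sketch of what such culling might look like (illustrative only; the whitelist below is an assumption, not TVM’s actual behaviour):

AUTOTVM_LOG_KEYS = {"kind", "arch", "mcpu", "mattr", "libs"}  # assumed whitelist

def cull_for_log(target_config: dict) -> dict:
    # drop fields (e.g. executor, runtime) that don't affect operator performance
    return {k: v for k, v in target_config.items() if k in AUTOTVM_LOG_KEYS}

full = {"kind": "llvm", "mcpu": "cortex-m4", "executor": "aot", "runtime": "crt"}
print(cull_for_log(full))  # {'kind': 'llvm', 'mcpu': 'cortex-m4'}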

So before we move to add more things into CompilationConfig I’d like to resolve this issue.

I think it makes sense to include a model of the available memories in CompilationConfig. These would be used to determine where buffers could be placed in memory. I’m not sure we intend to support exactly a “disable data cache” option (this is pretty target-specific), but you could accomplish that by modifying the memory model provided to the compiler. And, target-specific wrappers could be written (similar to tvm.target.Target.micro(model)) to provide a more user-friendly disable_data_cache= option here. Would that accommodate your use case?
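
For example (a minimal sketch; the key names here are assumptions, not a settled schema), a memory model entry in CompilationConfig might look like:

memory_model = {
    "sram": {"size_bytes": 256 * 1024, "read_bandwidth_gbps": 8.0},
    "dram": {"size_bytes": 64 * 1024 * 1024, "read_bandwidth_gbps": 1.0},
}
# a wrapper like tvm.target.Target.micro(model) could then translate a
# user-facing disable_data_cache=True into an edit of this model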

Thanks everyone for the discussion so far. We have already gathered a lot of information about the goals and possible intentions of the design. One thing is pretty clear: the particular choice of data structure does have a decent impact in a few areas.

Before suggesting a concrete choice, I would like us to pop up a level and think about the hidden question behind this discussion: how should the TVM compilation pipeline, or compilation pipelines (assuming there are many kinds of them), be “configured”?

To help clarify the question, a generic flow in TVM can be roughly summarized as follows:

  • We start with an IRModule (modA), possibly already optimized by the user or some previous passes
  • We run a set of transformation passes on modA to get modB
  • We then generate an rt.Module from modB, in order to get it running on a specific platform of interest (e.g. a specific board).

We can find that there are roughly three kinds of “config-like” options appearing in this flow and that can affect the final outcome.

  • A0: The specific options used in transformation (e.g. how aggressively we want to inline)
  • A1: The build “constraints” of the platform of interest; this can be the instruction set (x86 or ARM), or runtime constraints (crt, packed-api vs unpacked-api).
  • A2: Within the IRModule itself, there can be additional constraints on existing functions. Imagine that a previous pass/user decided to optimize my_func for an NVIDIA GPU, and has already generated a call to my_func via the CUDA runtime API. Then follow-up optimizations will need to respect that “constraint”.

To some extent, each of these categories is inter-correlated with the others. For example, if we have a final platform constraint that does not support a vector unit, then it means that we will need to disable vectorization.

Nevertheless, there are still two very distinct types of configuration here (a concrete sketch follows the list below):

  • C0: In the case of A0, we are mainly interested in “how”, i.e. procedurally what we do with the program. In many cases, regardless of the transformations (e.g. inlining), the final outcome can run on the platform of interest.
  • C1: In the cases of A1 and A2, we are declaring “constraints” imposed by the final platform of interest (e.g. must have a vector unit, must use the unpacked ABI). This constraint information does not dictate “how” we run the optimization, but can provide additional information for certain specializations.
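
To make the split concrete with today’s APIs (PassContext and IRModule.with_attr are existing TVM APIs; the "target" attribute key here is illustrative):

import tvm

# C0 ("how"): pass behaviour, carried by PassContext, not by the IR
with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}):
    pass  # transformations run here see this config; the IRModule does not record it

# C1 ("constraint"): platform facts, attached to the IRModule itself
mod = tvm.IRModule()
mod = mod.with_attr("target", tvm.target.Target("llvm"))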

The distinction between the two types is really important here. Coming back to the general goal of TVM: we want to enable composable optimizations of programs. Sometimes this means that some previous stages of program transformation are done by another developer, then fed into follow-up stages.

C1-type config is something that we usually want to preserve as part of the IR or log. For example, if BYOC pre-decided that a certain function should be transformed to run on CUDA, then its caller must know that constraint and call using the CUDA runtime API. Such constraints need to be reflected as part of the IR (aka the IRModule) itself, so that follow-up passes can respect and make use of that information.

C0-type config, however, does not need to appear in the IR (or intermediate data structure of interest). Imagine that we choose to separately inline my_func before handing over to the current pipeline. Because the transformation is already done, follow-up transformations do not need to know about it, as the IRModule after transformation is already self-contained.

Some of the discussions here started from a single monolithic pipeline; under that scenario alone, it can indeed be very tempting to consolidate everything into one configuration of interest. I would encourage us to look broadly at the composability perspective, since composability is the key to encouraging collaboration without pinning down every detail of a pipeline. Some of the discussions also touch on this perspective. A very natural consequence of the reasoning here is that we need to distinguish C0- and C1-type configurations at the fundamental level (folding pass-context config into the whole config might go against this principle). Of course, this does not preclude us from creating a unified option interface (e.g. at the tvmc level); just at the level of compositional optimizations and the things that we put into the IRModule, we need such a separation.

Another related topic is whether C1-type configurations can benefit from future dissection/clarification, or whether there is enough common ground here to have a consistent branding.

@tqchen thanks for these contextual remarks.

I would like to point out that on targets where we emit something closer to machine code (e.g. edge targets, Hexagon, etc.), C1-type config can actually inform C0-type config. For example, we may want to run additional passes or modify pass configuration based on the platform chosen, in order to apply platform-specific optimizations. So I am really not convinced they are fully separate.

Recording some discussion here between @mousius @manupa-arm @tqchen @junrushao and myself for posterity:

  • A concern with this proposal is that it may cause duplication of configuration on the IRModule. That is to say, if we add CompilationConfig as an IRModule attribute, there is still a need to identify for each function in an IRModule: what sub-Target shall it run under, and what other Targets may it invoke? These questions have bearing on the codegen (e.g. when considering how to implement tir.call_packed_lowered) and on the Executor (when considering the state that may need to be passed down into a function and any additional execution engines which may need to be configured in order to run a function).
  • Meanwhile we still have yet to see a clear motivating example as to why we need a recursive definition of Target. @junrushao and @tqchen could provide some follow-up to this point.
  • There has been some suggestion that autotuning log keys could be defined at a high-level as “C1-type config.” I disagree with this suggestion, as I think it’s likely that both the autotuning method (e.g. AutoTVM, MetaScheduler, AutoTIR) plus the specific runtime/executor config play into this. I think each tuning method is going to need to define a way to cull the Target or CompilationConfig in order to define what goes into a tuning log. If there is agreement on this point, I would like us to focus discussion on this RFC thread around ensuring that whatever data structure we choose here makes it easy to accomplish this culling process independently of considerations of where to place configuration.
  • Finally, this RFC started by proposing an improvement in the user-facing configuration; however, it seems that the part of it which is causing most controversy is that it affects the compiler’s internal configuration state. It may help to have a more focused RFC to collect community feedback around how we should configure the compiler at the IRModule level. Meanwhile, to address the concern of duplicating state above, it would help to see a sketch of how this proposal might suggest we replace Targets at the IRModule and function level. Originally this was left open, but I think it would help to clarify a bit further to understand the impact of this RFC.

You are right that C1-style config can inform the pipeline choices of C0-type config, but not necessarily the other way around (as covered in the discussion). This is indeed not a clear-cut distinction, but it is useful enough to think about such a separation.

Just to clarify, one of the main motivations for this is the tvmc argument --config, which can be directly translated into the CompilationConfig; however, the structural improvements made using the configuration illustrate how this provides improvements throughout the TVM stack. I didn’t mean to encourage the notion that only the tvmc flow was considered when presenting this RFC.

In TVM today, Target annotations are attached to BaseFunc as part of the main compilation flow; as this RFC does not aim to replace this mechanism, the resulting flow is largely the same as today’s.

Bear in mind, the only change that has been made is that the source of truth for the information is now gathered into the CompilationConfig and provided as part of the IRModule; everything else exists in the TVM compiler today.

I would challenge the conclusion that the distinction is important; to a user, the differentiation and different placement of information generally leads to confusion. I’m also unsure where we’re removing composition by allowing users to take an IRModule, complete with its configuration, and transfer it? This seems like an overall improvement in composability to me, rather than the current side-loaded configuration in PassContext, which has no real structure today. What this leads me to think is that we should introduce CompilationConfig and use it as a mechanism to force better design choices in the way we handle options, so they can be transferred as part of an IRModule and better aid composability in TVM.

Compositionality is a fundamental philosophy so please allow me to elaborate a bit more here. One great example to show its importance is the design of deep learning frameworks.

In a deep learning framework, the concept of layers is composable. I can take a residual layer and compose it with a softmax loss function and an optimizer. These layers are abstracted under a common interface, nn.Module. Each Module transforms an object of interest: the Tensor.

Tensor itself can be viewed as containing certain C1-type constraint information, such as the shape, the data content, and the device it resides on. Some layers (e.g. a CuDNNLayer) may only work under the constraint of a GPU device.

Importantly, there are also C0-type configurations, for example the number of hidden neurons or the number of stages of residual connections. This information is not part of Tensor, because the Tensor itself contains the minimum but sufficient information for follow-up layers to apply further transformations. Attaching more information to the Tensor would create more constraints, and possibly confusion about how to handle those attributes.
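
In PyTorch terms, a small sketch of the analogy (requires torch):

import torch
import torch.nn as nn

layer = nn.Linear(in_features=64, out_features=128)  # 128 hidden units: C0-type, lives on the layer
x = torch.randn(8, 64)                               # shape/dtype/device: C1-type, carried by the Tensor
y = layer(x)
print(y.shape, y.dtype, y.device)                    # all a follow-up layer needs to know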

Deep learning frameworks are maximally composable; we can compose a residual block with a classification loss (softmax) or a detection loss to form different models. These layers can be developed by different developers.

In summary, composability is obtained by decoupling information and clarifying the minimum but necessary information in the key data structure of interest.

Coming back to the case of TVM: IRModule is a bit like Tensor, and we are talking about putting all the configuration of the layers, as well as the device information, into a centralized place. If we are only looking at one official model (say resnet-256), this can of course help clarify all the options available, but it would restrict the evolution to a single pipeline (forcing all the options into that one place). A deep learning framework approach would be to keep only the minimum information (C1-type) in the key data structure, handling C0-type config separately. For specific applications, there might be a centralized configuration (e.g. argparse) which informs C0-type config, but that centralized config is not part of the Tensor.

In summary, putting all the configuration (both C0 and C1 kinds) into a single place will certainly improve clarity if we only have a single pipeline in mind, but the C0-type configuration brings unnecessary and sometimes confusing information. Remember that pass writers generally need to take the constraints in the IRModule seriously; having C0-type information in the IRModule would make developers wonder whether or not it should be considered (an open set as we grow the set of passes).

In the spirit of the minimum-but-sufficient principle, we want to limit the information attached to the IRModule to C1-type information. Note that this does not preclude a high-level single pipeline from building a centralized config which then propagates to the lower-level mechanisms. I believe that was the original intention, and the main reason I brought up compositionality is that at the level of IRModule and passes we need to consider such a separation carefully.

Since the topic of compositionality is quite important, let us also study a few more examples:

Example 0: Imagine that we want to try the following. Stage 0: sweep different unroll factors or vectorization factors in a loop, benchmarking each and comparing the results; stage 1: send the winner to the follow-up lowering optimizations with another set of configs. In this case the C0-type config (e.g. unrolling factor) in stage 0 is not relevant to stage 1. Given the collection of different choices explored in stage 0’s loop, it is also not entirely desirable, or even possible, to centralize the configs into a single data structure for this case.
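
A sketch of Example 0 (assuming a user-supplied benchmark() helper; tir.disable_vectorize is a real PassContext key, used here as the swept knob):

import tvm

def stage0(mod, benchmark):
    best, best_time = None, float("inf")
    for disable_vec in (False, True):  # the C0-type choice being swept
        with tvm.transform.PassContext(config={"tir.disable_vectorize": disable_vec}):
            lowered = tvm.lower(mod)   # transform under this config
        t = benchmark(lowered)
        if t < best_time:
            best, best_time = lowered, t
    return best  # self-contained; stage1 never needs to know which knob value won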

Example 1: Imagine that people want to build alternative compilation pipelines with a different set of configs (e.g. running through quantization then building). In this case it may not be desirable to couple the two types of config together, since each pipeline may only care about one set.

We can see that most of the examples come from alternative optimization pipelines and choices that may differ from the current build pipeline. These are, however, important cases to support, so they can either be incorporated into future pipelines or simply enable more preprocessing choices that compose with the build pipeline.

Compositionality does not imply that we need to ask everyone to use the same config for all possible pipelines. Instead, the main implication is to clarify what is minimum but necessary (and needs to be considered by all passes/pipelines), while leaving out the other parts, so that flexibility is left to others.

Coming back to the particular topic of this RFC: I think we acknowledge that it could be useful to have a centralized config for a single tvmc pipeline, which can help to bring clarity. We also agree that the discussion is not about changing the set of information, but mainly about how we organize the information. The main point of compositionality is to carefully dissect the two kinds of configuration when it comes to putting information in the IRModule, and how the two kinds of configuration interact with passes.


Let me try to summarize the conversation as I understand it; please feel free to correct me if it’s wrong.

It mainly boils down to the following point :

What should be attached to an IRModule and what shouldn’t? According to @tqchen’s description above, it should be C1-style “constraints” and not C0-style “how”. The argument is that C0-style information is configuration for passes, not broadly applicable to all transform passes, and thus confuses pass implementations with the choice of what to do with it.

According to the definitions of C0 and C1, the above information should be classified as C1. Therefore, are we all agreeing that it is reasonable for it to be attached to the IRModule?

As a first step, if we all agree, it would be great to unblock ourselves to use a C1-styled CompilationConfig attached to the IRModule to proceed in the short/medium term. @tqchen @Mousius @areusch: an explicit response to this question is highly appreciated :slight_smile:

Now, coming back to today’s state of TVM, C0-style broadly refers to PassContext; I’m not aware of anything else. Therefore, the current argument is against putting the C0-styled PassContext either a) as an IRModule attribute or b) as part of the C1-styled CompilationConfig that is already agreed to be attached to the IRModule.


Then, for future work, we should think about the “necessity” of keeping the C0-styled PassContext as a side channel (or in fact a singleton). IMHO, this contradicts slightly with @jroesch’s proposal of integrating the whole compilation pipeline as IRModule → IRModule transformations, by committing ourselves to maintain a side channel driven by the need to separate C0-styled and C1-styled information.

Therefore, it would be great to explore options for how to attach/package all possible information that “might” be required by all passes (C0+C1), not just the “minimum” (C1). We thought this could be done by attaching it to the IRModule, so that we could export without requiring any other side channels. However, we are open to hearing alternatives.

Thanks @manupa-arm, these clarification points are helpful.

First of all, the discussion of C0-style and C1-style is at the design-principle level and does not tie into the actual choice of data structure or implementation. In some cases they could imply certain preferences, e.g. calling a C0+C1 combo a CompilationConfig certainly makes sense, and if it is a C1-only thing, something in the target namespace is more natural. But let us first separate these concerns and not talk about the choice of data structure.

The main design principle is as follows:

At the IRModule and individual pass (IRModule → IRModule) configuration level, C1-type config should be attached to the IRModule, while C0 should be handled separately (and not attached to the IRModule).

As a first step, if we all agree, it would be great to unblock ourselves to use a C1-styled CompilationConfig attached to the IRModule to proceed in the short/medium term. @tqchen @Mousius @areusch: an explicit response to this question is highly appreciated

We agree that the above information (executor, runtime) is part of the C1-style configuration, and that it is helpful to introduce a data structure to store it.

We were originally discussing whether to use target.packaged (as target was the namespace used for constraint-style configurations, consistent with the previously accepted composite target RFC) or to introduce a separate data structure, CompilationConfig (this RFC).

Regardless of the choice, either data structure is going to unblock the following features, since both proposed data structures are isomorphic.

The main intention of the last few posts, however, is to clarify the design principle, since this has a bigger impact than the choice of data structure.

IMHO, this contradicts slightly with @jroesch’s proposal of integrating the whole compilation pipeline as IRModule → IRModule transformations, by committing ourselves to maintain a side channel driven by the need to separate C0-styled and C1-styled information.

The discussion of separation does not advocate for the use of PassContext or any particular data structure; it mainly focuses on the importance of separating the two types of information and keeping only the C1 style in the IRModule.

This does not contradict the IRModule → IRModule transformation as the whole pipeline; it is actually a faithful realization of it. The main motivation of IRModule → IRModule transformation comes from the need for compositionality. We have already had extensive discussions on this point in some of the previous posts.

The IRModule → IRModule principle originated from the Tensor → Tensor principle in deep learning system design, where the key data structure (Tensor or IRModule) contains all the necessary information that needs to be carried through the sequence of actions. It does not imply, however, that all the configurations of previous actions (which are irrelevant, and sometimes impossible to list comprehensively if we are in a loop) should be recorded in the data structure, as we can see in the example of deep learning framework design.

Finally, I want to say that there can be a need for a C0+C1-style config (let us call it C2) in a high-level application (one realization of a pipeline) that composes the passes together. For example, there can be a train_resnet application that comes with an argparse option set containing the learning rate, layer configurations, as well as the device we want to run on. The C2 object then separately configures the C0 and C1 configs using the lower-level mechanisms (where they are kept separate) and drives the end-to-end compilation.

My read of the RFC is that there is a desire to have something like that. To follow the precedent of deep learning framework modularization, what that implies is to keep the C0 and C1 mechanisms, perhaps introducing a C1 target.packaged object as the data structure attached to the IRModule, and then also introducing a C2 CompilationConfig that ties to only one compilation pipeline (perhaps the default one used by tvmc) at a different abstraction level. The C2 config would populate the C0- and C1-style configurations.

This would indeed bring a bit more duplication. But, as in the case of deep learning frameworks, such duplication is necessary for modularity at different abstraction levels, mainly because centralizing everything in C2 is neither sufficient for all possible composable pipelines (see the previous examples on search loops and alternative paths) nor the minimum that pass writers need, which is what increases composability.

This is a case where precedent designs (deep learning frameworks) are really mature and can serve as very good reference points. It is also a case where the lessons of deep learning frameworks show that such a choice is critical to the success of the framework as a whole, so it would be good for us to take that into account.

Hi @tqchen,

I appreciate your reply with further clarifications, though I’m struggling to reconcile them with the original RFC presented here.

Although they’re isomorphic in structure, the packaged Target has no proven advantage and serves to increase the overall complexity of any additional work in TVM, due to the considerations of a potentially recursive Target. I would need a strong motivation to implement such a complex design, given this RFC aims to reduce complexity by creating explicit structures.

As in my previous post, I didn’t mean to encourage the notion that only the tvmc flow was considered when presenting this RFC. C2 is where tvmc is right now, working around the limitations of the TVM API: for graph partitioning, tvmc creates its own pipeline on top of TVM to make the feature simple to use. The RFC is therefore aiming to bring some of the learning from tvmc back into the main compilation flow, with some of the advantages listed in the original post pointing towards the construction of a configuration, whether it be C0 or C1, for use in any compilation flow.

Taking a step back, if we consider this RFC to be adding the C1 type of configuration, is the requirement for moving this forwards that we must also define a mechanism for C0 configuration? Or can we leave dealing with the global state of PassContext to a future RFC, wherein we can discuss how to better manage C0 configuration?

Furthermore, if we accept that C1 configuration can be attached to an IRModule, what prevents us from proceeding with the CompilationConfig initially suggested, given we have yet to see a clear motivating example of why we need a recursive definition of Target?

The advantage of the packaged Target has been extensively discussed in our previous posts.

To clarify: in production there exist non-trivial use cases for Target. For example, there might be a CPU + DLA + GPU case, where Target does need to be expressive enough to represent them all. As the simplest example, the config of a Jetson board is:

TVM_REGISTER_TARGET_TAG("nvidia/jetson-agx-xavier")
    .set_config({{"kind", String("cuda")},
                 {"arch", String("sm_72")},
                 ...,
                 {"dla", SomeOtherTarget},
                 {"host", SomeOtherTarget}}});

Under our general principle, we do need C1-style configuration for mixed-device compilation. Notably, this configuration could differ from the IRModule-level annotation if we intend to restrict the possible constraints during optimization.

Second, in BYOC, there is a real need to pass in additional recursive configuration. For example, some BYOC targets need additional configuration; as TQ mentioned previously, the composition below is a possible need:

- host: x86
- vdevice0: byoc-myawesome-cuda
    - host: x86
    - runtime: cuda-graph
    - vdevice0: cuda
    - library: tensor-rt
- vdevice1: cuda
- runtime: vm
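
Note that today’s Target already accepts a nested dict for the host, so a sketch of this direction is possible with the existing API (the vdevice/library/runtime keys above are part of the proposal, not the current schema):

import tvm

target = tvm.target.Target({"kind": "cuda", "host": {"kind": "llvm"}})
print(target.host.kind.name)  # "llvm"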

Overall, the design argument here is a subjective matter. As we can see in the discussion, introducing a separate class for single- and multi-device constraints also brings additional design and engineering complexity for logging/tagging and the overall compilation pipeline, so it’s really a trade-off.

Notably, a packaged Target doesn’t mean it is unstructured or encourages arbitrary recursion; we can of course enforce schema validation to make it structured and ensure correctness.

Additionally, we do see benefits to having a common base class for the C1-type data structure. From the automation point of view, we need to record the constraint in different cases, and having a common base class (Target) would help with tuning-log serialization for both single- and multi-device functions. Furthermore, it brings an additional benefit in terms of consistency of representation. As a real-world use case, if the constraint of a function is annotated as DLA + GPU, it’s relatively easy to narrow it down to a GPU-only function if we use the common Target class here; in this case, it’s helpful to represent the DLA + GPU and GPU-only constraints with a common data structure for consistency. The same idea applies to the host/device split pass in TIR.

Finally, we would love to reiterate the advantage of the packaged Target: from our PoV, it helps with more generic use cases and maintains the clarity of TVM’s compilation pipeline.

Thanks @Mousius. I don’t think we need to settle on mechanisms for C0 configuration. The original intention of the RFC appeared to be C1-style, but then the discussion drove it towards a C0 + C1 style.

So I agree that we should figure out the data structure choice for C1 style configuration in this post.

We all understand and agree on the possible advantages brought by a single-point setup. It’s just that there are also other considerations in terms of compositionality, consistency, and extensibility, as some of the posts brought up.

The suggestion of a C2-style CompilationConfig that translates into C0 and C1 styles at the lower level is actually meant to serve as a reconciliation here, learning from the prior precedent of deep learning frameworks.