Great discussions so far. I think we have a good picture of what the choices are in terms of the data structures (the As), and we hold different preferences among those choices.
Before we jump into particular preferences, it is helpful to look at the different scenarios in which we use the data structure and analyze them objectively from the following angles:
- The user-facing interface (UX)
- The feasibility of each kind of solution given the needs
- Possible pros and cons
Notably, the final preferences usually do not stem from disagreements about the objective analysis. For example, I think we all agree that a recursive structure is more expressive, and that an explicitly typed config is slightly more convenient than a specific target kind with the same schema for the particular use cases that involve a two-level structure.
Usually our preference is a result of how we weigh the different needs and pros and cons. Additionally, we may each have a specific need (use case) in mind. To make a good choice, we need to look at a broad class of needs. The bottom line is that hopefully we can agree on the objective needs and analysis, then use them as the basis for talking about the choice (which involves preference).
It is also very helpful for us to review the previous RFCs that led to the currently suggested design of Target and Composite.
N0: Common use case, single device with host
While a lot of the motivation for Config comes from heterogeneous devices, which is important, the most common use case we have right now is still the single-device scenario. Of course, as with CUDA, a single device usually implies the need for a host driver. So one of the key needs is to make this type of usage as streamlined as possible.
From the user's point of view, the program itself is as plain as "CUDA". However, there are two different states of functions during the phases of transformation:
- E0: A mixed host-device program

  fn () {
    // host part driving the CUDA kernels
    b = alloc("global", size)
    launch cuda kernel 1 {
    }
    launch cuda kernel 2 {
    }
  }

- E1: A device program

  launch cuda kernel 1 {
  }
Both E0 and E1 can appear in different phases of transformation. From the users' point of view, it is extremely helpful to be able to attach attributes that specify the constraints on both kinds.
In the current convention, E0 is achieved via the host field in a Target, while E1 is simply a device program. Under the two-level config view, the host of E0 would be obtained from the surrounding Config (via the target_host field).
- From the UX's point of view, directly passing in a Target with an optional host field presents a simple API for this particular use case.
- Having host under Target makes the constraint more explicit at the function level and differentiates E0 from E1.
- For more complicated heterogeneous cases, having host under Target would cause duplication, in which case a consistency checker and updater is needed.
- Having an explicit host in the Target can help the case where there are multiple host environments, although this is rare.
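To make the two options above concrete, here is a minimal sketch of the two representations. The class and field names are illustrative only; they do not match the actual TVM classes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Target:
    kind: str                        # e.g. "cuda", "llvm"
    attrs: Dict[str, str] = field(default_factory=dict)
    host: Optional["Target"] = None  # present for E0, absent for E1

@dataclass
class Config:
    target_host: Target
    targets: Dict[str, Target]       # device targets keyed by name

# Option A: host embedded in the Target, so E0 is a single object.
e0 = Target("cuda", host=Target("llvm"))
e1 = Target("cuda")  # plain device program (E1), no host

# Option B: host supplied by the surrounding two-level Config.
cfg = Config(target_host=Target("llvm"), targets={"dev": Target("cuda")})

assert e0.host is not None and e1.host is None
```

Option A keeps the constraint local to the function; Option B avoids duplicating the host across many device targets, which is where the consistency-checking concern comes from.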
I will skip the personal preference comments for now.
N1: Embed into other systems
In a lot of cases we are thinking about generating a program for which TVM takes full control of allocation, device management, and so on. So there can be a temptation to enforce precise heterogeneous device info everywhere. On the other hand, at the PrimFunc level, we also need to be able to embed into other systems and take decisions from the calling environment. For example, in most CUDA op-level cases, we generate functions that work on any GPU and switch the context based on the device_id and device type taken from the arguments.
For this particular need, we need to keep the target specification simple at the boundary level, involving only host and device information, while leaving some of the device-planning information to the driving part.
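A self-contained sketch of this boundary contract (all names here are hypothetical, not the TVM runtime API): the generated function takes the concrete device from its arguments, so the only information baked in at codegen time is the device kind.

```python
class FakeRuntime:
    """Stands in for the calling environment that owns device management."""
    def __init__(self):
        self.current_device = None

    def set_device(self, device_id):
        # The embedding system, not the generated code, decides which
        # physical device this id maps to.
        self.current_device = device_id

def generated_op(runtime, device_id, x):
    # Works on "any GPU" of the right kind: the boundary specification
    # is just host + device kind; device_id comes from the caller.
    runtime.set_device(device_id)
    return x * 2  # placeholder for the actual kernel launch

rt = FakeRuntime()
out = generated_op(rt, device_id=1, x=21)
assert rt.current_device == 1 and out == 42
```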
N2: Tagging and quick reference
The ability to tag and reference a configuration as a whole is one of the key designs of the Target system. From the user's point of view, they do not necessarily care about codegen-level concepts. Instead, it is important to present the target environment as a whole. See the following example tags:
- aws/c5: cloud instance name
- arm/rasp4b: soc board name
- nvidia/jetson-nano:cuda: soc board name
From the users' point of view, what they ultimately care about is what they want to deploy to. Being able to refer to the setting (or part of the setting) through tagging is important for that experience.
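The idea can be sketched as a simple tag table; the table entries and attribute names below are illustrative assumptions, not the real TVM tag registry.

```python
# Hypothetical tag registry: a tag resolves to a whole target
# configuration in one step, hiding codegen-level detail.
TAG_TABLE = {
    "aws/c5":                  {"kind": "llvm", "mcpu": "skylake-avx512"},
    "arm/rasp4b":              {"kind": "llvm", "mtriple": "aarch64-linux-gnu"},
    "nvidia/jetson-nano:cuda": {"kind": "cuda", "arch": "sm_53"},
}

def target_from_tag(tag):
    # Return a copy so callers cannot mutate the registry entry.
    return dict(TAG_TABLE[tag])

t = target_from_tag("arm/rasp4b")
assert t["kind"] == "llvm"
```

The user names the deployment environment ("arm/rasp4b"); the mapping to codegen settings stays behind the tag.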
N3: Represent a complicated heterogeneous environment
One of the main motivations for the second-level Config is to represent a more complicated heterogeneous environment, different from N0. In such cases, there is a desire to propagate some of the (virtual) device and memory scope information across functions.
For this particular use case, an explicit config offers a clear structure. A specific target kind with a schema that mirrors the config can also implement the same feature.
One possible choice is to model everything this way, since complicated cases can cover simpler setups through another layer of wrapping. However, fitting simple, common scenarios into a two-level setting may bring additional UX complications, especially if explicit construction is required.
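The two encodings mentioned above can be sketched side by side; the field names here are assumptions for illustration, not an agreed schema.

```python
# The same heterogeneous setup, first as an explicit two-level config...
explicit_config = {
    "target_host": {"kind": "llvm"},
    "targets": {
        "gpu0": {"kind": "cuda", "arch": "sm_70"},
        "cpu0": {"kind": "llvm"},
    },
}

# ...and then as a "composite" target kind whose schema mirrors that
# config, so Target alone can express the same structure.
composite_target = {
    "kind": "composite",
    "target_host": {"kind": "llvm"},
    "targets": explicit_config["targets"],
}

# Both carry the same information; the difference is whether the
# two-level structure lives in a dedicated Config type or in a target kind.
assert composite_target["targets"] == explicit_config["targets"]
```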
N4: Ability to decompose
Throughout compilation and transformation, we often decompose problems into smaller problems. Functions in an IRModule can represent each level of such a decomposition; for example, we decompose:
- A multi-machine program into single-machine ones
- A multi-device program into single-device, host-driving functions that still invoke kernels through PackedFunc (which contains a host part)
- A single-device, host-driving program into device and host functions

In the BYOC flow:
- A mixed-BYOC-strategy program into multiple functions, each with its own BYOC target
- There can be a need for a downstream BYOC to further decompose that into a graph-level executor config and single-kernel codegen settings

Throughout these transformations we decompose, and likely also tag, the functions with constraints (that a particular function must satisfy). Having a common base for these constraints, across functions at different granularities, is helpful, since the framework needs to support and remain future-compatible with these decompositions.
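A minimal sketch of one decomposition step (hypothetical structures, not TVM's IRModule API): splitting a mixed host/device function into two functions whose constraint attributes share a common base.

```python
def split_host_device(module):
    """Rewrite each mixed host/device function (E0-style) into a host
    function and a device function, each tagged with the constraint it
    must satisfy."""
    out = {}
    for name, fn in module.items():
        target = fn["attrs"].get("target", {})
        if target.get("host"):
            # Host function keeps the host constraint...
            out[name + "_host"] = {"body": fn["body"][0],
                                   "attrs": {"target": target["host"]}}
            # ...device function keeps the device constraint (E1-style).
            out[name + "_kernel"] = {"body": fn["body"][1],
                                     "attrs": {"target": {"kind": target["kind"]}}}
        else:
            out[name] = fn  # already single-constraint, leave as-is
    return out

mod = {"main": {"body": ["host_part", "device_part"],
                "attrs": {"target": {"kind": "cuda", "host": {"kind": "llvm"}}}}}
new_mod = split_host_device(mod)
assert set(new_mod) == {"main_host", "main_kernel"}
```

Because both output functions carry the same shape of constraint attribute, the same mechanism can apply again at coarser granularities (multi-machine, multi-device, BYOC).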
N5: Automation needs
This ties back to N4. We need a common base config to indicate the constraints that the auto-tuning environment presents. Our most common case right now is the single-device-with-host setting; in such cases, the target itself is only needed as part of the log.
If we see the automation need as the need to search over transformations of a program, subject to certain "target constraints", then naturally we will extend the scope to handle functions at different levels (related to N4). Graph-level tuning is one such example.
Considering the need to unify the automation infrastructure, it is certainly very helpful to have a common data structure that represents "target constraints" at different levels (which can include executor configurations), so that there is one serialization format and a relatively streamlined mechanism to handle all transformation cases (of a single-device program, as well as the executor/device mixing case).
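As a sketch of the "one serialization format" point, the same JSON representation can carry a kernel-level constraint and a graph-level constraint that includes an executor configuration. The field names are illustrative assumptions, not an agreed schema.

```python
import json

# A kernel-level constraint: just device + host, as in N0.
kernel_constraint = {"kind": "cuda", "host": {"kind": "llvm"}}

# A graph-level constraint: wraps kernel constraints and adds an
# executor configuration, as in N3/N4.
graph_constraint = {
    "kind": "composite",
    "executor": {"kind": "vm"},
    "targets": [kernel_constraint],
}

# One round-trippable format covers both levels, so tuning logs for
# kernels and whole graphs can share a single mechanism.
for c in (kernel_constraint, graph_constraint):
    assert json.loads(json.dumps(c)) == c
```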