Establish TVM Unity Connection — A Technical Strategy

Co-authored with @denise, @jwfromm

Also thanks to helpful inputs from @jknight @jroesch @areusch @yuchenj @junrushao @psrivas2 @sunggg @slyubomirsky @yongwww @ziheng and many other members of the community.

Summary

At the end of 2021, the TVM community outlined a plan for multiple significant improvements to the project under the banner of TVM Unity. Three of the four improvements are already a part of the TVM project (the FFI interface, TensorIR, and MetaSchedule) while Relax (the fourth) is reaching a state where it is time to talk about connecting them all together. This RFC outlines a plan to connect all four components together for the first time in the TVM project to enable a cohesive (but optional) end-to-end flow in TVM. We are calling this the “TVM Unity Connection” milestone.

This RFC aims to clarify the direction of the TVM Unity effort by:

  1. Reiterating TVM Unity's goals and providing an extra level of detail on its unified composable flow
  2. Describing the TVM Unity Connection milestone
  3. Demonstrating an example workflow made possible through this milestone
  4. Considering possible future opportunities beyond the TVM Unity Connection milestone

We are excited to bring this new path forward in an incremental way so we can gather feedback from the community. We want to note that this proposal is also carefully designed to minimize disruption to the rest of the project and community.

To help explore and demonstrate the unity connection experience better, we also recommend checking out the MLC course series, which contains several examples of what is possible with the TVM Unity Connection milestone and teaches the material from the basics.

Relation to Upcoming Technical RFCs

This strategy RFC describes our process and desired outcomes in more detail as we work to connect the unity infrastructure together. We also outline the underlying goals as we establish the connection while minimizing disruption. It is broader than and supplementary to the upcoming technical RFCs about specific components including:

  • Relax upstreaming RFC describes the technical details of Relax to enable a unified composable flow in unity connection.
  • tvm.compile: Towards a unified, stable compilation flow for TVM Unity describes our long-term strategy to consolidate default build flows beyond the Unity Connection milestone
  • Bringing support for Relax to AOT describes a plan to support Relax in the AOTExecutor build flow.
  • Future RFCs for additional opportunities enabled by unity connection, such as memory planning, customized fusion, and flexible library integration.

Our goal is to bring TVM Unity to the community incrementally with minimized disruption. This RFC is the first step. We will work together with the community on these future RFCs as we incrementally bring unity to TVM.

1. Unity motivation and goals

TVM Unity describes a change in philosophy around the way developers interact with Tensor-based workloads through compilation frameworks. In order to support this new philosophy in TVM, four key components were needed (FFI, TensorIR, Meta-Schedule, and Relax).

It is straightforward to understand the motivation for each of those pieces by reading their respective RFCs. However, it can be much harder to see how these pieces interact to bring additional benefits and resolve challenges we faced in the past. It is therefore helpful to set a concrete milestone to work towards, based on an initial set of user workflows.

The main goals of TVM unity are:

  • Unifying abstractions: TVM Unity will focus on letting AI applications fluidly cross the boundaries between operator graphs, tensor programs, and hardware primitives while supporting both static and dynamic shapes. In other words, rather than forcing a user to either describe things at the graph level (TF-Graph, ONNX) or at the loop level (CUDA, C++), unity enables a user to flexibly describe, transform, and compile ML computations with a single set of abstractions and interfaces.
  • Encouraging interaction and collaboration: TVM Unity brings first-class TVMScript and Python API support, encouraging broader interaction among ML engineers, ML scientists, and hardware providers through the interactive API.
  • First-class automation: TVM Unity will expose all these cross-layer interactions for automated optimization and enable effective collaboration between domain experts and the automation system.

Beyond the three key elements stated above, one common theme of TVM Unity is providing a unified and simplified approach to flexibly customize and enhance ML compilation. This desire comes from listening to users in the TVM compiler community who were struggling with the maintenance and cognitive burden arising from specializations and assumptions that became baked into the pipeline over time to support translations between different stages of compilation.

Unified Composable Flow

Unity Connection aims to provide a way to simplify the overall flow, with the following principles (shown in the figure below):

  • The compilation process and automated search center around the IRModule, which abstracts graph (Relax), loop-level (TensorIR), external (FFI), and search-primitive (MetaSchedule with AutoTensorization) concepts behind a single interface.
  • Transformations are defined as functions that take in one IRModule and produce a different IRModule, as sketched after this list. Transformations are designed to be flexible, which avoids the need for special infrastructure built for particular features.
  • TVMScript will serve as a centralized tool for inspecting and interacting with IRModules during the overall set of transformations.
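
As a concrete illustration of the transformation contract, here is a minimal sketch of a no-op IRModule-to-IRModule pass built on TVM's existing pass infrastructure (the pass body is a placeholder; real passes would rewrite the module):

import tvm

@tvm.transform.module_pass(opt_level=0)
def my_transform(mod: tvm.IRModule, ctx: tvm.transform.PassContext) -> tvm.IRModule:
    # Inspect or rewrite any function in the module (graph-level or TIR) here;
    # the pass returns a (possibly new) IRModule rather than mutating global state.
    return mod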

This is made possible by IRModule's unified abstractions, which work together. We will include examples and more discussion in the upcoming sections.

Unity connection alignment in the community

There is already a lot of excitement around TVM Unity and what it can enable, because the TVM community is already benefiting from several exciting foundational pieces of TVM Unity that have reached the TVM repository. The community collectively shipped TensorIR to make TVM more general and to fix issues with the inherited HalideIR. We then took innovations from AutoTVM and AutoScheduler and evolved them into Meta-Schedule. To enable better user interaction, we introduced TVMScript to enable Python-first user interactions.

The final component we are evolving is Relax (Relay Next), which has been developed outside of the TVM main tree by academic and industry collaborators in order to minimize disruption. We think that Relax is now reaching a level of maturity where it would benefit from wider attention, use, and feedback from the rest of the community.

Simultaneously, bringing Relax as the final piece of TVM Unity into the upstream TVM repository can now enable an end-to-end optional flow that fully brings these pieces together and makes them accessible to even more members of the TVM community.

2. Unity Connection milestone

This RFC proposes the TVM Unity Connection Milestone as an initial step to connect the unity components together in an optional end-to-end flow in TVM.

This means that users will be able to try out the unity flow by constructing IRModules that contain interactions among graph, TIR, and FFI. They can then interactively transform them and build a runnable module. This process makes the overall compilation flow more customizable and accessible to different users and developers.

At TVMCon in December 2021, we sketched a conceptual design of TVM Unity. Today we realize it as a concretely runnable example and use it to summarize the unity connection capabilities:

import tvm.script
from tvm.script import tir as T, relax as R

@tvm.script.ir_module
class MyIRModule:
    # Define a TIR based operation.
    @T.prim_func
    def tir_mm(X: T.Buffer[(n, d), "float32"],
               W: T.Buffer[(d, m), "float32"],
               Y: T.Buffer[(n, m), "float32"]):
        for i, j, k in T.grid(n, m, d):
            with T.block("body"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    Y[vi, vj] = 0
                # A2: can be mapped to HW primitives
                Y[vi, vj] += X[vi, vk] * W[vk, vj]

    @R.function
    def relax_func(x: R.Tensor[(n, d), "float32"], w: R.Tensor[(d, m), "float32"]):
        with R.dataflow():
            # A0: Interaction between "graph" and TIR
            lv0: R.Tensor[(n, m), "float32"] = R.call_tir((n, m), tir_mm, [x, w])
            lv1: R.Tensor[(n * m,), "float32"] = R.flatten(lv0)
            gv0: R.Tensor[(n * m,), "float32"] = R.exp(lv1)
            R.output(gv0)

        # A1: Interaction between "graph" and FFI
        R.call_packed("custom_inplace_update", gv0)
        return gv0
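
An IRModule like this can then be transformed and built into a runnable module. Below is a minimal sketch, assuming the VM build API from the relax development repo (exact names may shift as it is upstreamed); x_nd and w_nd are tvm.nd.NDArray inputs:

import tvm
from tvm import relax

ex = relax.vm.build(MyIRModule, target="llvm")  # compiles the Relax and TIR functions together
vm = relax.VirtualMachine(ex, tvm.cpu())
y = vm["relax_func"](x_nd, w_nd)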

Specifically, unity connection enables the following interactions (annotated in the code above):

  • A0: Interaction between graph and Tensor Program (TensorIR)
  • A1: Interaction between graph and FFI
  • A2: Interaction between Tensor Program and Hardware primitive (through tensorization).

Enabling these interactions means:

  • We can construct these interactions during model ingestion and use them as an interface for customization — e.g. directly call into both topi (TVM’s operator library) and TensorIR (flexible loop level syntax) in graph construction.
  • We can inspect and transform IRModules that contain those elements to quickly bring in new optimizations.
  • We can build optimization passes by transforming the IRModules and analyzing interactions between different types of abstractions (e.g. graph/TIR).

We have been experimenting with some of these efforts and confirmed that such a flow is possible through a combination of Relax, TensorIR, FFI, and AutoTensorization. While some of these capabilities were technically possible before TVM Unity, they would have required extensive modifications to TVM internals and increased project complexity. Additionally, many capabilities, such as interaction with symbolic shapes and direct calls into topi, require the fundamental rethinking that led to TVM Unity.

Following best practices from other projects as they evolve, Relax will, as a new optional component, be isolated in its own namespace to minimize disruption to the current flow, with organic interfaces to interact with the rest of the unity components. Also, we do not plan to change the current default build pipeline (tvm.build and relay.build), in order to minimize disruption to existing users.

To help stage the set of improvements while minimizing disruption to the current flow, we propose introducing unity connection as an optional compilation flow of TVM, diagrammed as:

In this proposal, the “default build” (the flows that run through tvmc and relay.build) that TVM users know today will remain unchanged. TVM unity connection will be added as an optional vertical component.

This change is invisible to current users of the existing flow: the default build will continue to work as-is. In the meantime, we give our community members (many of whom are excited about TVM Unity) the ability to leverage unity connection early as an optional vertical, enabling use cases that may not be supported today (such as dynamic shapes, direct topi integration, and cross-abstraction optimizations).

3. Example: Simplified Automatic Scheduling Integration

Here we give one example of how unity connection can help simplify our current flow. Automatic scheduling refers to the ability to take a workload, run a search over a composed search space, and replace the original program with an optimized result.

Currently, we need to build a set of non-standard infrastructure (TECompiler) and hooks to be able to extract the tasks and replace the final code with an auto-scheduled version. TECompiler introduced a separate set of mechanisms outside of the normal pass infrastructure and served as a monolithic point between graph and TIR. As we started to add new capabilities (such as BYOC and auto-scheduling), we had to introduce new hooks to this layer, creating a place of growing complexity.

Because unity connection enables TensorIR and graph in the same IRModule, we no longer need TECompiler and its related patches, which become less maintainable as we expand the capabilities of scheduling variants. Instead, automatic scheduling can be achieved through two passes (a code sketch follows the list):

  • TaskExtraction: take an IRModule, extract relevant tasks to be tuned, and tune them.
  • ApplyHistoryBest: take an IRModule, and replace the original TensorIR function with the tuned version.
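
The following is a minimal sketch of the second pass under simplifying assumptions: here `database` is a plain dict from TIR function name to a tuned replacement function, whereas the real MetaSchedule passes consult a tuning database:

import tvm

def apply_history_best(mod: tvm.IRModule, database: dict) -> tvm.IRModule:
    # Collect replacements first, then update; both steps are ordinary
    # IRModule operations rather than special-purpose compiler hooks.
    updates = {gv: database[gv.name_hint]
               for gv in mod.get_global_vars()
               if gv.name_hint in database}
    for gv, tuned in updates.items():
        mod.update_func(gv, tuned)
    return mod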

This approach allows us to organically interleave automatic scheduling with other transformations where this is helpful (e.g. fusion, layout rewrites, and BYOC), and also enables us to optionally skip automatic scheduling for some subgraphs in cases where we want to lower to specific library functions. Such organic interleaving enables future optimization opportunities, such as layout rewrites that are informed by automatic scheduling. We also encourage readers to check out this MLC lecture to see how auto-scheduling and the computational graph interact.

The unity approach also removes hacks needed for the automatic scheduler. Right now the auto-scheduler needs to give feedback to the computational graph about possible layout rewrites on model weights. Because previously we only had high-level graph ops, we had to introduce special attributes to each high-level operator to hint at the change. Such hacks make it hard to bring generic weight layout rewriting to a wide range of operators. Unity connection simplifies the approach by expressing the layout transformations through rewriting call_tir in the main function.

The following example shows how meta-schedule can naturally signal future layout transformations after applying the tuned decision. Meta-Schedule first introduces a layout transform on B, producing Bphysical:

@tvm.script.ir_module
class MyMod:
    @T.prim_func
    def matmul(A: T.Buffer((128, 128)),
               B: T.Buffer((128, 128)),
               C: T.Buffer((128, 128))):
        # better layout for B
        Bphysical = T.alloc_buffer((16, 128, 8))
        for i0, j, i1 in T.grid(16, 128, 8):
            with T.block("B-layout-convert"):
                Bphysical[i0, j, i1] = B[i0 * 8 + i1, j]

        for i, j, k in T.grid(128, 128, 128):
            with T.block("compute"):
                with T.init():
                    C[i, j] = 0
                C[i, j] = C[i, j] + A[i, k] * Bphysical[k // 8, j, k % 8]

    @R.function
    def main(x: R.Tensor((128, 128))):
        w0 = R.const(shape=(128, 128))
        w1 = R.const(shape=(128, 128))
        lv0 = call_tir(matmul, [x, w0], (128, 128))
        lv1 = call_tir(matmul, [lv0, w1], (128, 128))

In the next transformation, it rewrites the program by lifting the weight layout conversion out, arriving at the following form:

@tvm.script.ir_module
class MyMod:
    @T.prim_func
    def pre_layout_convert(B: T.Buffer((128, 128)),
                           Bphysical: T.Buffer((16, 128, 8))):
        for i0, j, i1 in T.grid(16, 128, 8):
            with T.block("B-layout-convert"):
                Bphysical[i0, j, i1] = B[i0 * 8 + i1, j]

    @T.prim_func
    def matmul_physical(A: T.Buffer((128, 128)),
                        Bphysical: T.Buffer((16, 128, 8)),
                        C: T.Buffer((128, 128))):
        for i, j, k in T.grid(128, 128, 128):
            with T.block("compute"):
                with T.init():
                    C[i, j] = 0
                C[i, j] = C[i, j] + A[i, k] * Bphysical[k // 8, j, k % 8]

    @R.function
    def main(x: R.Tensor((128, 128))):
        w0 = R.const(shape=(128, 128))
        w1 = R.const(shape=(128, 128))
        # can be constant folded
        w0x = call_tir(pre_layout_convert, [w0], (16, 128, 8))
        lv0 = call_tir(matmul_physical, [x, w0x], (128, 128))
        ...

On their own, small transformations such as the layout transformation shown here seem somewhat inconsequential. However, by expressing them in Relax + TIR, they can be composed to achieve considerable optimizations. For example, after applying layout transforms to the rest of the operators in the model, some neighboring layout transforms could be cancelled to arrive at an optimized program given the model and hardware constraints.

4. Future Opportunities after Unity Connection

The TVM Unity Connection milestone demonstrates how TVM users can break down the complex problem of optimizing a tensor program into small building blocks, each one conceptually accessible to the users who typically perform those optimizations. TVM Unity also brings additional opportunities to further simplify the design and resolve some of our key pain points. This section briefly describes these opportunities, which are considered beyond the TVM Unity Connection milestone.

Recall that TVM Unity Connection does not seek to replace the overall default flow exposed by tvm.relay.build and tvm.build. Instead, the unity connection milestone will initially expose most of the Unity features so that early adopters can try out the new techniques, give feedback, and continue to develop Unity.

As features stabilize, we plan to gradually introduce them to the default flow. To enable users to migrate gracefully, a Relay→Relax converter is under development that will enable users to import existing Relay programs into Relax.

Our goal is to bring TVM Unity to the community incrementally with minimized disruption. This RFC is the first step. TVM Unity also opens a path to simplifying and enhancing many tasks of interest. We now discuss these opportunities at the architectural level to provide high-level clarity on the overall Unity approach. We will open future RFCs to solidify these projects in more detail and involve relevant community members in the process.

BYOC

Right now BYOC is implemented as a hook in the TECompiler. Extending BYOC itself with new capabilities would amount to adding additional complexity to the hooks themselves. With the unified abstraction, we can simplify BYOC to an IRModule⇒IRModule transformation.

For a given MyMod, we want to run BYOC on conv_relu and replace it with a version implemented by an external compiler.

@tvm.script.ir_module
class MyMod:
    @R.function
    def conv_relu(x: R.Tensor((1, 10, 32, 32)),
                  w: R.Tensor((10, 20, 3, 3))):
        with R.dataflow():
            lv0 = op.conv2d(x, w, padding=(1, 1))
            gv0 = op.relu(lv0)
            R.output(gv0)
        return gv0

    @R.function
    def main(x: R.Tensor((1, 10, 32, 32))):
        w0 = R.const(shape=(10, 20, 3, 3))
        lv0: R.Tensor((1, 20, 32, 32)) = conv_relu(x, w0)
        ...

The transformation takes the module, runs a code generator for the conv_relu subgraph, and agrees on an external global symbol name (say conv_relu_lib) to call into it:

@tvm.script.ir_module
class MyModWithBYOCExtern:
    @R.function
    def main(x: R.Tensor((1, 10, 32, 32))):
        w0 = R.const(shape=(10, 20, 3, 3))
        lv0 = R.call_dps_packed(R.extern("conv_relu_lib"), [x, w0], (1, 20, 32, 32))
        ...

Here we change conv_relu to a call into an external PackedFunc, where inputs and outputs are allocated on the caller side and passed to the library function. Semantically, call_dps_packed expands to:

def call_dps_packed(func, inputs, out_shape):
    out = alloc_tensor(out_shape)  # output allocated by the caller
    func(*inputs, out)             # destination-passing style: output comes last
    return out

Additionally, the BYOC pass will generate an additional runtime.Module that contains an implementation of conv_relu_lib, which is attached to the MyModWithBYOCExtern.attrs["external_mods"] attribute as per the current BYOC convention. The final build will compile the main function of MyModWithBYOCExtern along with the TIR functions.

There are many advantages to unifying BYOC as an IRModule⇒IRModule pass:

  • We can choose when to run BYOC, and interleave multiple BYOC runs if necessary.
  • Smart search and auto-tuning can be built on top as separate passes that make the decision and dispatch without overburdening the infrastructure.
  • The overall connection point is the IRModule spec: since we represent BYOC results as calls into PackedFuncs plus runtime.Modules in the external_mods attribute, we can explore many ways of smart BYOC planning without increasing the complexity of the interface.

Framework Fallbacks and Library Integration

Framework integration is another important capability that gets simplified through the unity connection. Right now we face challenges supporting the broad set of operators that frameworks support. One approach is to quickly fall back to the framework libraries themselves. In other cases, we need to leverage in-house libraries, such as cuDNN. Previously, the choice of libraries was hidden behind dispatching strategies in various places and tied to a single build pipeline, so adding support for a new fallback operator involved changing multiple places.

Unity resolves the problem with first-class support for the TVM FFI in the graph, so we can generate direct calls into environment library functions. The example below calls into libtorch.

@tvm.script.ir_module
class MyModWithLibTorch:
    @R.function
    def main(x: R.Tensor((1, 10, 32, 32))):
        w0 = R.const(shape=(10, 20, 3, 3))
        lv0 = R.call_dps_packed(R.extern("torch.conv2d"), [x, w0], (1, 20, 32, 32))

torch.conv2d is registered as a PackedFunc in the runtime environment and leverages DLPack to exchange data between the TVM runtime and PyTorch in a zero-copy fashion. We can go further and automate the generation of such environment bindings, supporting most of the library functions that torch supports and falling back automatically when an operator is not available. A hand-written binding is sketched below.
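
As an illustration, such a binding could look like the following minimal sketch (the registration name and the padding value are assumptions matching the example above; an automated generator would emit wrappers like this):

import torch
import torch.utils.dlpack
import tvm

@tvm.register_func("torch.conv2d")
def torch_conv2d(x, w, out):
    # DLPack gives zero-copy views of the TVM NDArrays as torch tensors.
    tx = torch.utils.dlpack.from_dlpack(x.to_dlpack())
    tw = torch.utils.dlpack.from_dlpack(w.to_dlpack())
    tout = torch.utils.dlpack.from_dlpack(out.to_dlpack())
    # Destination-passing style: write the result into the preallocated output.
    tout.copy_(torch.nn.functional.conv2d(tx, tw, padding=1))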

Similar approaches can be used to directly integrate libraries such as cuDNN and ArmComputeLib, or to build passes that generate calls into those library functions.

Solidifying AOT

Centralizing things around IRModule also gives us an opportunity to simplify the assumptions in AOT. Specifically, there is an opportunity to spec out what AOT expects as a final IRModule form.

@tvm.script.ir_module
class MyMod:
    @T.prim_func
    def matmul(A: T.Buffer((128, 128)),
               B: T.Buffer((128, 128)),
               C: T.Buffer((128, 128))):
        for i, j, k in T.grid(128, 128, 128):
            with T.block("compute"):
                with T.init():
                    C[i, j] = 0
                C[i, j] = C[i, j] + A[i, k] * B[k, j]

    @T.prim_func
    def main(x: T.Buffer((128, 128)), out: T.Buffer((128, 128))):
        w0 = T.alloc_const(shape=(128, 128))
        w1 = T.alloc_const(shape=(128, 128))
        lv0 = T.alloc_stack((128, 128))
        call_packed(matmul, x, w0, lv0)
        call_packed(matmul, lv0, w1, out)
        ...

The above code is an example. By solidifying the IRModule spec, we can simplify our development by making assumptions explicit in the form of IR, and build connections (transformations) from the IRModule that contains graph-TIR to this form.

Customize Fusion

As we start to experiment with more advanced optimizations, there is usually a need to customize various behaviors in the pipeline. Fusion is one such example. Right now the fusion passes support fusing elementwise operations onto complicated operations such as conv2d. Let us imagine that we want to customize fusion to support fusing two conv2d operations together. Hacking the fusion engine to move the needle on cases like this has been a non-trivial task.

Unity enables a more incremental path towards customizing such fusion, with a goal to test hypotheses quickly and unblock our product needs, while continuing to evolve towards general solutions.

Step 0 Manually annotated fusion

@tvm.script.ir_module
class MyMod:
    @R.function
    def conv_conv(x, w0, w1):
        R.attrs({"primitive": 1})
        with R.dataflow():
            lv0 = op.conv2d(x, w0, padding=(1, 1))
            lv1 = op.relu(lv0)
            gv0 = op.conv2d(lv1, w1, padding=(1, 1))
            R.output(gv0)
        return gv0

    @R.function
    def main(x: R.Tensor((1, 10, 32, 32))):
        w0 = R.const(shape=(10, 20, 3, 3))
        w1 = R.const(shape=(20, 20, 3, 3))
        lv0: R.Tensor((1, 20, 32, 32)) = conv_conv(x, w0, w1)
        ...

The first step is to have the developer manually generate the above program (with the help of tools like IRBuilder) in the ingestion phase to annotate conv_conv. The fusor will take this information and respect the decision made by the user.

This is a first step to test the end-to-end performance hypothesis without even hacking into the fusor itself.

Step 1 Pattern-Based Fusion Composed with GenericFusor

As a second step, we do not aim to change the GenericFusor. Instead, we write a simple pattern-based fusion pass that detects the conv_conv patterns of interest and generates code in the above form:

  • IRModule⇒Conv-Conv Fusor⇒GenericFusor

The generic fusor respects the results from pattern-based fusors. Notably, the Conv-Conv fusor is much simpler to implement (likely a few days of engineering cost) and there is no engineering complexity from changing GenericFusor. This step already enables conv-conv fusion in production if necessary.

Step 2 Generalization

As we accumulate more patterns, this step generalizes and updates GenericFusor to support the general cases (e.g. conv1d-conv1d fusion, or fusion of any two conv ops). This step can be done separately without blocking immediate needs.

Customized Operator Support and Direct TOPI Integration

As we start to bring in more ML models, there is a need to enable a broad spectrum of models, and ease of support for customized operators is a common need. Unity offers irbuilder.emit_te(topi.matmul, inputs), which effectively takes a TOPI function and constructs a call into the relevant TIR function. This direct TOPI integration enables us to quickly build custom ops through TVMScript/TE without surgical changes to the codebase, as sketched below.
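
The following is a minimal sketch, assuming the BlockBuilder API from the relax development repo (exact names and signatures may change as it is upstreamed):

from tvm import relax, topi

bb = relax.BlockBuilder()
x = relax.Var("x", relax.TensorStructInfo((128, 256), "float32"))
y = relax.Var("y", relax.TensorStructInfo((256, 128), "float32"))
with bb.function("main", [x, y]):
    # emit_te generates a TIR function from the TOPI compute and inserts a
    # call_tir to it in the Relax function being built.
    lv = bb.emit_te(topi.matmul, x, y)
    bb.emit_func_output(lv)
mod = bb.get()  # IRModule containing both the Relax main and the generated TIR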

Evolving Default Build

This RFC proposes a first step for us to establish Unity Connection as an optional end-to-end compilation flow of TVM, keeping the default build flows of tvmc and relay.build unchanged. This section describes possible approaches to evolve the default build to leverage more of the unity components. Note again that this is only a discussion of possible approaches and not part of this RFC itself; it is included to provide relevant context on future opportunities.

The exact approach to take and detailed discussions shall happen in dedicated RFCs for each incremental change to the default build.

One approach to enabling unity connection is to start from the low-level part of the default build, enabling a greater number of models to go from Relay (which optimizes graph ops) to Relax. We will then bring more capabilities to the unity connection while simplifying some of the low-level components.

The overall components can be brought into the default pipeline in units of features. We can start with TIR codegen and lower-level graph-TIR-related optimizations. It is important to note that unity connection can help simplify some of the default build solutions, as discussed in the other opportunities in this section.


It is very exciting to see the Unity milestone coming! :rocket::rocket::rocket:

I look forward to exploring the new opportunities that Unity will unlock, especially the unified abstractions and composable compilation flow. I hope this will be a nice way to efficiently try out new ideas that have been difficult before, such as joint optimization across different abstractions and BYOC components.


Thanks, @tqchen, for bringing this up. TVM Unity provides more flexibility for performance improvement and new hardware integration. Here I'd like to share some of my experiences using TVM Unity:

Cross-layer Optimization

Layout is a cross-layer element for end2end models, influencing both graph-level representation and low-level TensorIR optimization. Most existing works (e.g., TensorRT) optimize layout by having human experts specify an “optimal” layout. However, a fully automated solution for layout rewriting (both weight layout and data layout) becomes possible with the TVM Unity infra.

Interactive Transformations

Interactive transformations were introduced by TensorIR first and have received lots of positive feedback. It would be an exciting milestone if we could use one programming language (i.e. TVMScript) for end2end models, with every transformation working around one central concept: the IRModule.

Also, Relax will solve some of the limitations of Relay, for example dynamic shapes and, in the future, training infra. Happy to see this pre-RFC and looking forward to the following upstream commits :slight_smile:


It is exciting to see this strategy RFC! Here I'd like to share some of my experience and expectations for TVM Unity:

Advanced Semantics

Specifically, I look forward to the ability to easily represent advanced semantics beyond dataflow, like control flow, in-place updates, and side effects. This would unlock a lot of new opportunities like training and dynamic workloads, and bring TVM to the next stage.

Modularized Lowering Pipeline

Also, I expect a more modularized and incremental lowering pipeline, so that developers can easily experiment with their own transformations without modifying the whole codebase. This would lower the bar for developers and make it easier to bring in new features, so that TVM can benefit from a broader community.

Looking forward to those, and I support this RFC as a PMC member.


Thanks! @tqchen

Really excited to see the announcement of this milestone. In the past few years, I've met so many talented people in the TVM community and learned a lot from this great open source project.

It's also a great pleasure for me to have TVM as a key component in my work. Currently we benefit a lot from BYOC and AutoScheduler to build a powerful engine for our AI workloads.

Besides these plans for improving kernel performance, better treatment of dynamic shapes, and unified abstractions, from an industrial perspective there are still other things we can improve:

Out-of-the-box usability

Even though we have detailed code comments, documents, tutorials, and well designed APIs in TVM (at least I think the community tries to design them well), it's still hard for new users to start learning and using TVM.

For learning TVM, the MLC course (wow! wonderful!) is now guiding more people to understand what ML compilation is and the basics of compilers.

For using it, I have to say that even as one of the core developers of Ansor, it still took us a long time to really bring it from paper to a real industry workflow.

I think developers would be glad to see more examples around Unity:

  • More runnable tutorials about features in MetaSchedule/TensorIR, and what's different from the old TE way
  • Maybe build a model zoo for classic models with Relax/MetaSchedule/TensorIR? Then anyone could use it to simply reproduce SOTA performance, learn how everything works, and participate in improving it
  • A kernel zoo to benchmark the performance between MetaSchedule/TensorIR and other libs (MetaSchedule vs cuDNN, CUTLASS on CUDA, …)?

Unity on runtime

We have several basic runtimes in TVM, for example the static-shape GraphRuntime and the dynamic-shape VMRuntime…

I'm not sure if Relax has considered schemes like compiling the static-shape parts of a whole graph to the GraphRuntime and running it inside the VM, or packing a subgraph that is dynamic internally but static at its inputs/outputs for the VM and running it inside the GraphRuntime.

To extend this idea, any runtime / module can be composed like toy bricks!

Unity on upper-level capabilities like quantization…

Relay & TE are enough for us to optimize and lower a quantized AI model (in FP16/INT8 or lower precision) to executable objects, and I'm sure Relax/MetaSchedule/TensorIR can do better on performance.

While it's now a pain point for us to implement our own quantization algorithm inside TVM, maybe it would be easier to just bring in the abilities of other mature tools? A workflow like:

  • relay graph → processed by other tools/frameworks → relay graph → compile & lower → …

Sounds like another kind of BYOC :grinning_face_with_smiling_eyes:


Looking forward to changing our industry pipeline from AutoScheduler/TE to MetaSchedule/TensorIR and feeding our engineering work back to the community!


It’s great to see this RFC. TVM Unity has been very helpful to me in my development. Let me share some of my experiences of using TVM Unity.

The first great point of TVM Unity is the cross-layer interaction. I have been working on layout rewriting for a long time. Implementing weight layout rewriting (as @Hzfengsy mentioned before) used to be frustrating because we could only hint the compute graph with the possible optimal layout using some hacky methods. With TVM Unity, mostly thanks to the Relax component and the cross-layer interactions, I'm able to do layout rewriting with a combination of simple passes.

Another benefit of TVM Unity I've witnessed is the unified composable compilation flow. During my development, it's common for me to make small modifications to the IRModule and see how they change some of the attributes. It was once hard and hacky, but after TVM Unity was introduced, things became much simpler. Using just a single pass, I can transform the IRModule and inspect the intermediate form clearly in the form of TVMScript.

Looking forward to seeing more community members try out TVM Unity and put it into practice.


Great to see TVM Unity achieving a new milestone! Thanks @tqchen and the other authors for this RFC. I firmly believe that the idea of TVM Unity will bring machine learning compilation to the next level.

Lowering the Threshold of Machine Learning Compilation. With unity and the Python-syntax-based TVMScript representation of the core data structure IRModule, we can bring different levels of program abstraction into one unified representation, which makes it much easier for newcomers, especially ML scientists and engineers, to learn the key components and abstractions of machine learning compilation, and further understand the compilation workflow. In this way the threshold of learning and understanding machine learning compilation and TVM gets lowered, and consequently more people will use TVM and join our community to further improve it.

Envisioning Graph-Level Auto-Tuning. Currently Meta-Schedule mainly focuses on operator-level (that is, TensorIR-level) tuning. But with different levels of abstraction coexisting in the same IRModule, in the future it is possible to bring Meta-Schedule to the graph level, by trying different operator fusion strategies, different layout rewrite strategies, etc. By first generating potential graph-IR templates and then generating follow-up operator-IR templates, tuning a whole network seems more possible than ever.

Support for Emerging Application Compilation. With dynamic shape being an important feature of the Relax IR in unity, machine learning applications with irregular data patterns (for example, models with sparse tensor computation) can be better supported across the whole compilation stack. It is true that the infrastructure needs to be further improved in order to support new applications, but the type and shape inference mechanism of Relax greatly reduces the effort needed to support such applications on top of the current stage. Along with this come possible graph-level optimizations for operators with irregular tensors, which I think could be designed and implemented in a more convenient way.

The plan of unity opens broader room for the whole community to explore, and provides much convenience for us to learn, use, develop, and improve TVM. Let's work together to land TVM unity :laughing:!


Thanks to @tqchen and everyone who contributed to TVM Unity!

We have been waiting for TVM Unity, especially Relax, and are excited to see this RFC. I'll briefly introduce the pain points of deep learning compilation on our DSA, and what TVM Unity can bring to us.

  1. How to write operators. It's hard for a DSA to have a language like CUDA, since most DSAs are SIMD architectures, not SIMT. So the low-level language of a SIMD DSA is full of SIMD intrinsics, which are very challenging for low-level compilers like LLVM to emit automatically. Writing ops with low-level languages is cumbersome; TVMScript is a better way if we (DSA vendors) can do auto-vectorization/tensorization in TIR.

  2. How to fuse larger subgraphs and do more aggressive graph scheduling. High bandwidth and large memory capacity are difficult to get for a DSA, so we need to do more aggressive graph scheduling and stitching for bigger subgraphs. In the past, we could not fuse a compute op with an IR-builder/TVMScript op; with the new TVM Unity, it's possible to do this. Also, we need to interact between relay and tir when doing optimizations; for example, we decide where to do copy elimination on relay, and then apply it on tir. With TVM Unity, we can do this more conveniently.

  3. Dynamic shape support. This would be a big complement; lots of customers care about our dynamic shape solution. With Relax, we will support a DSA dynamic shape solution. Can't wait.

  4. Auto-schedule solution with TVM Unity. For now, we still write TIR schedule templates for ops. Exploring a DSA auto-schedule solution with TVM Unity is a sure thing to do.

I believe that we can cook up a better DSA deep learning compilation solution with TVM Unity.


Thanks to @tqchen for bringing this up. I love the idea of TVM Unity, which unifies many scattered optimizations as IRModule transformations.

Regarding future opportunities, I would like to name a few besides the items mentioned before:

Distributed Abstraction

Tensor compilers such as DISTAL and SpDISTAL compile programs to Legion, where users can specify how to distribute the workload and the communication patterns via schedules.

In deep learning, distributed computing is important not only for tensor expressions, but also for computational graph partitioning. Alpa gave us an example of how compiler techniques could benefit automatic parallelization. Supporting a distributed abstraction would be the first step towards distributed training with TVM.

Sparse/Irregular Abstraction

SparseTIR is our recent effort to support sparse/irregular computation in TVM, though so far we have only studied tensor-level IR support. Considering its role in TVM Unity is interesting:

  1. The performance of sparse operators is greatly influenced by formats; nevertheless, changing the sparse format would influence the computation of the preceding and following kernels in the computational graph, so searching for the best format rewrite rules requires global information about the computational graph.
  2. Larger space for kernel fusion: some operators, such as convs with different kernel sizes, cannot be fused together; however, with sparsity support at the TIR level, we can describe the fusion of kernels with different shapes as a single kernel performed on a ragged tensor, which might improve overall performance when each kernel is light.
  3. Extending the sparse annotations in Relax: MoE (Mixture-of-Experts) leverages sparsity at the computational graph level rather than the tensor level. If we view both of them in TVM unity, the only difference is the IR level (TIR or Relax), and we face similar challenges in both settings (e.g. load balancing).

Program Rewrites such as FlashAttention

I recently read the FlashAttention paper and found the idea quite interesting: to avoid materializing the full attention matrix, which helps reduce GPU memory footprint, FlashAttention fuses softmax(QK^T)V into a single operator. FlashAttention tiles the operator into blocks, and the computation on each block is performed in SRAM instead of HBM, thus improving memory efficiency.

The difficulty of tiling the fused operator comes from the softmax, whose value is not local to blocks. FlashAttention exploits the idea of local softmax: first compute a block-wise softmax value, then rescale the result after computations on all blocks have finished.

The idea of local softmax is neat; my question is whether we can automatically discover how to tile fused operators like FlashAttention. This should be doable if we analyze the body of tensor-level expressions.


Excellent announcement; I hope TVM can become more modular and customizable.

It's great to see Unity finally coming into a more complete picture/strategy. I was excited about it at the talk last year. As a community member and also a (former) developer of some components listed in the RFC, I think I can share a few thoughts on some aspects of the Unity design.

BYOC

The current implementation of BYOC is actually more like a little DSL itself. It first asks users to use different approaches to annotate the graph or kernels with hardware-specific features. Then, it runs a few passes to clean up the annotations and form some external functions (i.e. the ones that are wrapped in a friendly way using TVMScript above). Finally, some logic is attached to TECompiler to separate and compile TVM code and external functions to create a monolithic fatbin.

While some passes in the flow follow the IRModule->IRModule transformation, the whole process is coupled with the optimization and compilation flow of relay.build, which we have tried to refine a few times to improve code maintainability and help users (particularly HW vendors) more easily bring their in-house codegen.

I like the Unity design of offering users more flexibility to implement their own functions and use an external PackedFunc to invoke them. I think this makes optimizations on the external functions more convenient and binary packaging easier as well.

Dynamic Shape

This was really a big headache before Nimble was invented. But I think some problems were still not fully resolved, and it would be great if Unity could solve them all together.

  • The first problem was the support of TF dynamic arrays (i.e. TensorArray). We used quite a bunch of ADT auxiliary functions to express the TensorArray semantics. This raises the bar for code readability by itself, plus it brings some execution overhead since more bytecode instructions are emitted.
  • Second, efficient codegen for dynamic shapes was not fully supported. I think DietCode attempted to solve this problem. Will that be integrated in the future, or are we exploring some new techniques?
  • Third, automatically tuning kernels with dynamic shapes remains a challenge. It would be great if this were considered by MetaSchedule.

Other comments

  • Fusion. As a person who has experience with quite a few accelerators, I think providing the possibility of explicit fusion is quite helpful. This gives users the flexibility to apply different fusion rules (even customized ones) easily, and the current fusion pass can still be used as a complementary tool (I think it is already quite challenging for users to customize their own fusion rules on the current pass). Pattern-based fusion can probably borrow some ideas from our current pattern matching. Allowing users to conveniently specify patterns should be considered as well, IMHO.
  • Framework fallback. There were some similar attempts in the community and industry before, IIRC. Some projects wanted this to reduce the operator coverage effort and quickly enable model support. I think this is a nice and useful addition to the project and it will benefit many users.

BTW, do we have a concrete roadmap for when we will bring these pieces to the community and TVM users? In addition, I think Relay compatibility is a good strategy since there are still many systems deployed using Relay. My two cents: we can probably provide a flag (e.g. --enable-relax) to users so that the Relay->Relax conversion can happen transparently.


Thank you @zhiics. In our overall strategy, we would like to bring the first minimum piece (i.e. the unity connection milestone) as a first step. I also agree that as the compilation evolves, we can provide a flag that enables Relay-to-Relax conversion, allows some of the Relax optimizations to happen, and gradually improves some of the Relay pipeline.


Thanks everyone for the great discussion so far; here is the formal voting thread on adopting the strategy: https://github.com/apache/tvm/issues/12651