[DISCUSS] TVM Core Strategy for Emerging Needs

Background

It is a great and challenging time to be in the field of AI/ML. Over the last year, we have witnessed a great number of innovations with the arrival of foundational models, including stable diffusion models for image generation, Whisper for voice recognition, GPT, and open LLMs (Llama 2, MPT, Falcon, RedPajama).

We have learned a lot of lessons throughout our attempts to support the ML/AI ecosystem over the past five years. The opportunity is great for us if we act swiftly in the right time window. This post aims to capture those lessons and discuss core strategies moving forward to support continuous innovation and emerging needs, such as dynamic shape modeling, stable diffusion, and large language models.

High-level Goals

  • G0: Enable innovation and growth to support emerging needs, such as new compilation strategies, structured optimization, and distributed settings.
  • G1: Integrate components organically with clear core abstraction.
  • G2: Connect to and amplify existing ML engineering ecosystems, including libraries like cutlass/CoreML and frameworks like PyTorch.

Past Lesson: A build-centric approach is not sufficient

In the past, most of our work has taken a build-flow-centric view. The high-level model compilation is organized as a build flow and presented to users as a closed box. This approach served us to some extent in addressing our past needs.

The main issue with this approach appears when we start to support emerging needs. As we work to solve new challenges, we introduce new mechanisms into the build flow, such as BYOC, memory planning, shape handling, and library dispatching. The following figure depicts the build-centric approach we took in the past.

There are several limitations when we use this approach to tackle the growing set of emerging needs:

  • Each new mechanism is coupled with the build flow in a fixed way. We often need to manage the interaction between mechanisms (e.g., TE scheduling, backend dispatching, and shape handling).
  • It takes time to try out quick changes, such as dispatching a single layer to a new op library.
  • Customizing new operators involves touching all the mechanisms along the build flow, including the graph operator, lowering, and sometimes the runtime.

The primary source of complexity comes from two factors: (a) the necessarily growing set of mechanisms needed to support emerging needs; (b) the implicit assumptions each mechanism imposes on the build flow and the complexity of their interactions. Each mechanism can come with its own set of data structures and configurations and serves a subset of the community's needs. One possible attempt to tackle the problem is to freeze the overall build flow as much as possible and avoid introducing new mechanisms. This could be a fine approach for some of the existing needs. However, as discussed in the beginning, the AI/ML ecosystem evolves much faster, and the key to success is to come up with infrastructure that not only works for today's needs but is extensible to the new demands coming up.

Unity Core Abstraction Centric Strategy

As discussed in the previous section, our primary concern is to capture our past lessons and bring a unified approach to emerging needs. We list the design principles as follows:

  • D0: A core abstraction that serves as the bridge for all emerging needs.
  • D1: A universal deployment runtime that works with the IR abstraction for general library extension.
  • D2: Each new mechanism is built on top of the core abstraction and runtime.
  • D3: Specific flow composition happens on top of the core, enabling quick evolution.

D0 means that compilation is not tied to a specific set of pipelines. Instead, we focus on a core abstraction (unity core) that can capture the input/output of most optimizations. This abstraction is representable through TVMScript. With D0 and D1 in mind, the overall structure of TVM unity becomes more “flat”. Every state of an IRModule has a reasonably “minimal” build — if modA contains unoptimized loops, building it will simply result in a module that runs the unoptimized loops; if modB does not tie memory allocations together, the resulting build will simply allocate tensors dynamically at runtime. The additional mechanisms, such as memory planning or loop optimization (scheduling), are all expressed as IRModule transformations.
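To make the “flat” property concrete, here is a hedged sketch (assuming the unity branch's relax.build and relax.VirtualMachine APIs; MyModule and its input shape are illustrative placeholders) showing that an IRModule can be built and run at whatever optimization state it is currently in:

import numpy as np
import tvm
from tvm import relax

# A sketch, not a definitive recipe: MyModule is any TVMScript IRModule, whether
# or not memory planning, fusion, or scheduling transformations have been applied.
ex = relax.build(MyModule, target="llvm")   # minimal build of the current state
vm = relax.VirtualMachine(ex, tvm.cpu())    # universal deployment runtime
input_array = tvm.nd.array(np.random.rand(8, 4096).astype("float16"))
out = vm["main"](input_array)               # compiled functions are directly callable

Optimizations such as memory planning simply rewrite the IRModule before this build step, rather than being hard-wired into it.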

The main benefit here is a great improvement in composability and development productivity — we will be able to apply BYOC and different approaches to TensorIR transformations and compose them together. We will also be able to use the same mechanism to integrate different libraries and compilation approaches.

Finally, not every transformation needs to be aware of each other, since the IRModule and the core abstraction are the common ground among transformations. We can build customized transformations outside the main repo for a specific workload for quick experimentation, then bring some of the lessons back into a common set of utilities.

To realize these goals, we need the core abstraction to be able to represent and optimize the following key elements for emerging needs:

  • N0: First-class symbolic shape support: the ability to track and analyze shape relations.
  • N1: The ability to represent computational graphs, tensor programs, and libraries together.
  • N2: The ability to extend with first-class structural information, including first-class support for multi-GPU/distributed settings and structured sparsity patterns.

Based on our past lessons and the current technical state, we pick Relax and TensorIR to form the core abstraction, with additional first-class support for a universal deployment runtime, which we will discuss in a later section.

Relax naming interpretation. We know that there are different backgrounds about how to interpret the namespace relax. Based on the current state of the community, we use the name relax to refer to “computational graph abstraction with relaxed assumptions for emerging needs”. Specifically, N0, N1, and N2 all align with this perspective. This is an evolved view of our previous computational graph designs (nnvm and relay), which came with a stronger set of assumptions tied to a build-centric approach. Next, we will list a few examples of how the unity abstraction-centric approach plays out for emerging needs.

E0: BYOC and Library Dispatch

Library dispatch, or more generally BYOC, refers to the approach where we replace a local function or a sequence of operators with library calls.

@tvm.script.ir_module
class Before:
    @R.function
    def main(x: R.Tensor(("n", 4096), "float16"),
             w: R.Tensor((4096, 4096), "float16")):
        with R.dataflow():
            lv0 = R.mm(x, w)
            gv0 = R.relu(lv0)
            R.output(gv0)
        return gv0

@tvm.script.ir_module
class After:
    @R.function
    def main(x: R.Tensor(("n", 4096), "float16"),
             w: R.Tensor((4096, 4096), "float16")):
        with R.dataflow():
            lv0 = R.call_dps_packed(
                "cutlass_matmul", [x, w], R.Tensor(("n", 4096), "float16")
            )
            gv0 = R.relu(lv0)
            R.output(gv0)
        return gv0

call_dps_packed enables us to call into the cutlass_matmul function, which can be generated by BYOC or simply registered to the runtime if there is a fixed signature.
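For quick experimentation, such a packed function can also be registered to the runtime directly from Python. Below is a hedged sketch (the NumPy body is an illustrative stand-in, not the actual CUTLASS integration) following the destination-passing convention that call_dps_packed expects, where the output tensor is passed in as the last argument:

import numpy as np
import tvm

@tvm.register_func("cutlass_matmul")
def cutlass_matmul(x: tvm.nd.NDArray, w: tvm.nd.NDArray, out: tvm.nd.NDArray):
    # Destination-passing style: write the result into the preallocated `out`.
    # A real integration would dispatch to a CUTLASS kernel; NumPy stands in here.
    res = np.matmul(x.numpy(), w.numpy())
    out.copyfrom(res.astype(out.dtype))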

E1: TensorIR Scheduling

The original TE operator scheduling couples compute declaration and loop scheduling. The set of AutoTVM templates, or TOPI schedules, is tied to each of the operators. Additionally, TE scheduling is a separate mechanism. Following the abstraction-centric approach, we instead build scheduling and optimization as IRModule transformations.

These transformation-based scheduling rules contain the following steps:

  • Analyze the loops to detect relevant patterns, such as reduction and tensor contraction.
  • Apply (possibly target-dependent) transformations based on the analysis, and create a targeted search space derived from the workload/target, or a few tunables expressed in MetaSchedule.
  • If necessary, we can also use block-specific tags that enable more customized rules that do not depend on the overall heuristics.

By default, we favor out-of-the-box rules that give good performance. Tunables can be defined to further explore performance. In such cases, we separate tuning from the application of the tuned logs. Specifically, the build flow will contain an ApplyHistoryBest step that looks up and transforms the related TIR functions, and such applications work in the same way as out-of-the-box heuristic transformation passes.
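To make the idea concrete, here is a hedged sketch of a heuristic scheduling rule expressed as an IRModule-to-IRModule pass (the pass name, the block/function names "matmul"/"mm", and the simple parallelization decision are illustrative assumptions, not an actual rule in the codebase):

import tvm
from tvm import tir

@tvm.transform.module_pass(opt_level=0, name="NaiveScheduleRule")
def naive_schedule_rule(mod, ctx):
    # A real rule would analyze every PrimFunc, detect patterns such as
    # reductions or contractions, and branch on the target.
    sch = tir.Schedule(mod)
    block = sch.get_block("matmul", func_name="mm")  # assumed names
    i, j, k = sch.get_loops(block)
    sch.parallel(i)  # an out-of-the-box heuristic decision
    return sch.mod   # the transformed IRModule

Applying tuned logs (e.g., via ApplyHistoryBest) works the same way: it is simply another pass in the sequence rather than a special stage of the build flow.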

E2: TensorIR and Graph Co-optimization

Importantly, the core abstraction needs to enable co-optimization across loop level and computational graph operations.

import tvm.script
from tvm.script import tir as T, relax as R

@tvm.script.ir_module
class IRModule:   
    @T.prim_func   
    def mm(
        X: T.Buffer(("n", 128), "float32"),
        W: T.Buffer((128, 64), "float32"),
        Y: T.Buffer(("n", 64), "float32")
    ):
        n = T.int64()
        for i, j, k in T.grid(n, 64, 128):
            Y[i, j] += X[i, k] * W[k, j]

    @R.function
    def main(
        X: R.Tensor(("n", 128), "float32"),
        W: R.Tensor((128, 64), "float32")
    ):
        n = T.int64()
        with R.dataflow():
            lv0 = R.call_tir(mm, (X, W), R.Tensor((n, 64), "float32"))
            gv0 = R.relu(lv0)
            R.output(gv0)
        return gv0

The above code example shows how we can mix a computational graph and a TensorIR program (mm) together. We will be able to build transformations that co-evolve both parts (see the sketch after the list below), including, but not limited to:

  • Enable TensorIR to suggest certain preprocessing decisions (e.g., layout) back to the graph level.
  • Analyze the TensorIR for possible fusion opportunities.
  • Lift allocations in TensorIR into the graph level to enable global planning.
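As a hedged illustration of the fusion bullet above, such co-optimizations compose as ordinary passes; the pass names below come from the unity branch's relax.transform namespace at the time of writing and may evolve:

import tvm
from tvm import relax

# A sketch: annotate each TIR function's pattern kind, make graph-level fusion
# decisions, then merge the fused sub-graphs back into single TIR functions.
seq = tvm.transform.Sequential([
    relax.transform.AnnotateTIROpPattern(),
    relax.transform.FuseOps(),
    relax.transform.FuseTIR(),
])
optimized_mod = seq(IRModule)  # IRModule is the example module defined above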

Universal Deployment Runtime

Besides the core abstraction, the runtime plays an equally important role in enabling emerging needs. The TVM core runtime contains the following elements:

  • R0: A set of first-class core data structures (objects, functions, NDArrays) that can be universally embedded and accessed in the users’ languages of choice, including C++, JavaScript, and Python.
  • R1: The PackedFunc convention that enables generated code and libraries to call into each other.

We believe that such a minimal and universal runtime is critical to our approach. Looking at the high-level picture, there are roughly two ways to think about compilation.

A0: fully internalized approach If we look at the traditional use of languages like gcc, the runtime is less important, as everything is internalized. The result of compilation is a simple executable that handles everything in the deployment flow:

gcc -o main input.cc
./main

A1: open, interoperable, and integrated approach In a more interoperable approach, the result of compilation is different: it is a module that contains a collection of functions that are directly accessible in environment languages like Python. Each of those functions can take and return advanced in-memory data structures, such as GPU NDArrays or even torch.Tensor (via DLPack). The application is usually built around these compiled functions, but also has supporting mechanisms to compose them (e.g., make them work with torch.Tensor).

import tvm.script
from tvm.script import tir as T, relax as R

@tvm.script.ir_module
class IRModule:   
    @R.function
    def prefill(
        X: R.Tensor(("n", 128), "float32"),
        params: R.Object
    ):
        ...

    @R.function
    def decode(
        X: R.Tensor((1, 128), "float32"),
        params: R.Object
    ):
        ...

mod = mlc_compiler.build(IRModule)
res0 = mod["prefill"](input_array, params)
res1 = mod["decode"](decode_input, params)

Recap of approaches listed above

  • A0: fully internalized approach
  • A1: open, interoperable, and integrated approach

Additionally, the philosophy of A1 also means we need to enable first-class integration of functions from different languages, such as customized torch code or CUTLASS CUDA kernels, depending on user needs.
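As one hedged example of such integration (the registered name my_torch_softmax is hypothetical, and the DLPack capsule exchange is one possible wiring), a piece of custom torch code can be exposed as a runtime packed function that compiled modules can call:

import torch
import tvm

@tvm.register_func("my_torch_softmax")  # hypothetical name referenced from an IRModule
def my_torch_softmax(x: tvm.nd.NDArray, out: tvm.nd.NDArray):
    # Zero-copy exchange through DLPack, then run the custom torch code in place.
    x_torch = torch.from_dlpack(x.to_dlpack())
    out_torch = torch.from_dlpack(out.to_dlpack())
    out_torch.copy_(torch.softmax(x_torch, dim=-1))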

While both A0 and A1 can be fine approaches in a traditional setting, and most traditional compilation stacks start with A0, in the context of ML/AI we find that it is important for us to take A1 to be successful.

This remark is based on our observation that the ML ecosystem benefits from collaboration on various fronts, where compilation and integration can happen organically at different levels. A1 also places fewer restrictions on developers, as they can take the language of their choice for deployment (and sometimes for optimizing part of the workload), and we can bring the infrastructure to them.

Finally, taking A1 enables an incremental development path — we can always start with custom-defined data structures and code for key optimizations such as the KV-cache, while gradually evolving to advanced code generation techniques behind the same interface.

Going through A1 does place stronger needs on the runtime, as the runtime must provide the R0 and R1 elements. We already have a strong TVM runtime design that can serve as the foundation.

We remark that A1 is generally a must-have for emerging applications like LLM applications, where the application layer needs to have a broad set of interactions with the generated model (such as prefill, decode, or advanced batching updates).

For special scenarios, e.g., when the runtime environment has very limited resources, we might enable a more restricted set of A1 features (e.g., limit the kinds of dynamic data structures). Such decisions can be left to community members to bring scoped modules that interact with the core abstraction.

Enabling the Core Strategy for the Community

This section talks about the mechanisms for bringing the unity core abstraction centric approach to emerging needs. We acknowledge that the build-centric approach can handle some of the existing needs, so we would first enable the unity core abstraction to co-exist with the existing modules. Existing modules will continue to function in their respective namespaces and support some of the existing needs.

With the unity core abstraction centric view in mind, we anticipate that future usage of the TVM stack will involve customization and improvements for different verticals (such as LLMs, stable diffusion) while sharing the common abstraction principle as the bridge and common ground.

The complexity of emerging needs will go into the unity core abstraction centric approach and be managed in a unified fashion, as the abstraction-centric design leaves great room for continuous innovation in areas such as structured sparsity, distributed execution, and other settings.

The adoption of the approach can depend on the vertical and the interest of the community. We anticipate that new foundational models (LLMs, stable diffusion, distributed workloads) will start from the unity core abstraction.

Depending on community interest, we also anticipate the community enabling some of the existing verticals through the unity abstraction centric approach on a per-vertical basis. We provide tools like the relay translator to facilitate such transitions. Additionally, we also bring frontend importers such as PyTorch and ONNX that support first-class dynamic shape import.

We do not anticipate a one-to-one mapping from existing approaches to the new one in a fine-grained fashion, since the nature of the core abstraction centric approach is to provide an alternative, more flexible approach to the overall problem. Many problems, like operator fusion and memory planning, get simplified in the new approach. Instead, we will think about the vertical needs (e.g., supporting image labeling) and enable each vertical using the new methodology. Enablement on a per-vertical basis also empowers sub-communities to make their own calls on when and what to do.

As an open source project, the execution of the strategy of course depends on the community. We will assess the state depending on community engagement and needs, and empower more people to have their verticals supported. Due to the fast-moving nature and diverse needs, we will empower a more extensible approach with a strong foundational principle that ties things together (as things are centered around the abstraction).

Because we are also community driven, we anticipate growing importance of emerging needs and alignment in that direction due to the community’s interest — as shown in our latest surveys and discussions around foundational models. This approach allows us as a community to continuously assess the situation on a per-vertical basis and make timely, informed decisions based on technical developments and the state of the ML/AI ecosystem.


We would love to see everyone’s ideas on how to better support foundational models and other emerging directions, and how we can position ourselves in the new ML/AI ecosystem. You are more than welcome to check out the posts below, where there is a great set of ideas.


This is not part of the strategy, but it is useful background context, so we bring it up as part of this thread.

Possible Timeline to Release TVM Unity

As of now, tvm unity is being developed as a branch in the tvm project. Because of the level of new features being brought in, as well as the ML/AI landscape shift we are in, it is best to view unity as a major change in the set of approaches to reflect the trends in the AI/ML ecosystem.

As per our past approach, we would like to take a good amount of time here to have a sufficient understanding and discussion about our overall technical strategy. In the meantime, we would also like to do it in a timely fashion without draining extra energy from the community, since foundational models, as per our recent survey, are a strong need for more than 90% of community members and are something we would like to support in a timely manner.

Likely, we will look at the Nov and Dec timeframe, so we get at least one month, plus the past year’s effort of bringing awareness of unity. Please bring up discussion threads, ask questions, and let us continue to answer possible questions. We also welcome ideas about other concrete action items we can take to empower foundational models in a timely fashion.

Given the change here and different views in the community, we still view unity as a revolution with impact minimization in mind. There are several ways we can bring unity into a release while acknowledging the revolutionary nature of the change.

  • T0: Directly release from the unity branch, effectively making the unity branch a product branch; this would require the majority of the community’s support.
  • T1: Change the current codebase main to the unity branch, ensuring the main modules are incorporated. This would require the “codebase replacement” process and a 2/3 majority.
  • T2: Bring tvm unity as a separate product that is different from the tvm product.

If there is no majority community support, but still a significant set of community members who would like to take the new philosophy, a common course of action in the OSS ecosystem is to enable T2.

Considering the current level of support, T1 is likely the most practical approach that allows us to empower the needs of foundational models in a timely manner. Of course, the final decision falls to the community itself.

Minimizing impact of using the unity branch on users The community has taken a great amount of effort to minimize disruption. Specifically, roughly weekly, the changes in main are brought via git merge into the unity branch. If you are using the current features, directly using the unity branch will likely continue to work, as these modules are preserved.

Engage in unity and related discussions There have been a lot of amazing discussions over the past year around what we can enable. We would love to welcome community members to continue discussions and bring up concrete ideas on how we can enable emerging needs together.


As a potential measure of community support, the majority of PRs are now targeting the unity branch. In the past 30 days, 57% of PRs (75 out of 132) have been on the unity branch. It looks like it’s also trending upward, as the past 90 days show 49% of PRs (213 out of 436) targeting the unity branch.

# Main branch
$ git log --after='2023-08-27' --oneline main | wc --lines
57
# Unity branch, excluding main
$ git log --after='2023-08-27' --oneline unity ^main | wc --lines
75

With that in mind, I’d view the unity branch as the de-facto main, and would support the T1 approach to make it be the actual main branch as well.


Hi @tqchen, can you please make this document easier to read by adding reminders of the definitions of letter/number combinations such as G0 when using them? I struggle to read these long documents where I have to keep referring back to what precisely a G0 is, especially when it gets interspersed with many other letter/number combinations.

Agreed with @Lunderberg. Also to note that most of the major features submitted to main are eventually and exclusively used in unity, for example, the NCCL/RCCL integration in the CMake system. As of now, developers usually have to submit two parallel PRs to both branches, or wait for another merge from main to unity to happen to actually use those features, which is quite inconvenient.

Therefore, the T1 solution, i.e. switching unity to main, will ultimately benefit development productivity and help us evolve with velocity.

Thanks for the input. It is likely this boils down to personal preference. The main purpose of labeling is to reduce the overhead of expanding items and keep key points organized; indeed, the goal is to get everyone to focus on some of the key issues and dissect them if needed. This also aligns with most scientific writing, where citations are labeled.

One way that might help is to first go through these labeled key points, then do another read. Having these common key points in mind also helps us build common conversations and common ground.

In this particular case, G0 is not referred to in the text, so the reference is likely about something generic. But if you want to discuss any of the points, or if any labeled points are less clear, feel free to bring it up so we can expand them further in the post.

This isn’t a personal preference. I have a neurodivergent condition, and I’m asking for a reasonable accommodation which may benefit others who are less capable of such advocacy.

Just did a pass to update the post so that most references to a label now reside close by, fitting on the same screen, while still being consistent with the scientific writing style being used, which should be helpful for others. In the meantime, please feel free to bring up particular parts that would benefit from further clarification, and I am happy to discuss particular technical points.


Thanks @tqchen for always being super sympathetic to the personal condition of every community member; we are always striving to be the most inclusive community :heart:

I agree that the unity branch should become the default main branch. As LLMs become so popular, it is confusing for users who have to switch to the unity branch after they git clone tvm locally.

In my personal feeling, TVM’s momentum was at a peak around 2019-2020 but went down afterwards. I can’t speak for others, but for myself, I couldn’t keep contributing to TVM after 2021 because my team switched gears to work on distributed training, which was popular at that time, and we felt it would be time-consuming to propose and upstream training support (which turned out to be true if you look at Relax). Eventually, we worked on our own codebase and gradually moved away from TVM upstream.

On the other hand, I recently found that TVM’s momentum has gone up again along with the brand “MLC-LLM”, just because a few community members made the effort to release a high-performance Llama-2 chatbot app powered by TVM unity. To me, this is a role model that shows the importance of catching up with the workloads/applications most people care about in order to stay state of the art.

Consequently, I would be happy to see TVM unity become the official main branch, so that it can further accelerate the development of LLM-related features. For example, serving an LLM with tensor parallelism and quantization will be extremely important in the upcoming 6 months. NVIDIA is going to release TensorRT-LLM next week; it would be a big bump if MLC-LLM is able to be competitive.

My two cents.


I support TVM unity becoming the official main branch.

I would like to share some of my experiences with tvm. In the past, there were several pain points when we tried to adopt TVM. For instance, quickly experimenting with an optimization was not easy and often required hacking the entire compilation stack. Moreover, the lack of support for training, dynamic shapes, etc. limited its application. From my perspective, the current unity approach has addressed these concerns and is crucial for increasing adoption.


I think we all agree that bringing in the feature development happening in unity is desirable, for the many reasons listed and stressed over and over in the posts above.

The strategy to bring in and incorporate the unity changes, however, seems less optimal and less inclusive, given that nobody in the community is expected to keep in sync with changes introduced in unity as a development branch.

Just taking the path that is most convenient for a subset of the community (like moving unity to be the new main, or nominating unity as the main branch) would force unknown and potentially breaking changes into the codebase, while leaving developers and end-users to their own devices to figure out what works and what doesn’t. I don’t think it is the onus of the development community to keep track of all development branches.

I would like to suggest that, together with T0, T1 and T2, we consider a T3 and take “unity” as a regular contribution, merging it into main in organised chunks of features that would make more sense in the git history.

T3: Extract major features from unity and raise feature-level PRs against main. Fast-track them onto main once documentation and testing are in place, and current main CI passes.

This is also an opportunity for us as a community to understand the changes coming in, organise commit messages, describe features in an inclusive way, make sure documentation exists, etc.

A positive side is that many of these changes are local to specific namespaces, so having them integrated in patches shouldn’t be a lot of work. From an engineering point of view, it looks more in line with common practice in upstream projects and much more inclusive as well, when compared to taking just a blob of ~200k+ lines into the project.

More importantly, I don’t think it is right to impose such a bulk change on the project this way. At the same time, I understand the time pressure some might feel, but that should be put in the perspective of keeping the project coherent and visible to the wider community (not only unity branch maintainers).

The main reasons for T3 (a.k.a. orderly moving unity features to main) are:

1. Diff delta / opportunity for code review

Before becoming the unity branch in the TVM repo, there were hundreds of changes which the broader community had no chance to review (the main cause being that it was hosted somewhere else for a long time).

According to a rough comparison done on GitHub, the diff between unity and main today (including submodules) is something along the lines of “888 changed files with 211,567 additions and 8,684 deletions”.

2. Bundled features

In tandem with “unity” as a branch, there are quite a few features being brought in, such as ”MSC Graph”, “Disco”, etc., which are largely unknown to the wider community, so moving them orderly into main is an opportunity to advertise them better to the community.

Remember that all these features/tests become the responsibility of all to maintain once they are integrated into the codebase.

3. Test coverage of the branch

Many tests and CI validation parts are currently disabled in the unity branch, which will require current contributors to rework features and to figure out whether their support works at all in the new landscape, which seems hostile to current contributors who have visibility of ”main”.

Moving changes orderly from unity to main has the advantage of at least guaranteeing minimum compatibility with the current test base, so that we can keep the “revolution with impact minimization in mind” commitment stated above.

The cost of the change will always be somewhere; I’m just advocating that we don’t simply pull the plug on features people are currently relying on in main for lack of testing and replace them with a less tested/reviewed version.


Thank you for your input. Different members of the community have different interests, and there are common shared goals. One thing to be mindful of is that there are different ways forward, and some may work for certain modules, circumstances, and the current community demands.

In the end, it is also about the real, practical execution by the community. Let me first expand on some of the points here.

Responsibility of maintenance

The reality is that the responsibility of maintenance mostly falls on the members who contributed and developed the modules and related features. Such merit is developed through contributions. So naturally, most communities lean towards empowering the decisions of these members as long as the contribution is scoped in modules.

As a community with diverse needs, nobody is required to be aware of all modules to contribute. In practice, we also pay attention to the modules we normally engage with. Based on the past year’s contribution activities, members who designed and maintained the core modules, such as arithmetic, the relay IR design, TensorIR, AutoTVM, and some of the graph IR nodes, are responsible for and still actively maintain these modules in main.

Maintainers of core modules in the main branch are also now maintaining unity. They all went the extra mile to bring changes to existing modules back to main to fulfill their responsibility, even though that incurred a good amount of overhead over the past year.

It is the same group of people who would continue to ensure that the features in main keep working, through module isolation and careful changes. We would lean more towards them when thinking about ways of incorporation.

Preserving main modules As stated above, the development practice as of now brings all the changes in the existing modules to main. Changes to these modules are covered by tests in main. Other tests are turned off in unity due to cost, as many modules are not structured through a unit-test approach, which brought undesirably long CI times, and changes in unity are not related to these modules. We will, of course, work together to ensure that the relevant modules don’t break and work to fix them when that happens. These are concrete conversations that we can have to enable the community.

Bring community awareness

Some of the suggestions boil down to the common goal of community awareness. There are many ways to achieve that goal; the community has been doing this over the past year and will continue to do so in the coming months:

Sending modules in chunks to main would have been a great evolution approach one year ago. The community worked on some of these approaches when the unity branch was established. At the current state, however, this approach will likely be much less practical or desirable due to the energy and complexity involved, while also maintaining the goal of bringing timely foundational model support.

In practice, we need pragmatic ways to achieve the goal of bringing foundational model support in time. Ideally, such support should have landed months ago, and timing matters, especially given the current ML/AI space. There is a great risk of lost momentum and relevance. We would already have run into that risk in practice if we hadn’t empowered unity development.

That being said, we would love to see the community spend energy on more awareness. Please ask questions about the modules and bring suggestions and discussions, as we have repeatedly mentioned over the past year. We would love to work together to solve concrete questions like “here is how we bring BYOC”, and this energy is worth spending. There are likely months during which we can continue to do so, and likely more to come.

Action matters and speaks louder

Many discussions we have are about how to approach our goals, where naturally the community can have different opinions. Regardless of the conversation about how, we need concrete actions and execution to land — which in turn needs support from community members doing the ground work. Having a seemingly perfect approach won’t necessarily get us LLM support, or even get the community to collectively work toward that. We need real ground work to make things happen, and to empower the community to do that ground work.

Actions and results speak louder, and that is what users turn to the community for. We need to avoid the risk of being over-bureaucratic, as we do not command that the community take exactly one approach.

The reality is simple: insisting on one seemingly perfect engineering approach won’t give us solutions for LLMs, nor will it resolve real technical challenges at the core; the right architecture (and empowering the members who do the ground work to bring it) is the real key.

Following one specific approach is neither a necessary nor a sufficient condition for getting a “good or right project”. Approaches can only facilitate and empower the community towards the goal.

In the end, it is the community that comes and builds code, does the ground work, maintains the existing modules, and brings real LLM support. These are real works that every member sees and takes into consideration when we make the collective decision on ways to move forward. We all have to do real ground work to earn merit and, as a natural consequence, alignment from community members, which in turn is reflected in community strategy and collective choices that naturally empower this ground work.

It is good that we brought up these possible alternatives for the community to consider. Let us also work on the ground work to show the viability of each path, e.g., “here is one approach, and here are some examples of enabling LLMs or other needs through this approach”.

This extra information can help the community decide the path forward, taking it into consideration.

The last thread was a bit long (mostly addressing how). Admittedly, there are many possible approaches, and some of us prefer one over another. There is also a good amount of clarity about what the approaches are, and we have spent quite a bit of conversation about the how at a meta level over the past year. In the end, it is the community that collectively decides the approach to take based on the specific context.

I would encourage us instead to spend more energy here talking about concrete ground work, actions, and design conversations so we can realistically meet the needs. This includes examples like solutions for LLM serving, actions to maintain current modules (e.g., sending improvements to arith), and ideas on how to leverage the core strategy or better strategies to execute on our goals.

Many in the community would welcome these on-the-ground conversations. We love to bring Llama 2, diffusion models, and different codegens to our community. It would be much more fun and productive, and we trust the community to collectively figure out the how in each situation based on concrete, on-the-ground signals.

I am a contributor to both main and unity, as are most of the active contributors. More specifically, I initiated, contributed to, or maintained the following modules in main:

  • TVMScript
  • TensorIR
  • Relay
  • Runtime
  • MetaSchedule
  • TOPI
  • RPC
  • TE

So far, I love that the development in unity is focusing on the right thing: concrete ground work that is pushing the boundary of the open source world, empowering individual contributors to democratize the underlying techniques, making sure they are not monopolized by only one or two powerful companies, and enabling critical use cases like LLMs, Stable Diffusion, multi-GPU inference, KV cache, etc.

I also continue to maintain and bring contributions to the main modules shared above. I believe T1 is the best approach for working with the community to continue supporting the modules in both main and unity — I’d love to continue contributing this way under the T1 solution.

Anyway, to conclude, we should always focus on concrete ground work enabling use cases like LLM inference, diffusion models, etc., and I’d love to work together with all community members to make this process easier for all our contributors!


LLM is one of the central topics in AI today. As a result, it’s great to see TVM unity become the main branch, which means TVM has the ability to accelerate and deploy the most popular AI workloads.


T1 is the favored option for me. Transitioning to unity as the main branch will offer numerous advantages to our community. Users can readily explore distinct TVM unity features such as distributed inference, LLM, and stable diffusion support, while current main branch users and contributors can continue on that branch; in fact, the modules from the main branch are incorporated in unity. This shift to unity as the main branch would allow them to access more features, and the existing use cases should still work. We can incrementally enable the skipped test cases in unity. However, time-intensive tests, like those for the MXNet/TensorFlow1/Caffe2 frontends, might be best left disabled in unity to expedite our CI, which currently averages over 4 hours per PR in main.

I have concerns about T3. The unity branch currently has 600+ additional commits (210k+ lines of code) compared to the current main. Landing these commits through PRs from unity to main would be a lengthy process. Given the CI duration, potential change requests, review times, and other factors, it wouldn’t be surprising if this took much more than six months.


As an active participant in the TVM community, I began contributing in 2019, initially to TensorCores and subsequently to the TensorIR project.

I firmly believe that concrete ground work is crucial for the community’s collective growth. Consider TensorIR, for instance. The intricacies of its operation, including the details of ScheduleState and primitive implementations, remain largely unnoticed. Yet its popularity stems from its ease of use and its ability to meet emerging needs, such as tensorization.

A similar phenomenon should occur with unity. It provides foundational support for traditional models, much as Relay does, and MLC-LLM has demonstrated unity’s applicability to novel requirements. From a user perspective, unity is significantly more user-friendly than existing alternatives, which I anticipate will garner community approval.

In summary, our focus should be on making tangible contributions wherever possible.