[DISCUSS] TVM Community Strategy for Foundational Models

Dear Community:

As we have all witnessed in the past half year, the world of AI/ML is evolving rapidly with the arrival of foundational models: Stable Diffusion for image generation, Whisper for voice recognition, GPT, and open LLMs (Llama, Vicuna, MPT, RedPajama, Falcon).

These pipelines have unique characteristics in aspects like memory cost, demand for extreme quantization, and dynamic shape becoming mandatory. They also involve more complex pipelines: things are no longer a single tensor-in, tensor-out situation, and there is great demand for more customization of how optimization, quantization, and deployment are done.

These factors bring the need for innovations and solutions that solve these problems, sometimes in different ways from the existing workloads we tackled before. Additionally, the rate of innovation in foundational models is high, requiring rapid evolution of our solutions to meet the modeling and optimization needs of modules related to foundational models.

All of these factors bring a need for open-source ML framework communities to think about their strategies. While it is not necessary to re-orient everything we are doing towards foundational models, there is certainly a great need for a strategy to enable and amplify thrusts in the TVM community for foundational models. A lot of that also requires us to bring innovative solutions into the TVM umbrella and support these emerging needs.

Of course, as an open community, this is a choice for the community to make. I am opening this thread to hear your thoughts, and I also think it is quite important to our relevance in the ML/AI ecosystem in general.

This is a discussion thread to gauge community members' intent. Specifically, would you like the TVM community to support and focus thrusts to innovate on foundational models (which, of course, is not exclusive of the other areas we already support)?

  • Yes, I'd love to support the community in enabling and amplifying thrusts for foundational models
  • I don't see a strong need


Please share your thoughts and suggestions in reply :slight_smile: I have also opened a short poll to gauge our overall interest.


There are several key features I believe are in urgent need to properly enable these workloads:

  • KVCache management, including appending to the KVCache, switching different KVCaches in and out (chat history), heterogeneous offloading of the KVCache between devices, etc.
  • Flexible integration of quantization algorithms, where each algorithm may have its own simulated data type (e.g. int4, NF4). Adding quantized operators the normal way across the TVM stack (which requires updating TE/TOPI/TIR/Relay/AutoTVM) immediately becomes infeasible if we aim to scale up.
  • Dynamic shape representation, management, and code generation across the stack; in particular, tracking the input/output relationship of each dynamic-shape subgraph. For example, in an LLM there are two independent dynamic shape variables n and m which we should distinguish.
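To make the first bullet concrete, here is a minimal Python sketch of the kind of KVCache interface being described. All class and method names here (`KVCache`, `KVCacheManager`, `append`, `swap_out`) are hypothetical illustrations, not an existing TVM API:

```python
import numpy as np

class KVCache:
    """Per-sequence key/value cache sketch: one growing array per layer."""

    def __init__(self, num_layers, num_heads, head_dim):
        # each layer holds a [seq_len, num_heads, head_dim] array that grows
        self.layers = [np.empty((0, num_heads, head_dim), dtype=np.float16)
                       for _ in range(num_layers)]

    def append(self, layer, kv):
        # append the K/V entries produced for newly decoded tokens
        self.layers[layer] = np.concatenate([self.layers[layer], kv], axis=0)

    def seq_len(self, layer=0):
        return self.layers[layer].shape[0]


class KVCacheManager:
    """Swap caches in/out so multiple chat histories share one device."""

    def __init__(self):
        self.active = {}     # session id -> cache resident "on device"
        self.offloaded = {}  # session id -> cache offloaded "to host"

    def swap_out(self, sid):
        self.offloaded[sid] = self.active.pop(sid)

    def swap_in(self, sid):
        self.active[sid] = self.offloaded.pop(sid)
```

A real implementation would move memory between devices rather than shuffle dictionary entries, but the surface area (append, swap in/out per chat session) is the part the compiler stack needs first-class support for.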

Targeting foundational models will provide motivation to improve our compiler infrastructure to handle the features mentioned in @junrushao’s reply, so I think it would be a good way to set priorities for other aspects of the project as well.


Foundation models are important workloads. By pushing local/server inference of LLMs to the extreme in the TVM stack, I believe we can push the resolution of long-standing pain points to a new stage, making TVM THE deep learning compiler for general scenarios, which is necessary for us to stay competitive (and alive) and continue to provide productivity.

To name a few:

1. Out-of-box kernel performance

Triton has gradually become the solution for people writing custom ops on CUDA. TorchInductor uses Triton to generate code for long-tail operators (reduction/elementwise) by default, and will add Triton GEMM to the search space if more tuning is turned on.

An LLM typically has two phases, prefill and decoding: prefill is bound by GEMMs and decoding is bound by (quantized) GEMVs, which are representative of GEMM ops and long-tail ops respectively. The PyTorch team tried to use TVM as a backend, but MetaSchedule took too long to tune and generated sub-optimal programs; the resulting performance was poor for a given amount of compilation time.
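The two phases reduce to different matrix shapes, which is why they stress different parts of a kernel generator. A numpy sketch (all dimensions here are illustrative, not from any particular model):

```python
import numpy as np

hidden = 64
W = np.random.rand(hidden, hidden).astype(np.float32)  # a projection weight

# Prefill: the whole prompt is processed at once -> a GEMM
prompt_len = 128
prefill_x = np.random.rand(prompt_len, hidden).astype(np.float32)
prefill_out = prefill_x @ W   # (128, 64) @ (64, 64): compute-bound GEMM

# Decode: one token is generated per step -> effectively a GEMV
decode_x = np.random.rand(1, hidden).astype(np.float32)
decode_out = decode_x @ W     # (1, 64) @ (64, 64): memory-bound GEMV
```

With 4-bit weights, the decode-side GEMV also has to dequantize on the fly, which is exactly the long-tail kernel shape Triton currently handles well by default.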

We see an interesting future for other, less popular backends, especially those equipped with unified memory, which allows larger models to run. It's necessary to build knowledge of kernel scheduling on these backends and to fully leverage TVM's advantage of transferring schedules between different backends.

2. More advanced operator fusion

TVM has generally used only vertical fusion at the graph level. We do see horizontal fusion (e.g. fusing 3 GEMMs into one larger GEMM) having an effect in LLM models.
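The 3-GEMM case is the attention Q/K/V projection, where three GEMMs share the same input and can be merged by concatenating the weight matrices. A small numpy sketch of the rewrite (dimensions are illustrative):

```python
import numpy as np

hidden = 32
x = np.random.rand(8, hidden).astype(np.float32)
Wq = np.random.rand(hidden, hidden).astype(np.float32)
Wk = np.random.rand(hidden, hidden).astype(np.float32)
Wv = np.random.rand(hidden, hidden).astype(np.float32)

# Unfused: three separate GEMMs that all read the same input x
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Horizontally fused: one larger GEMM over concatenated weights,
# so x is read from memory once instead of three times
W_fused = np.concatenate([Wq, Wk, Wv], axis=1)  # (hidden, 3*hidden)
qkv = x @ W_fused
q2, k2, v2 = np.split(qkv, 3, axis=1)
```

The fused form trades three kernel launches and three reads of `x` for one larger, better-utilized GEMM.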

Meanwhile, being able to fuse GEMM (Conv) → (norm) → GEMM (Conv) patterns and generate efficient code can be important for attention ops. Instead of relying on CUTLASS/FT on CUDA, we can transfer such patterns to other backends.
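For attention specifically, the GEMM → norm → GEMM chain is scores = QKᵀ, softmax, then multiply by V. A naive, unfused numpy version shows the pattern a fused kernel (FlashAttention-style) collapses into one pass so the intermediate score matrix never hits memory; shapes here are illustrative:

```python
import numpy as np

def naive_attention(q, k, v):
    # First GEMM: similarity scores, materializing a (seq, seq) matrix
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # The "norm" step: numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Second GEMM: weighted sum of values
    return weights @ v

q = np.random.rand(4, 8).astype(np.float32)
k = np.random.rand(4, 8).astype(np.float32)
v = np.random.rand(4, 8).astype(np.float32)
out = naive_attention(q, k, v)
```

Being able to express and fuse this three-op chain in the compiler itself, rather than pattern-matching to a vendor library, is what makes the pattern portable to non-CUDA backends.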

There have been lines of research on fusion algorithms in DL compilers, most of which are search-based. Here we focus only on default patterns that generally work across workloads, but we already know there are missing pieces for us to fill.

3. Distributed inference

No matter how good the quantization scheme and memory planning algorithms are, people always want to run larger models. Even with 4-bit quantization, a 70B Llama2 still requires 2x 40GB A100s simply to hold the model.
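The arithmetic behind that claim is straightforward (ignoring activations and the KV cache, which add further overhead on top of the weights):

```python
params = 70e9          # 70B parameters
bytes_per_param = 0.5  # 4-bit quantization = 0.5 bytes per parameter

weight_gb = params * bytes_per_param / 1e9
print(weight_gb)  # 35.0 GB for the weights alone

# 35 GB of weights on a single 40GB A100 leaves almost no headroom for
# the KV cache and activations, hence two cards in practice.
```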

Instead of swapping the model between much slower storage and GPU memory, unified memory provides an interesting solution here (a 64GB MacBook Pro can serve a 70B Llama2 on its own).

Another, orthogonal solution is distributed inference, and the two can be combined.

4. More hackable infrastructure and fewer new concepts in general

We have seen some recent DL compilers written purely in Python (TorchInductor, Hidet), which provides a much easier debugging and hacking experience for engineers.

We have also seen projects like llama.cpp and llama.c, which use pure C++/C to implement the whole model and its kernels. People are actively contributing to them, and I believe one important reason is that they are straightforward to understand by reading through the code; people who have little knowledge of how DL compilers generally work can still hack on and debug the infrastructure.

Being able to import and modify the model, insert new layers, substitute generated kernels with different implementations in a shader language, and change the operator fusion as smoothly as people expect, while at the same time providing reasonable debugging tools, like setting breakpoints and inspecting intermediate outputs (in Python), as people have been doing since the first day they started learning to program, can enable more people to come and contribute to our stack.


One thing I've thought about is asynchronous execution support in Relax. I don't know if this is already planned as part of either the heterogeneous execution or DistIR work, but I just wanted to mention it in the discussion.

Even though we have async support in TIR, async support at the graph level could open up many optimization opportunities, though it would of course need to be planned out properly.


Thanks everyone for the great thoughts :slight_smile: This is definitely something that relates to TVM's relevance in the ML/AI ecosystem, and it is great to see positive responses.


On a related topic, we think it is helpful to have process clarity for the community to make collective decisions that empower directions like foundation models. Given the positive response, I created the following proposal: [Process RFC] Clarify Community Strategy Decision Process


Linux Foundation Edge is trying to start a new org focused on AI, with the theme of big models. We will contact you soon to discuss possible collaborations.


Thanks everyone for chiming in. Please check [DISCUSS] TVM Core Strategy for Emerging Needs - #3 by tqchen for a concrete proposal.