Evolving and Modernizing the Tensor-level IR

Over the past year, we have successfully modernized the foundational ffi module, which is now useful to the broader community. This thread aims to start a discussion on modernizing the TIR components. Until now, we have relied on the user-defined schedule paradigm to transform code. While this approach remains useful in many domains, we are starting to see its limitations: user-defined schedules may not cover all possible optimizations, especially when programming the latest GPUs.

At the same time, we also recognize the strong value of the low-level TVMScript and TIR infrastructure. It serves as a foundational layer that enables writing kernels in Python, offers robust kernel code generation, and ships kernels with tvm-ffi. These values continue to grow today for both downstream frameworks and R&D purposes. Given this state, we believe it is a good time to rethink how TIR is structured. Specifically, I think we are moving towards the following two layers:

  • s-tir (schedulable TIR): This layer will contain the user-defined schedule and meta-schedule components. It will be decoupled from the core tensor-level IR and lower to it.
  • tir(next): We will evolve a new core abstraction that no longer relies on the schedule (see the sketch after this list).
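To make the proposed split concrete, here is a minimal sketch of how the two layers relate, written with today's TVMScript and tvm.tir.Schedule APIs. Treat it as an illustration of the layering, not the final tir(next) design:

```python
import tvm
from tvm.script import tir as T

# Core-layer view: a plain low-level function written directly in TVMScript.
@T.prim_func
def vec_add(A: T.Buffer((1024,), "float32"),
            B: T.Buffer((1024,), "float32"),
            C: T.Buffer((1024,), "float32")):
    for i in range(1024):
        with T.block("C"):
            vi = T.axis.spatial(1024, i)
            C[vi] = A[vi] + B[vi]

# s-tir view: the user-defined schedule transforms the function,
# then lowers into the core representation.
sch = tvm.tir.Schedule(vec_add)
(i,) = sch.get_loops(sch.get_block("C"))
bx, tx = sch.split(i, factors=[None, 128])
sch.bind(bx, "blockIdx.x")
sch.bind(tx, "threadIdx.x")
print(sch.mod.script())  # the scheduled program handed off to the core layer
```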

The high-level idea is that s-tir will continue to serve its current purpose and lower to the core layer. Freed from the schedule, the new core abstraction can become a lightweight structure focused on representing low-level programs, with the following goals:

  • G0: Enable all possible optimizations via low-level access
  • G1: Python-first scripting, with rich support for kernel programming needs (e.g. general control flow, first-class GPU threads and scopes); see the sketch after this list
  • G2: Robust code generation and connection to the broader ecosystem via tvm-ffi
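As one illustration of G1, kernels can be scripted directly with first-class GPU threads and plain Python control flow, without going through a schedule. This is a hedged sketch in today's TVMScript syntax; tir(next) would keep this direct style as the primary interface:

```python
from tvm.script import tir as T

@T.prim_func
def vec_add_kernel(A: T.Buffer((1000,), "float32"),
                   B: T.Buffer((1000,), "float32"),
                   C: T.Buffer((1000,), "float32")):
    # GPU threads are written directly rather than introduced by a schedule.
    for bx in T.thread_binding(8, thread="blockIdx.x"):
        for tx in T.thread_binding(128, thread="threadIdx.x"):
            # Plain control flow guards the tail elements.
            if bx * 128 + tx < 1000:
                C[bx * 128 + tx] = A[bx * 128 + tx] + B[bx * 128 + tx]
```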

This will serve the upcoming needs of supporting the latest GPUs. This post aims to align the community around this direction. We can start some refactoring in the new year to enable this modernization. Hopefully, we can continue to support the community while also making the codebase more useful to the broader ML systems community.


Thanks for the proposal. Having a concrete low-level abstraction is definitely helpful for making the compiler infrastructure robust and composable. I also support the Python-first approach, which opens it up to a broader community.

Do we have a preferred feature set, or some case studies, that the word "modernize" refers to?

Actually the initial thought was simple: make sure we lift out the schedule-related components, so the core can focus on good coverage for direct scripting and codegen. Specific feature sets that I think might be relevant include:

  • Support a good range of scripting needs; our recent support for step in ForNode is a good example (see the sketch after this list)
  • Improve error messages in host codegen, so it better fits tvm-ffi needs
  • Robust codegen backends and latest-hardware support, e.g. supporting nvrtc and enabling the latest GPU primitives such as Blackwell tmem and TMA support/handling
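For the step-in-ForNode item above, here is a tiny sketch of the scripting convenience it enables. I am assuming the Python-style range(start, stop, step) spelling here; the exact surface syntax may differ:

```python
from tvm.script import tir as T

@T.prim_func
def strided_copy(A: T.Buffer((16,), "float32"), B: T.Buffer((16,), "float32")):
    # Iterate with an explicit step, matching Python range semantics.
    for i in range(0, 16, 2):
        B[i] = A[i]
```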

Roughly, the goal is to make a clear module of “dumb representation and codegen” that guarantees parsing, printing, and codegen work reliably, so that smarter things (such as scheduling, tile-level programs, or autotuning) can be layered on top. Also happy to hear from everyone about other possible needs.
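One simple way to state that reliability guarantee as a test: printing a function and re-parsing it should be a lossless roundtrip. A minimal sketch, assuming the tvm.script.from_source entry point and structural equality checks that exist today:

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def copy(A: T.Buffer((128,), "float32"), B: T.Buffer((128,), "float32")):
    for i in range(128):
        B[i] = A[i]

# Print, re-parse, and check that the roundtrip is structurally lossless.
text = copy.script()
reparsed = tvm.script.from_source(text)
tvm.ir.assert_structural_equal(copy, reparsed)
```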