[Unity] Schedule Needed while building relax model with GPU Target

Building a Relax model directly may produce "Did you forget to bind" hints, as there is no default schedule in the model.

MLC-LLM tends to use DefaultGPUSchedule(), for example:

mod_transform = tvm.tir.transform.DefaultGPUSchedule()(mod_transform)

but the tvm-unity test demos and the mlc.ai website tend to encourage users to write TVMScript and bind variables to threads manually, such as:

@tvm.script.ir_module
class MyModuleVecAdd:
    @T.prim_func
    def main(A: T.Buffer((1024,), "float32"),
             B: T.Buffer((1024,), "float32"),
             C: T.Buffer((1024,), "float32")) -> None:
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i in T.grid(1024):
            with T.block("C"):
                vi = T.axis.remap("S", [i])
                C[vi] = A[vi] + B[vi]

I wonder which approach will be mainstream in the future?

Hi @anders, thanks for the great question. Happy to explain things clearly here.

We know that a TIR function (e.g., main in your example) without thread binding cannot be built. This means we need to bind threads in some way. Right now in TVM-Unity, we have the following commonly used approaches:

  • A1. manually binding threads, either by writing a manual TIR Schedule that transforms the TIR function or by writing the final TIR function directly,
  • A2. using the DefaultGPUSchedule pass, as you mentioned,
  • A3. using MetaSchedule to tune the TIR function, which yields a tuned TIR function with threads already bound.

Each approach has its pros and cons:

  • A1 provides the most flexibility. With A1 we can bind threads in ways that may lie outside the default schedule space or the tuning space of A2 and A3. On the other hand, A1 requires some knowledge of both TIR and GPU program optimization.
  • A2 gives a quick way to make a TIR function runnable on GPU (and requires no knowledge of TIR or GPU programming). A2 binds threads naively: this can achieve good performance for purely spatial TIR functions, but is not ideal for more complicated workloads (like GEMM).
  • A3 uses auto-tuning, which can produce the best-performing TIR function, but the tuning process takes the most time (more than A2 and A1).

So, overall, we need to choose the approach according to our needs. For instance, in your example the vector-add function is purely spatial, so you can use the default schedule pass, which is both quick and effective. Right now, all of these approaches contribute to MLC-LLM in different scenarios - they are all powerful tools in TVM Unity.


Lastly, it is worth mentioning that our community is now pushing a more powerful default-schedule package, "dlight", which can give us much better performance and can be used out of the box. The scaffolding PR is already merged (https://github.com/apache/tvm/pull/15141), and we will discuss it in the TVM Unity open development meeting tomorrow: [Unity] Open Dev Meeting June 27th.


I wonder why DefaultGPUSchedule is not set as the default option. Wouldn’t it be more convenient if TVM automatically applied this default schedule when the user doesn’t specify thread and block bindings and the target is CUDA?

Because DefaultGPUSchedule has poor performance for compute-intensive operators like GEMM, the encouraged option for those operators is still an auto-tuning-based approach such as MetaSchedule (at the cost of searching for the best schedules, which takes a significant amount of time), as @MasterJH5574 mentioned.

Ideally, the new default schedule "dlight" will replace DefaultGPUSchedule, offering good performance at no auto-tuning cost.


I see. Thank you for your reply. That clarified some of my questions.