[RFC] Meta Schedule (AutoTensorIR)

  • Feature Name: Meta Schedule (AutoTensorIR)
  • Start Date: 2021-05-28
  • RFC PR: TBD (apache/tvm-rfcs#0000)
  • GitHub Issue: TBD (apache/tvm-rfcs#0000)

1. Summary

This proposal introduces Meta Schedule: a probabilistic scheduling DSL on TIR that unifies the approaches of AutoTVM and AutoScheduler (Ansor). Meta Schedule provides a pragmatic way to define the space of automatic tuning, extensibility to all possible TIR schedule primitives such as tensorization and loop partitioning, and customizability at every layer of the automation system.

Meta Schedule is our 3rd-generation automatic scheduling system.

2. Motivation

Scheduling and Design Space

In TVM TensorIR, optimization of a TensorIR program is done via a sequence of transformations. For example, we reorder loops for better locality, and we tensorize for specific hardware intrinsics. The process of invoking such a set of pre-defined transformations is called “scheduling”, and each transformation is called a “schedule primitive”. These primitives form a domain-specific language (DSL) for transforming TensorIR programs. The design space is the set of all possible schedules of a TensorIR program.

Problems with the Current Scheduling System

  • Manual schedule: Developers optimize their programs by manually invoking schedule primitives, i.e. exploring points in the design space with humans in the loop. This can be a tedious and error-prone approach, hence the creation of AutoTVM and AutoScheduler (Ansor).
  • AutoTVM: The automation system requires users to define “schedule templates” as the design space for each operator. Therefore, the approach does not scale to hundreds of operators.
  • AutoScheduler (Ansor): It automatically generates schedule templates as the design space, according to a set of predefined “search rules”. However, it is non-trivial to extend AutoScheduler to new schedule primitives (e.g. tensorization, loop partitioning, software pipelining).
  • The three systems above have isolated sets of APIs with several layers of their own abstraction, which are not only hard to learn, but also engineering-intensive to customize.

Benefits of Meta Schedule

  • Succinct syntax and APIs consistent with TensorIR scheduling, with no extra layer of abstraction.
  • Provides unified APIs for implementing manual schedule, AutoTVM and AutoScheduler (Ansor).
  • Extensibility to all the schedule primitives, including tensorization and loop partitioning. Almost no extra effort is needed to use a new primitive in auto-tuning.
  • The automation infrastructure is customizable across every layer.

3. Guide-level explanation

In this section, we describe the syntax of the meta schedule DSL and how it can be used to describe and auto-generate the design space.

3.1. Manual Schedule

Meta schedule APIs are almost the same as TE or TensorIR scheduling. Here is an example of a manual schedule for matrix multiplication:

# The loops i, j and k are obtained beforehand, e.g. via
#   i, j, k = sch.get_loops(sch.get_block("matmul"))

# Designate a set of tile sizes
i_tiles = [16, 8, 8, 8]
j_tiles = [16, 8, 8, 8]
k_tiles = [256, 8]

# Tile the loops according to the tile sizes
i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
k_0, k_1           = sch.split(loop=k, factors=k_tiles)

# Organize the loops into "SSRSRS" 6-level tiles
sch.reorder(
    i_0, j_0, # S
    i_1, j_1, # S
    k_0,      # R
    i_2, j_2, # S
    k_1,      # R
    i_3, j_3, # S
)

In this example, developers may tweak the tile sizes and measure the performance of the generated kernels to explore potential optimization opportunities.

Generally speaking, while writing a schedule, there are often some parameters that are hard to determine ahead of time, for example, tile sizes, unroll steps, or which tensor intrinsics to use. Developers may manually enumerate possible combinations of these unknown factors, and then pick the best schedule according to measurement results on their device.
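
For illustration, here is a minimal sketch of such manual enumeration, assuming a fresh schedule can be constructed per candidate; the candidate tile lists, create_schedule and build_and_measure are hypothetical:

import itertools

# Enumerate candidate tilings, measure each, and keep the best one.
best_time, best_tiles = float("inf"), None
for i_tiles, j_tiles in itertools.product(
    [[16, 8, 8, 8], [32, 4, 8, 8]],   # candidate tile sizes for loop i
    [[16, 8, 8, 8], [16, 16, 4, 8]],  # candidate tile sizes for loop j
):
    sch = create_schedule()  # hypothetical: construct a fresh schedule
    i, j, k = sch.get_loops(sch.get_block("matmul"))
    sch.split(loop=i, factors=i_tiles)
    sch.split(loop=j, factors=j_tiles)
    time = build_and_measure(sch)  # hypothetical: compile & time the kernel
    if time < best_time:
        best_time, best_tiles = time, (i_tiles, j_tiles)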

3.2. AutoTVM-style Design Space Description

Meta schedule extends the schedule DSL with sampling instructions. When included in a schedule, these instructions parametrize the schedule, turning it from a single deterministic point into a space supported by random variables (tile sizes, etc.), and thereby allow developers to describe the design space with meta schedule APIs.

We can extend the matmul example above to cover all possible tilings using these sampling instructions:

# Sample tile sizes
i_tiles = sch.sample_perfect_tile(i, n=4)
j_tiles = sch.sample_perfect_tile(j, n=4)
k_tiles = sch.sample_perfect_tile(k, n=2)
# Tile the loops according to the random variables
i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
k_0, k_1           = sch.split(loop=k, factors=k_tiles)
# Organize the loops into "SSRSRS" 6-level tiles
sch.reorder(
    i_0, j_0, # S
    i_1, j_1, # S
    k_0,      # R
    i_2, j_2, # S
    k_1,      # R
    i_3, j_3, # S
)

3.3. Composite Schedule

Each schedule primitive handles only a very basic IR transformation; for example, split only splits a loop into smaller ones. In the real world, the over-fine granularity of those primitives usually leads to repetitive and verbose scheduling code, as developers in our community have pointed out.

To counter this challenge, we allow users to register “composite schedules” that analyze the IR, and apply a set of schedule primitives correspondingly. For instance, a composite schedule may inspect a TensorIR block and decide whether we should call compute_inline on it. The composite schedule may use sampling instructions to fill in undecided choices.
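
As a rough sketch, such a rule could look like the following, where is_pure_spatial is a hypothetical stand-in for whatever analysis the rule performs:

# Hedged sketch of a composite schedule: inline a block if the analysis
# decides it is pure spatial (e.g. it performs no reduction).
def auto_inline(sch, block):
    if is_pure_spatial(sch, block):  # hypothetical analysis helper
        sch.compute_inline(block)
    return sch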

Our system also ships with some built-in composite schedules, including:

  • Multi-level tiling
  • Inline pure spatial blocks
  • Parallelize & vectorize & unroll
  • Auto tensorize
  • …

3.4. AutoScheduler-style Design Space Generation

AutoScheduler (Ansor) generates schedule templates by applying its SearchRules to each stage. Meta schedule treats a search rule as a composite schedule, and applies each composite schedule to each block of TensorIR to generate the design space.
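
Conceptually, the generation process could be sketched as follows, assuming each rule's apply returns one or more (possibly forked) schedules, as in the PyCompositeSchedule interface shown later; block_names is a hypothetical traversal helper:

# Hedged sketch: apply each composite schedule rule to each block;
# a rule may fork a schedule, so the design space grows as a union.
def generate_design_space(rules, init_sch):
    spaces = [init_sch]
    for rule in rules:
        for name in block_names(init_sch):  # hypothetical traversal helper
            next_spaces = []
            for sch in spaces:
                next_spaces.extend(rule.apply(sch, sch.get_block(name)))
            spaces = next_spaces
    return spaces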

3.5. Unifying manual schedule / AutoTVM / Ansor

In this section, we show that the design spaces induced by TE manual schedules, AutoTVM and Ansor are all subsets of what meta schedule can express, and that meta schedule further allows mixing those three styles to search jointly.

Manual schedule. The TE schedule is a special case of a meta schedule program, where there is no randomness introduced by sampling instructions. It is a single point in the design space.

AutoTVM (Template-based tuning). Writing one or more schedule functions in meta schedule, potentially with sampling instructions, is a natural representation of AutoTVM’s schedule templates (knobs). The PPL (probabilistic programming language, i.e. the schedule DSL with sampling instructions) generates one or more traces as the design space to explore.

AutoScheduler (Ansor, Template-free tuning). As mentioned in the previous section, application of composite schedule rules generates the design space, which is equivalent to Ansor’s sketch generation.

Mixing styles in design space definition. By taking the union of the spaces induced by the three special cases, our system allows developers to combine the generic rules that Ansor provides with operator-specific scheduling.

4. Reference-level explanation

In this section, we introduce the underlying techniques for the automation system to extract and explore the design space.

4.1. Execution trace as the design space

Trace. To represent the design space defined by the meta schedule DSL, the underlying system records all the instructions users applied to the schedule class, including sampling and schedule primitives. We call this list of instructions a trace.

Executing the example above results in the following trace:

Instruction 0. Sample-Perfect-Tile
Instruction 1. Sample-Perfect-Tile
Instruction 2. Sample-Perfect-Tile
Instruction 3. Split
Instruction 4. Split
Instruction 5. Split
Instruction 6. Reorder

The trace is not directly user-facing; it is a data structure inside the user-facing Schedule class that records the execution. The automation system extracts the trace and derives the design space from its sampling instructions.

Union of traces. Often a single trace is unable to represent the entire space. Therefore, more precisely, our system works on a list of traces, whose union constitutes the potential design space.

Fork a trace. When two different decisions in the scheduling process are equally promising for generating high-performance schedules, we allow forking the trace into two, and the design space is the union of the forked traces.
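
For example, a minimal sketch of forking, assuming the copy-based pattern on the Schedule class:

# Minimal sketch of forking a trace: copy the schedule (and its trace),
# then make a different but equally plausible decision on each copy.
sch_a = sch.copy()
sch_a.reorder(i_0, j_0)  # decision 1
sch_b = sch.copy()
sch_b.reorder(j_0, i_0)  # decision 2
design_space = [sch_a, sch_b]  # the union of the two forked traces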

4.2. Exploring the Search Space

Meta Schedule provides several built-in exploration strategies to search the design space, exhaustively or efficiently, for high-performance schedules.

Program replay. A simple strategy that replays the schedule program that generates the PPL, without taking advantage of anything the PPL provides.

Random search. Extracts the PPL, and repeatedly re-executes it by making purely random sampling decisions.

Cost-model-guided evolutionary search. A more efficient exploration strategy. We define two sets of rules:

  • Mutator: defines how to jump to a point’s “neighbor” in the design space
  • Postprocessor: sometimes it is non-trivial to statically determine the PPL, for example:
    • CUDA imposes a hard requirement that the number of threads must not exceed 1024, but this number may be a random variable that cannot be determined before actually executing the PPL. In this case, we write a postprocessor that errors out when the condition is not satisfied.
    • The number of outer loops to be fused together depends on their extents, which are random variables. In this case, we annotate the maximum extent allowed on the block, and do the actual fusion in a postprocessor.

Our evolutionary search algorithm uses mutators to find possible schedules in the design space, applies the postprocessors, and asks the cost model to predict their performance. After several iterations, the schedules with the highest scores are finally compiled and measured on device. Epsilon-greedy is used in this process to balance exploitation and exploration.
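
As an illustration, the epsilon-greedy selection step alone could be sketched as follows (a minimal sketch; the surrounding evolutionary loop, the mutators and the cost model are omitted):

import random

# Hedged sketch of epsilon-greedy selection: mostly exploit the cost
# model's top-scored candidates, but keep a random fraction to explore.
def epsilon_greedy_select(candidates, scores, num_picks, epsilon=0.05):
    ranked = [c for _, c in sorted(zip(scores, candidates),
                                   key=lambda pair: -pair[0])]
    num_random = min(int(num_picks * epsilon), len(candidates))
    picks = ranked[:num_picks - num_random]         # exploitation
    picks += random.sample(candidates, num_random)  # exploration
    return picks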

4.3. Python first for flexibility & customizability

We engineer the system so that all levels are decoupled and open to customization, aiming to provide a playground for developers to try out new ideas and potentially deliver performance improvements quickly.

While all the important APIs are implemented in C++ for efficiency, every part of the system can easily be switched to a customized Python implementation. For example:

Customize design space in Python. The design space can be described as a Python function that performs the scheduling:

def schedule_matmul(sch: Schedule) -> Schedule:
    i, j, k = sch.get_loops(sch.get_block("matmul"))
    # Sample tile sizes
    i_tiles = sch.sample_perfect_tile(i, n=4)
    j_tiles = sch.sample_perfect_tile(j, n=4)
    k_tiles = sch.sample_perfect_tile(k, n=2)
    # Tile the loops according to the random variables
    i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
    j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
    k_0, k_1 = sch.split(loop=k, factors=k_tiles)
    # Organize the loops into "SSRSRS" 6-level tiles
    sch.reorder(
        i_0, j_0, # S
        i_1, j_1, # S
        k_0,      # R
        i_2, j_2, # S
        k_1,      # R
        i_3, j_3, # S
    )
    return sch

Customize composite schedule in Python. We provide two ways to define a composite schedule in Python:

Method 1. A simple decorator that converts a Python function into a composite schedule:

@tir.as_composite_schedule(name="multi-level-tiling")
def multi_level_tiling(sch: Schedule, block: BlockRV) -> Union[Schedule, List[Schedule]]:
    ...

Method 2. Derive from PyCompositeSchedule, which provides extra functionality such as initialization:

class MultiLevelTiling(PyCompositeSchedule):
    def initialize(...):
        ...
    def apply(...):
        ...

Customize exploration strategies in Python. Developers can implement any search algorithm in Python as well, by deriving from PySearchPolicy.
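
A skeleton might look like the following, mirroring the PyCompositeSchedule pattern above; the method names here are illustrative rather than a final interface:

class MyRandomSearch(PySearchPolicy):
    def initialize(...):
        ...
    def generate_candidates(...):
        ...
    def update(...):  # e.g. feed measurement results back into the policy
        ...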

Other customizable components. This list includes:

  • Cost model
  • Database
  • Measure callbacks
  • Feature extractor
  • Program builder & runner
  • Analysis methods
  • …

In short, almost every component of the system is decoupled from the others, and extensions can be easily plugged in.

4.4. Upstreaming Plan

[M3a] Core infrastructure of the PPL

  • Instruction
  • Trace
  • Composite schedule
  • Sampler
  • Search policy
  • Design space generator

[M3b] Host-side search infra

  • Database
  • Cost model
  • Measure callback

[M3c] RPC-related search infra

  • Measure input, build result, measure result
  • Builder
  • Runner

[M4a] Implementation of rules

  • Various built-in composite schedules
  • Various built-in mutators
  • Various built-in postprocessors
  • Automatic tensorization

[M4b] Relay integration

5. Drawbacks

We are not aware of any drawbacks of the proposed system.

6. Rationale and alternatives

The system is designed with the principle of minimalism: unlike alternative solutions, it requires no changes to the existing codebase and no extra APIs to learn. It could therefore lower the bar for using automation systems.

Unifying manual scheduling, AutoTVM’s semi-automatic templates and AutoScheduler’s (Ansor’s) fully automatic sketch generation provides a flexible way to balance injecting new domain knowledge and relying on automation.

Flexibility in customization allows quickly trying out new tasks, new strategies and new hardware targets, without deep knowledge of the system.

7. Prior art

Tensor Expression (TE) in TVM is a DSL that decouples compute and schedule, which provides convenient ways to handcraft optimized kernels for different hardware targets.

TensorIR is the latest generation of TVM’s low-level IR. Its capability of eagerly applying schedule primitives opens the door for meta schedule, our proposed new-generation auto scheduling system.

AutoTVM is the 1st-generation automation framework in TVM; it requires developers to implement per-operator scheduling templates, and the system handles the tuning process.

AutoScheduler (Ansor) is the 2nd-generation automation framework in TVM, whose built-in rules can automatically generate schedule templates for almost all the operators on CPU, GPU, etc.

8. Unresolved questions

Supporting Control Flow and Assertions

Right now, the meta schedule DSL does not support control flow. Although we have not seen any real-world use case so far, it is possible that one could appear in future workloads.

A real-world issue we can foresee is that sampling may lead to invalid schedules on CUDA, e.g. the schedule results in a CUDA program that uses too much shared memory, too many threads, etc. In this case, we need to halt the program immediately, so introducing assertions may be helpful.
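
For instance, an assertion could be sketched as follows; sch.assert_ is hypothetical, while sample_categorical is one of the sampling instructions:

# Hedged sketch of a hypothetical assertion in the schedule DSL:
# reject this sample as soon as the thread limit is provably violated.
num_threads = sch.sample_categorical(
    candidates=[256, 512, 1024, 2048],
    probs=[0.25, 0.25, 0.25, 0.25],
)
sch.assert_(num_threads <= 1024)  # hypothetical: halts on violation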

9. Future possibilities

Unifying Manual Scheduling, AutoTVM and Ansor in TOPI

Meta schedule provides an idiomatic approach to unify the three existing scheduling APIs in TVM:

  • Manual schedules are meta schedules without sampling instructions
  • AutoTVM templates are meta schedules where knobs are replaced by sampling instructions
  • Each of Ansor’s search rules generates a snippet of a meta schedule

We further allow mixing different styles of scheduling and exploring the union space, which could help dispatch to different implementations.

Looks perfect. I have used Ansor for some time. Ansor is quite good for auto scheduling, but the tuning speed is very slow (more than 1 hour to tune a compute-intensive op such as Conv2D or Conv2DBackpropInput). Ansor does generate Conv2D or Conv2DBackpropInput ops faster than TensorFlow’s, but for less than 50% of the ops in most models I have tuned, and Ansor rarely generates Conv2D faster than TensorRT. So, for this new scheduling work, will you improve the tuning speed, like the work published by Facebook (https://proceedings.mlsys.org/paper/2021/file/73278a4a86960eeb576a8fd4c9ec6997-Paper.pdf)? And will this work generate programs faster than Ansor’s, and faster than the ops in DL frameworks such as TensorFlow, with higher probability?

Thanks @luchangli for asking!

Tuning speed and kernel performance could be improved in several directions, and I believe our system paves the way for them:

  • improve the system to allow faster search
  • a better cost model or search algorithm
  • more schedule primitives, like software pipelining and tensorization

Looking forward to having Meta Schedule in the mainline soon!

Should we be using the formal RFC process for this? (Submitting this RFC as a PR to the tvm-rfcs repo).

Yes, we should. I believe this thread is mainly for pre-RFC discussions.

Hi @junrushao, can you write this up as a pull request for the TVM RFCs repository?

@hogepodge Hey, I created an RFC in the official TVM-RFC repo: [RFC] Meta Schedule (AutoTensorIR) by junrushao1994 · Pull Request #5 · apache/tvm-rfcs · GitHub, but I’m not entirely sure whether it is in the desired format. Please feel free to step in, edit directly, or request changes 🙂

My apologies if I’m missing something simple, or if this is the incorrect place to ask this question.

I am trying to understand the current status of the integration of meta schedule in the main TVM branch. Based on my understanding of the RFC tracking issue, is it correct to say that the current implementation in the main branch cannot yet support the end-to-end tuning of a network via meta schedule (e.g., an example similar to this, but using meta schedule)?

Thanks!

@jkosaian We do have end-to-end tuning working on our local branches, but we will need a couple of weeks to upstream it. Please follow @zxybazh’s work closely 🙂

Yeah, thanks for paying close attention. The main framework of meta schedule has been upstreamed, and we are working on implementing more of the concrete classes to make it complete soon.

The RFC states the following as an “Unresolved question”:

Control Flow

The meta schedule DSL does not support control flow yet. Although there is no report of a real-world use case at the time of writing, it is possible that one could appear in future workloads. The best syntax for the control flow is not determined yet, but a working example could be TensorFlow’s tf.cond.

Do you mean something like this:

    # Designate a set of tile sizes
    i_tiles = [16, 8, 8, 8]
    j_tiles = [16, 8, 8, 8]
    k_tiles = [256, 8]

    # Tile the loops according to the tile sizes
    i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
    j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
    k_0, k_1           = sch.split(loop=k, factors=k_tiles)

    # Is this what you mean by control flow not yet supported?
    sch.cond(
        i_0 > 5,
        sch.reorder(i_0, j_0, i_1, j_1, k_0, i_2, j_2, k_1, i_3, j_3),  # true_fn
        sch.reorder(i_3, j_0, i_2, j_3, k_0, i_1, j_2, k_1, i_0, j_1),  # false_fn
    )

CC: @junrushao, @aca88, @paulpb, @cgerum, @SebastianBoblestETAS

@MJKlaiber Thanks for the discussion! Control flow is definitely an interesting topic, even if there is no real-world report yet 🙂

Yes, that’s what I mean by “no control flow support”. However, note that control flow brings up new challenges in expressing the branching logic, e.g. whether we express it via CPS, sub-traces, or a CFG, which leaves some room for discussion.

Don’t we already have the TIR If node?

Isn’t @MJKlaiber’s example about the conditional application of a scheduling primitive, and not about actual computation?

i.e. I would expect his code to generate the TensorIR of either of the reordered loops, and not of both.

@masahi Note that MetaSchedule isn’t part of TIR - it’s a scheduling DSL that manipulates TIR. Therefore, the If node that exists in TIR cannot be reused for the MetaSchedule DSL, because they are two different things (i.e. an IR vs a scheduling DSL).

@aca88 It would be interesting to think about generating two chunks of TIR that are dispatched at runtime using a schedule primitive, although my original intent was to use “control flow” to refer to schedule DSL-level control flow rather than TIR-level control flow.

I think @aca88 and @MJKlaiber are mainly interested in Schedule DSL-level control flow.

As @MJKlaiber’s example does not contain any sampling, I assume it should work fine as is, but the problem arises with the following example:

class MyScheduleRule(PyScheduleRule):
    def initialize_with_tune_context(self, context: "TuneContext") -> None:
        pass

    def apply(self, sch: tir.Schedule, block: tir.schedule.BlockRV) -> List[tir.Schedule]:
        if len(sch.get_loops(block)) != 3:
            return [sch]
        i, j, k = sch.get_loops(block)
        factors = sch.sample_perfect_tile(i, n=2)
        i_0, i_1 = sch.split(i, factors=factors)

        sch1 = sch.copy()
        i_0, i_1, j, k = sch1.get_loops(block)
        sch1.reorder(i_0, j, k, i_1)

        sch2 = sch.copy()
        i_0, i_1, j, k = sch2.get_loops(block)
        sch2.reorder(i_0, j, i_1, k)

        return [sch1, sch2]

Without support for conditional control flow, generating the two schedule templates seems unavoidable. A possible workaround might be adding the constraints as block annotations:

class MyScheduleRule(PyScheduleRule):
    def initialize_with_tune_context(self, context: "TuneContext") -> None:
        pass

    def apply(self, sch: tir.Schedule, block: tir.schedule.BlockRV) -> List[tir.Schedule]:
        if len(sch.get_loops(block)) != 3:
            return [sch]
        i, j, k = sch.get_loops(block)
        factors = sch.sample_perfect_tile(i, n=2)
        i_0, i_1 = sch.split(i, factors=factors)

        sch1 = sch.copy()
        i_0, i_1, j, k = sch1.get_loops(block)
        sch1.reorder(i_0, j, k, i_1)
        cond_block = sch1.blockize(i_1)
        sch1.annotate(cond_block, "extent_gt", 8)

        sch2 = sch.copy()
        i_0, i_1, j, k = sch2.get_loops(block)
        sch2.reorder(i_0, j, i_1, k)
        cond_block = sch2.blockize(i_1)
        sch2.annotate(cond_block, "extent_le", 8)

        return [sch1, sch2]

and then filtering on these constraints in a postprocessor:

class MyPostProc(PyPostproc):
    def initialize_with_tune_context(self, context: "TuneContext") -> None:
        pass

    def apply(self, sch: tir.Schedule) -> bool:
        mod = sch.mod
        print(mod)

        for inst in sch.trace.insts:
            if inst.kind.name == "Annotate":
                if "extent_le" in inst.attrs or "extent_gt" in inst.attrs:
                    block = sch.get_sref(inst.inputs[0]).stmt
                    cond = inst.attrs[0]
                    bound = inst.inputs[1]
                    extent = block.body.extent
                    print(cond, bound)
                    if "extent_le" in inst.attrs:
                        if extent > bound:
                            print("Rejected!")
                            return False
                    elif "extent_gt" in inst.attrs:
                        if extent <= bound:
                            print("Rejected!")
                            return False

        return True

If the constraints are mutually exclusive, it is probably better not to do any loop reordering in the schedule rule and instead do the reordering directly in the postprocessor, as this does not explode the design space.

@cgerum Thanks for asking! This is a really valuable question, and I’m happy to provide more information 🙂

As you suggested, the lack of conditionals leads to a hacky workaround that introduces two different design spaces, which is less than ideal. Alternatively, one trick that we use widely to match AutoScheduler performance is to move decision-making into postprocessors, as long as there is no randomness involved.

In terms of loop extent filtering only, we might want to introduce sch.reject_if so that the interpreter can reject bad samples early.

However, in the most generic case, introducing control flow is unavoidable, whether structured or unstructured; either could lead to a proper design. On the other hand, we will need to think twice about potential ways to serialize the DSL with minimal impact on backward compatibility 🙂

In autotvm, we can tune over a binary decision like this:

Is this the same “control flow” problem we are talking about? I thought sample_categorical might do the job, but I cannot make a binary decision based on the value of this variable in my Python script.

If the example I gave above is the same as the “control flow” problem you brought up, there are many use cases. I’m porting my TE VNNI batch_matmul implementation to TIR and hit this issue. cc @junrushao

Yes, I agree there should be proper conditional support in the scheduling DSL. Would it make sense to open a ticket (TVM or TVM-RFC) to better track conditional support in the scheduling DSL?