[RFC] Meta Schedule (AutoTensorIR)

Looks great! I have used Ansor for some time; it is quite good at auto-scheduling, but tuning is very slow (more than an hour for a compute-intensive op such as Conv2D or Conv2DBackpropInput). Ansor does generate faster Conv2D and Conv2DBackpropInput kernels than the corresponding TensorFlow ops, but by less than 50% for most models I have tuned, and it rarely generates Conv2D faster than TensorRT. So, for this new scheduling work, will you improve the tuning speed, as in the work published by Facebook (https://proceedings.mlsys.org/paper/2021/file/73278a4a86960eeb576a8fd4c9ec6997-Paper.pdf)? And will it generate faster programs than Ansor, and faster than framework ops such as TensorFlow's, with higher probability?

Thanks @luchangli for asking!

Tuning speed and kernel performance could be improved in several directions, and I believe our system paves the way for them:

  • faster search infrastructure
  • a better cost model or search algorithm
  • more schedule primitives, such as software pipelining and tensorization

Looking forward to having Meta Schedule in mainline soon!

Should we be using the formal RFC process for this? (Submitting this RFC as a PR to the tvm-rfcs repo.)

Yes, we should. I believe this thread is mainly for pre-RFC discussions.

Hi @junrushao, can you write this up as a pull request for the TVM RFCs repository?

@hogepodge Hey, I created an RFC in the official TVM-RFC repo: [RFC] Meta Schedule (AutoTensorIR) by junrushao1994 · Pull Request #5 · apache/tvm-rfcs · GitHub, but I'm not sure it's in the desired format. Please feel free to step in, edit directly, or request changes :)

My apologies if I’m missing something simple, or if this is the incorrect place to ask this question.

I am trying to understand the current status of the integration of meta schedule in the main TVM branch. Based on my understanding of the RFC tracking issue, is it correct to say that the current implementation in the main branch cannot yet support the end-to-end tuning of a network via meta schedule (e.g., an example similar to this, but using meta schedule)?

Thanks!

@jkosaian We do have end-to-end tuning working on our local branches, but we will need a couple of weeks to upstream them. Please follow @zxybazh's work closely :)

Yeah, thanks for paying close attention. The main framework of meta schedule has been upstreamed, and we are working on implementations of the concrete classes to make it complete soon.

The RFC states the following as an "unresolved question":

Control Flow

The meta schedule DSL does not support control flow yet. Although there are no reports of real-world use cases at the time of writing, it is possible that control flow will appear in some future workloads. The best syntax for it has not been determined yet, but a working example could be TensorFlow's tf.cond.
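For context, tf.cond is TensorFlow's functional conditional: it takes a predicate and two zero-argument callables, and only the selected branch is evaluated. Here is a minimal pure-Python sketch of that calling convention (an illustration of the style only, not TensorFlow's actual implementation):

```python
def cond(pred, true_fn, false_fn):
    """Functional conditional in the style of tf.cond: both branches are
    passed as thunks, and only the selected one is ever evaluated."""
    return true_fn() if pred else false_fn()

# Only the chosen thunk runs; the other is never called.
result = cond(3 > 5, lambda: "tiled", lambda: "untiled")  # → "untiled"
```

A scheduling DSL adopting this style would pass schedule-building closures as the two branches, so the interpreter can record which arm was taken.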

Do you mean something like this:

    # Designate a set of tile sizes
    i_tiles = [16, 8, 8, 8]
    j_tiles = [16, 8, 8, 8]
    k_tiles = [256, 8]

    # Tile the loops according to the tile sizes
    i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
    j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
    k_0, k_1           = sch.split(loop=k, factors=k_tiles)

    # Is this what you mean by control flow not yet supported?
    sch.cond(i_0 > 5,
        sch.reorder(i_0, j_0, i_1, j_1, k_0, i_2, j_2, k_1, i_3, j_3),  # true_fn
        sch.reorder(i_3, j_0, i_2, j_3, k_0, i_1, j_2, k_1, i_0, j_1),  # false_fn
    )

CC: @junrushao, @aca88 @paulpb, @cgerum, @SebastianBoblestETAS

@MJKlaiber Thanks for the discussion! Control flow is definitely an interesting topic, even if there are no real-world reports yet :)

Yes, that’s what I mean for “no control flow support”. However, note that control flow brought up new challenges on expressing the branching logic, e.g. do we express it via CPS or sub-traces or CFG, etc, which left some room for discussion

Don't we already have the TIR If node?

Isn't @MJKlaiber's example about conditional application of a scheduling primitive, rather than conditional computation?

i.e. I would expect his code to generate the TensorIR of one of the reordered loop nests, not of both.

@masahi Note that MetaSchedule isn't part of TIR: it's a scheduling DSL that manipulates TIR. Therefore, the If node that exists in TIR can't be reused for the MetaSchedule DSL, because they are two different things (an IR vs. a scheduling DSL).

@aca88 It would be interesting to think about a schedule primitive that generates two chunks of TIR dispatched at runtime, although my original intent was to use "control flow" to refer to schedule-DSL-level control flow rather than TIR-level control flow.

I think @aca88 and @MJKlaiber are mainly interested in Schedule DSL-level control flow.

As @MJKlaiber's example does not contain any sampling, I assume it should work fine as is, but the problem arises with the following example:

from typing import List

from tvm import tir
from tvm.meta_schedule.schedule_rule import PyScheduleRule


class MyScheduleRule(PyScheduleRule):
    def initialize_with_tune_context(self, context: "TuneContext") -> None:
        pass

    def apply(self, sch: tir.Schedule, block: tir.schedule.BlockRV) -> List[tir.Schedule]:
        if len(sch.get_loops(block)) != 3:
            return [sch]
        i, j, k = sch.get_loops(block)
        factors = sch.sample_perfect_tile(i, n=2)
        i_0, i_1 = sch.split(i, factors=factors)

        sch1 = sch.copy()
        i_0, i_1, j, k = sch1.get_loops(block)
        sch1.reorder(i_0, j, k, i_1)

        sch2 = sch.copy()
        i_0, i_1, j, k = sch2.get_loops(block)
        sch2.reorder(i_0, j, i_1, k)

        return [sch1, sch2]

Without support for conditional control flow, generating the two schedule templates seems unavoidable. A possible workaround might be to add the constraints as block annotations:

class MyScheduleRule(PyScheduleRule):
    def initialize_with_tune_context(self, context: "TuneContext") -> None:
        pass

    def apply(self, sch: tir.Schedule, block: tir.schedule.BlockRV) -> List[tir.Schedule]:
        if len(sch.get_loops(block)) != 3:
            return [sch]
        i, j, k = sch.get_loops(block)
        factors = sch.sample_perfect_tile(i, n=2)
        i_0, i_1 = sch.split(i, factors=factors)

        sch1 = sch.copy()
        i_0, i_1, j, k = sch1.get_loops(block)
        sch1.reorder(i_0, j, k, i_1)
        cond_block = sch1.blockize(i_1)
        sch1.annotate(cond_block, "extent_gt", 8)

        sch2 = sch.copy()
        i_0, i_1, j, k = sch2.get_loops(block)
        sch2.reorder(i_0, j, i_1, k)
        cond_block = sch2.blockize(i_1)
        sch2.annotate(cond_block, "extent_le", 8)

        return [sch1, sch2]

And filtering the constraints in a postprocessor:

from tvm.meta_schedule.postproc import PyPostproc


class MyPostProc(PyPostproc):
    def initialize_with_tune_context(self, context: "TuneContext") -> None:
        pass

    def apply(self, sch: tir.Schedule) -> bool:
        mod = sch.mod
        print(mod)

        for inst in sch.trace.insts:
            if inst.kind.name == "Annotate":
                if "extent_le" in inst.attrs or "extent_gt" in inst.attrs:
                    block = sch.get_sref(inst.inputs[0]).stmt
                    cond = inst.attrs[0]
                    bound = inst.inputs[1]
                    extent = block.body.extent
                    print(cond, bound)
                    if "extent_le" in inst.attrs:
                        if extent > bound:
                            print("Rejected!")
                            return False
                    elif "extent_gt" in inst.attrs:
                        if extent <= bound:
                            print("Rejected!")
                            return False

        return True

If the constraints are mutually exclusive, it is probably better not to do any loop reordering in the schedule rule and instead do it directly in the postprocessor, as this does not explode the design space.
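The design-space concern can be quantified with back-of-the-envelope arithmetic: every rule that returns several candidate schedules multiplies the number of traces to explore, while a decision folded into a postprocessor does not. A quick illustrative helper (not TVM code):

```python
def space_size(n_seeds: int, fanouts: list) -> int:
    """Each rule that emits multiple schedules multiplies the design space."""
    size = n_seeds
    for fanout in fanouts:
        size *= fanout
    return size

# Three rules that each return two variants turn 10 seed schedules into 80
# candidates; making the same choices in a postprocessor leaves it at 10.
space_size(10, [2, 2, 2])  # → 80
```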

@cgerum Thanks for asking! This is a really valuable question, and I'm happy to provide more information :)

As you suggested, the lack of conditionals leads to a hacky workaround that introduces two different design spaces, which is less than ideal. Alternatively, one trick that we use widely to match AutoScheduler performance is to move decision-making into postprocessors whenever there is no randomness involved.

In terms of loop-extent filtering only, we might want to introduce sch.reject_if so that the interpreter can reject bad samples early.

However, in the most general case, introducing control flow is unavoidable, whether structured or unstructured; either could lead to a proper design. On the other hand, we will need to think twice about how to serialize the DSL with minimal changes, for the sake of backward compatibility :)
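To sketch what sch.reject_if could mean (the name and semantics here are assumptions, since no such primitive exists at the time of writing): the interpreter would abort replay of a trace as soon as the predicate holds, so a bad sample is dropped before any further, possibly expensive, primitives run. In plain Python:

```python
class Rejected(Exception):
    """Raised by reject_if to abandon the current trace replay."""

def reject_if(pred, reason=""):
    # Hypothetical primitive: abort as soon as a constraint is violated.
    if pred:
        raise Rejected(reason)

def replay(trace_fn):
    """Run a schedule-building function; return None if it self-rejects."""
    try:
        return trace_fn()
    except Rejected:
        return None

def build(extent):
    reject_if(extent > 8, "inner extent too large")
    return f"schedule(extent={extent})"

replay(lambda: build(4))   # → "schedule(extent=4)"
replay(lambda: build(16))  # → None (rejected early)
```

Compared with the annotation-plus-postprocessor workaround above, the rejection happens during replay rather than after the whole schedule has been built.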

In autotvm, we can tune over a binary decision like this:

Is this the same "control flow" problem we are talking about? I thought sample_categorical might do the job, but I cannot make a binary decision based on the value of this variable in my Python script.

If the example I gave above is the same as the "control flow" problem you brought up, there are many use cases. I'm porting my TE VNNI batch_matmul implementation to TIR and hit this issue. cc @junrushao
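To illustrate why an ordinary Python `if` doesn't work here: sample_categorical returns a symbolic random variable that is recorded in the trace, and its concrete value is only drawn when the tuner samples the trace, after the user script has already finished running. A toy model of that two-phase behavior (assumed semantics for illustration, not TVM's implementation):

```python
import random

class ExprRV:
    """Symbolic handle: the decision is recorded, not yet a Python value."""
    def __init__(self, candidates, probs):
        self.candidates = candidates
        self.probs = probs
        self.value = None  # only known once the trace is sampled

def sample_categorical(candidates, probs):
    # Scripting time: just record the random variable in the trace.
    return ExprRV(candidates, probs)

def materialize(rv, rng=random):
    # Tuning time: the tuner, not the user script, draws the concrete value.
    rv.value = rng.choices(rv.candidates, weights=rv.probs, k=1)[0]
    return rv.value

rv = sample_categorical([0, 1], probs=[0.5, 0.5])
# `if rv.value: ...` at scripting time is meaningless: rv.value is still None
# here, which is exactly why DSL-level control flow would be needed.
```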

Yes, I agree there should be proper conditional support in the scheduling DSL. Would it make sense to open a ticket (in TVM or TVM-RFC) to better track conditional support in the scheduling DSL?

Hi there, how do I apply the best schedule that meta schedule found?

I’m now trying to apply like below:

    database = ms.database.JSONDatabase(
        f"{workdir}/database_workload.json",
        f"{workdir}/database_tuning_record.json",
    )
    with ms.ApplyHistoryBest(database):
        with tvm.transform.PassContext(
            opt_level=3,
            config={"relay.backend.use_meta_schedule": True, "tir.predicate_opt": True},
        ):
            mod = tvm.build(mod, target="cuda")
            print(mod.imported_modules[0].get_source())

but it seems no schedule was applied to the IRModule.