[RFC] TensorIR: A schedulable IR for TVM

aca88 · September 11, 2020, 5:36am

Hi,

Even though I don’t think I understood everything, I like the idea of solving some of the limitations of te.compute. Since te.compute is a central part of the TVM stack, changing it requires a lot of work and understanding. So thank you all for continuing this development.

Q1: I was wondering how this fits into the Relay -> Topi -> TE -> TIR flow. As a more specific case, take FuseOps. AFAIK the FuseOps pass creates the multi-stage operator based on the te.computes. Since you mentioned that there would be no notion of a stage in the new TensorIR, how would FuseOps work? And more generally, how would TVM’s philosophy of “defining a compute rule and a separate schedule” change?

Q2: What exactly do you mean by “not all program can be scheduled”? Could you give an example?

Q3: You mentioned “new scheduling primitives”, could you maybe give a list?

EDIT: Q4: Expected timeline of the steps?

ds1231h · September 11, 2020, 5:36am

Well-received with thanks!

Hzfengsy · September 11, 2020, 6:08am

Thank you for your interest.

A1: Current op fusing is based on stages, but the critical point is fusing the injective computation. We can also inline injective computation with traverse_inline, so there is no doubt that FuseOps will still work. As for the philosophy, I think there are only a few changes. TIR is not only an IR but can also serve as a computation declaration. We provide a very user-friendly API (as easy as TE) to define compute rules.

A2: TIR is a general IR which can represent almost every program. It’s really hard to schedule a general program, but we promise TIR can schedule all programs which TE can schedule.

A3: Most of the primitives are similar to the TE ones. For now, the only two new primitives are decompose_reduction and merge_reduction.

A4: Upstreaming is planned for October and November. Ansor support is WIP.

I hope I can answer your question.
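
As a rough illustration of A3, here is a plain-Python sketch (not the TensorIR API). It assumes that decompose_reduction separates a reduction's initialization from its update and that merge_reduction does the reverse, which this thread does not spell out. The function names are hypothetical.

```python
# Plain-Python illustration only; this does not use TVM or the TensorIR API.

def matmul_merged(A, B, n):
    """Reduction with init and update living in one loop nest (the 'merged' form)."""
    C = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if k == 0:              # init is folded into the reduction
                    C[i][j] = 0.0
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_decomposed(A, B, n):
    """Same computation with the init pulled out into its own loop nest."""
    C = [[None] * n for _ in range(n)]
    for i in range(n):                  # init part
        for j in range(n):
            C[i][j] = 0.0
    for i in range(n):                  # update part
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Both forms compute the same result; the decomposed form simply makes the init loops available to be scheduled separately from the update loops.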

kevinthesun · September 11, 2020, 7:04am

Thank you for this proposal! This work does make scheduling much easier. I have a concern about using this way to write a tensor expression. It looks more complicated than tvm.compute when defining matmul: we need to define some buffers and create blocks with the corresponding shape dimensions. It would be helpful if you could add a conv2d example that can replace the existing topi.nn.conv2d definition, to better understand what a developer would need to write.

Another question is about representing generic programming style ops such as shape functions. Since these programs don’t fit into tvm scheduling, I assume it would still be more convenient to use existing te hybrid script to create these ops?

spectrometerHBH · September 11, 2020, 1:29pm

Thanks for your reply! @kevinthesun

In the original TE programming model, we also have to declare buffers and create a lambda expression whose iter_vars have the correct shape dimensions. If we take this additional information into account, TE programming is close to Hybrid Script programming in complexity.

Currently, we cannot replace existing topi operators, since they are represented by Stage/Op and optimized by the TE schedule, while Hybrid Script will be parsed into TIR directly.

If we don’t have to schedule the PrimFunc, we don’t have to declare blocks in TIR. Actually, TE hybrid script is also, to a large extent, a text representation of TIR: loops and conditional statements directly represent the IR structure, and sugar such as translating a variable into an Array of size 1 eases the writing of Store & Load. At the moment, we haven’t introduced sugar to simplify such Load & Store, but the rest of the writing is largely similar.

tqchen · September 11, 2020, 3:36pm

TIR and TE do not conflict with each other. TE is still a useful DSL to stitch fragments of TIR together to form a PrimFunc.

We could still define a TE-based DSL (backed by TIR) that enables primitives like compute and hybrid calls to stitch together a dataflow graph to form a PrimFunc, and then use TIR for scheduling.

comaniac · September 11, 2020, 5:34pm

Thanks for the proposal! This definitely opens more opportunities for performance optimization. Two questions for clarification:

IIUC, based on the proposal and discussion, we will have both TE and TIR, but TE is more like a frontend wrapper of TIR to serve some users that prefer to write high-level DSL. Then, what will we do with the TE schedule primitives? Intuitively, we should still keep them; otherwise TE writers will have no way to schedule their computes, because they know nothing about TIR and blocks.

Does this proposal support dynamic shape (i.e., Any)? For example, can we have something like:

@tvm.hybrid.script
def matmul(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    C = tir.match_buffer(c, (1024, 1024), "float32")
    A = tir.match_buffer(a, (1024, Any), "float32")
    B = tir.match_buffer(b, (Any, 1024), "float32")
    reducer = tir.comm_reducer(lambda x, y: x + y, tir.float32(0))

    with tir.block([1024, 1024, tir.reduce_axis(0, 1024)], "C") as [vi, vj, vk]:
        reducer.step(C[vi, vj], A[vi, vk] * B[vk, vj])

s = tir.create_schedule(matmul)
update = s.get_block("C")
i, j, k = s.get_axes(update)
i_o, i_i = s.split(i, bn)
j_o, j_i = s.split(j, bn)
k_o, k_i = s.split(k, 4)

In this case, the length of vk (or k) is Any. Can we still apply split to it with a fixed factor?
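
To make the question concrete, here is a plain-Python sketch (not TVM code) of what a fixed-factor split means when the extent is only known at run time. The function name is hypothetical; the guard models the bounds check that is needed whenever the factor does not divide the extent.

```python
# Plain-Python illustration only; this does not use TVM or the TensorIR API.

def split_with_guard(extent, factor):
    """Visit 0..extent-1 through an outer/inner loop pair with a fixed inner factor."""
    visited = []
    outer_extent = (extent + factor - 1) // factor   # ceiling division
    for k_o in range(outer_extent):
        for k_i in range(factor):
            k = k_o * factor + k_i
            if k < extent:       # guard: skips the tail when factor does not divide extent
                visited.append(k)
    return visited
```

With an extent of 10 and a factor of 4, the outer loop runs three times and the guard drops the last two inner iterations, so every index is still visited exactly once.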

kevinthesun · September 11, 2020, 6:11pm

Thanks for the explanation. The relation between TE and the new TIR is now clearer to me.

kevinthesun · September 11, 2020, 6:33pm

Thanks for the clarification. It would be nice if we could use various methods to create tensor programs and use the new TIR to schedule them.

Hzfengsy · September 12, 2020, 1:36am

Good questions!

As far as we know, we would like to let users use the TensorIR schedule rather than the TE schedule once we fully upstream TensorIR, for three reasons:
1. Just as you mentioned, TE is a frontend wrapper, and it directly generates TIR with blocks. In a sense, TE is more like sugar for defining TIR.
2. Most of the schedules and primitives in TensorIR are very similar to those in TE. The cost of learning the TensorIR schedule is extremely low (maybe just one day).
3. All primitives are based on the block (there is no stage concept in the TensorIR schedule). It’s hard to keep the TE schedule alongside blocks.

Dynamic shapes are not supported now. However, thanks to our decoupled primitives, it should be fine to add support later.

kevinthesun · September 12, 2020, 1:42am

Would love to see dynamic shapes supported; otherwise a large set of models can’t be backed by the new TensorIR.

comaniac · September 12, 2020, 1:55am

So the scenario is like you can choose to use TE or TIR to write a compute, but if you choose TE, you have to first lower it to TIR and then add schedule primitives?

IIUC, it seems to me that this is nontrivial, because the TIR was not written by a human and you may need to first print it out to figure out how to schedule it. It sounds more straightforward to keep the TE schedule as syntactic sugar. At least you can get a sense of what the schedule looks like by tracing the Python code.

tqchen · September 12, 2020, 2:02am

Because there is a 1-1 mapping between te.Stage and Block, it should actually not be hard to use the TIR schedule to schedule a PrimFunc generated by te.compute (either by getting a block by name, or by programmatically traversing the blocks like we do programmatically on stages). But I agree that we can keep te.schedule for a bit.

comaniac · September 12, 2020, 3:21am

Thanks for the clarification. Makes sense to me.

MinminSun · September 14, 2020, 3:08am

Thanks for the proposal! Just curious about schedule primitives like cache_write and cache_read, since there are no stages in TensorIR.

spectrometerHBH · September 14, 2020, 4:01am

Thanks for your reply! @MinminSun

The cache_read/cache_write API accepts a Buffer and a new scope as input, performs some checks to ensure that reading/writing the Buffer through a cache causes no problems, and creates new blocks to do the cache transfer.
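
As a rough picture of the effect of a cache_read-style rewrite, here is a plain-Python sketch (not the TensorIR API). The function names and the A_local buffer are hypothetical, and the memory scope is only modeled as a separate list.

```python
# Plain-Python illustration only; this does not use TVM or the TensorIR API.

def consume_direct(A, n):
    """Consumer reads straight from the original buffer A."""
    return [A[i] * 2.0 for i in range(n)]


def consume_with_cache_read(A, n):
    """Same consumer after a cache_read-style rewrite."""
    A_local = [0.0] * n              # new buffer standing in for the new scope
    for i in range(n):               # newly created block: copy A into the cache
        A_local[i] = A[i]
    # The consumer now reads from the cache instead of the original buffer.
    return [A_local[i] * 2.0 for i in range(n)]
```

The result is unchanged; what changes is that the copy becomes its own block, which can then be scheduled like any other block.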

mbaret · September 15, 2020, 9:04am

Thanks for this RFC, I think it’s a great idea and will help solve a number of issues I’ve been facing recently. I’m particularly interested in what ‘tensorize’ will look like for this new IR. Could you give a snippet as an example?

I’m also interested in what the interaction of this will be with the loop partition pass. Will this mean that each partitioned loop will then be individually schedulable?

Hzfengsy · September 15, 2020, 9:53am

Thank you for your interest.

Tensorize in TensorIR is completely different from the one in TE. In TensorIR, we use two functions (desc_func and intrin_func) to define an intrinsic. Here is an example of an intrinsic (note that TensorIR is still WIP, so the API may change).

@tvm.hybrid.script
def desc_func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    A = tir.match_buffer(a, [16, 16])
    B = tir.match_buffer(b, [16, 16])
    C = tir.match_buffer(c, [16, 16])

    with tir.block([16, 16, tir.reduce_axis(0, 16)], "root") as [vi, vj, vk]:
        for i, j, k in tir.grid(16, 16, 16):
            with tir.block([16, 16, tir.reduce_axis(0, 16)], "update") as [vii, vjj, vkk]:
                tir.bind(vii, vi + i)
                tir.bind(vjj, vj + j)
                tir.bind(vkk, vk + k)
                C[vii, vjj] = C[vii, vjj] + A[vii, vkk] * B[vjj, vkk]


@tvm.hybrid.script
def intrin_func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    A = tir.match_buffer(a, [16, 16])
    B = tir.match_buffer(b, [16, 16])
    C = tir.match_buffer(c, [16, 16])

    with tir.block([16, 16, tir.reduce_axis(0, 16)], "root") as [vi, vj, vk]:
        tir.evaluate(tir.tvm_mma_sync(C.data, C.elem_offset // 256,
                                      A.data, A.elem_offset // 256,
                                      B.data, B.elem_offset // 256,
                                      C.data, C.elem_offset // 256,
                                      dtype="handle"))

Tensorize will match a sub-AST (usually a block) against the desc_func, and then replace it with the intrin_func.
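
The contract between the two functions can be sketched in plain Python (not the TensorIR API): the replacement is only valid if the intrinsic has the same effect as the loop nest it replaces. Here N = 4 is a small stand-in for the 16x16x16 tile, and intrin is a hypothetical software stand-in for the hardware instruction.

```python
# Plain-Python illustration only; this does not use TVM or the TensorIR API.

N = 4  # small stand-in for the 16x16x16 tile above


def desc(A, B, C):
    """Loop nest playing the role of desc_func: C[i][j] += A[i][k] * B[j][k]."""
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[j][k]


def intrin(A, B, C):
    """Hypothetical stand-in for the intrinsic: one opaque call with the same effect."""
    for i in range(N):
        for j in range(N):
            C[i][j] += sum(a * b for a, b in zip(A[i], B[j]))
```

Running both on the same inputs and comparing the outputs is a simple way to check that a description and its intrinsic agree.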

TensorIR works at the schedule level and is not coupled with low-level passes. However, we can schedule each loop directly and add primitives as you want.

mbaret · September 15, 2020, 10:06am

Thanks for this explanation. I’m wondering whether it might be possible to match tensor intrinsics of variable size. For example, Arm SVE introduces vector instructions of variable size.

Hzfengsy · September 15, 2020, 10:41am

Technically, it should be possible to support this. However, due to time constraints, we have not implemented it yet.