[RFC] TensorIR: A schedulable IR for TVM

So the scenario is: you can choose either TE or TIR to write a compute, but if you choose TE, you have to first lower it to TIR and then apply schedule primitives?

IIUC, this is nontrivial, because the TIR was not written by a human, and you may need to print it out first to figure out how to schedule it. It sounds more straightforward to keep the TE schedule as syntactic sugar: at least you can get a sense of what the schedule looks like by tracing the Python code.


Because there is a 1-1 mapping between te.Stage and Block, it should actually not be hard to use the tir schedule to schedule a PrimFunc generated from a te compute (either by getting a block via its name, or by traversing the blocks programmatically, as we do programmatically with stages). But I agree that we can keep te.schedule for a bit.
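
Roughly, a sketch of how that could look. The API is still WIP, so names like te.create_prim_func and sch.get_block below are illustrative rather than final:

import tvm
from tvm import te, tir

# Write the compute in TE as usual.
A = te.placeholder((128, 128), name="A")
B = te.compute((128, 128), lambda i, j: A[i, j] * 2.0, name="B")

# Convert the TE compute into a TIR PrimFunc and schedule it with the
# block-based TIR schedule. Since each te.Stage maps 1-1 to a Block,
# the stage name "B" carries over and the block can be fetched by name.
func = te.create_prim_func([A, B])   # illustrative converter name
sch = tir.Schedule(func)
block = sch.get_block("B")
i, j = sch.get_loops(block)
io, ii = sch.split(i, factors=[None, 16])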

Thanks for the clarification. Makes sense to me.

Thanks for the proposal! Just curious about schedule primitives like cache_write and cache_read, since there are no stages in TensorIR.

Thanks for your reply! @MinminSun

The cache_read/cache_write API accepts a Buffer and a new scope as input, does some checks to ensure that reading/writing the Buffer through a cache causes no problems, and creates new blocks that perform the cache transfer.
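
As a minimal sketch (assuming a schedule object sch and buffers A and C from the workload; the exact signature is still WIP):

# Hypothetical usage, following the description above: pass the buffer to
# be cached and the new scope; the primitive runs the safety checks and
# creates a new block that performs the cache transfer.
AA = sch.cache_read(A, scope="shared")    # new block copies A -> A.shared
CC = sch.cache_write(C, scope="local")    # new block copies C.local -> C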

Thanks for this RFC, I think it’s a great idea and will help solve a number of issues I’ve been facing recently. I’m particularly interested in what ‘tensorize’ will look like for this new IR. Could you give a snippet as an example?

I’m also interested in how this will interact with the loop partition pass. Will this mean that each partitioned loop will then be individually schedulable?

Thank you for your interest.

Tensorize in TensorIR is completely different from the TE one. In TensorIR, we use two functions (desc_func and intrin_func) to define an intrinsic. Here is an example of an intrinsic (note that TensorIR is still a WIP, so the API may change).

@tvm.hybrid.script
def desc_func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    A = tir.match_buffer(a, [16, 16])
    B = tir.match_buffer(b, [16, 16])
    C = tir.match_buffer(c, [16, 16])

    with tir.block([16, 16, tir.reduce_axis(0, 16)], "root") as [vi, vj, vk]:
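        # Description side: a 16x16x16 matmul update pattern that
        # tensorize will try to structurally match against a sub-AST.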
        for i, j, k in tir.grid(16, 16, 16):
            with tir.block([16, 16, tir.reduce_axis(0, 16)], "update") as [vii, vjj, vkk]:
                tir.bind(vii, vi + i)
                tir.bind(vjj, vj + j)
                tir.bind(vkk, vk + k)
                C[vii, vjj] = C[vii, vjj] + A[vii, vkk] * B[vjj, vkk]


@tvm.hybrid.script
def intrin_func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    A = tir.match_buffer(a, [16, 16])
    B = tir.match_buffer(b, [16, 16])
    C = tir.match_buffer(c, [16, 16])

    with tir.block([16, 16, tir.reduce_axis(0, 16)], "root") as [vi, vj, vk]:
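        # Implementation side: the whole matched region is replaced by a
        # single tvm_mma_sync call; a 16x16 tile holds 256 elements, so
        # elem_offset // 256 gives the index of the tile (fragment).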
        tir.evaluate(tir.tvm_mma_sync(C.data, C.elem_offset // 256,
                                      A.data, A.elem_offset // 256,
                                      B.data, B.elem_offset // 256,
                                      C.data, C.elem_offset // 256,
                                      dtype="handle"))

Tensorize will match the sub-AST (usually a block) against desc_func, and then replace it with intrin_func.
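
As a hedged sketch of the flow (the registration and scheduling API is still WIP, so the names below are assumptions):

# Register the (description, implementation) pair as a tensor intrinsic,
# then ask the schedule to match and rewrite a block.
tir.TensorIntrin.register("mma_16x16x16", desc_func, intrin_func)

update = sch.get_block("update")
# Match the sub-AST rooted at the update block against desc_func; on
# success, replace the region with the body of intrin_func.
sch.tensorize(update, "mma_16x16x16")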

TensorIR is at the schedule level and has no coupling with low-level passes. However, we can schedule each loop directly and add primitives as you want. :slight_smile:


Thanks for this explanation. I’m interested in whether it might be possible to match tensor intrinsics of variable size. For example, Arm SVE introduces vector instructions of variable size.

Technically, it should be supported. However, due to time constraints, we have not implemented it yet.

Thanks for the proposal! Looks quite interesting!

Out of curiosity,

  1. In the concat example you’ve shown, the original stage is represented as three blocks that seem to assign to the same buffer. I’m curious: if we want to move the concat (using compute_at, if possible) to a consumer of the concat’s output (to some loop of the consumer), how could it be done? Will it create multiple blocks there as well?

  2. Since the proposed TensorIR enables scoping of scheduling transformations in terms of blocks, is there a prospect of representing a full Relay graph in TensorIR?

For concat, we could introduce a reverse-inlining primitive that inlines elementwise operations (after the concat) back into the concat, which should be helpful in many cases.
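
For illustration, if such a primitive were called reverse_compute_inline (name hypothetical here), it could look like:

# Instead of moving the concat into its consumer, fold the elementwise
# consumer back into each of the blocks that write the concat output.
consumer = sch.get_block("add")        # some elementwise op after the concat
sch.reverse_compute_inline(consumer)   # inline it back into the concat blocks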

While it is possible to represent a full graph, we would still imagine Relay being super useful as a coarse-grained representation for graph-level optimization. So that suggests a continued effort on the multi-level representation (Relay and TIR).


Thanks for the clarification! I concur that such a primitive should be useful and would allow more flexible compute movements.

Regarding the full graph, I agree that Relay (along with its optimizations) is very useful. I was wondering whether there would be a benefit to lowering the full graph to TensorIR after Relay optimization, rather than lowering each primitive function separately. I guess this has to do with how AutoTVM/Ansor will explore schedules, but I have a feeling the search could be scoped via the “blocks”, which would otherwise lead to an explosion of the search space. (Looking from an AoT angle here.)

Moreover, maybe that could lay a foundation for inter-primitive-function optimizations later.


This is the right way to go. However, I have two concerns:

  1. How do we fuse ops as much as possible? Basically, fusion is the copy-propagation optimization in compilers, which is based on data-flow analysis, but TVM still lacks this kind of program analysis.
  2. TE tensorize cannot handle some complex pattern matching; see https://github.com/apache/incubator-tvm/pull/1053. Can we do 100% pattern matching in tir?

@xqdan Thank you for the valuable feedback! Fusion can be done automatically with some analysis provided in Ansor.

Do you have any other kind of analysis in mind that might be potentially useful?


Is fusion in Ansor based on tir? For other transforms, you may check out here; that’s what we’ve done in AKG. I can explain some of it if you are interested.


@junrushao It’s better to know which loops are vectorizable, permutable, or distributable; isl can provide this information, so we can do loop optimization and tensorization/vectorization automatically.

@xqdan In Ansor, fusion analysis is handled in TE with some straightforward heuristics, which I believe have covered our use cases. CC: @merrymercy @jcf94

Agree that ISL provides effective information about vectorization, and I believe there might be other competitive heuristics too. Tensorization is a more general topic that would be super interesting to explore :slight_smile:

How is the compilation speed compared to the original TE? In Ansor/Autotvm, we have to compile a lot of schedules for feature extraction, so the speed of schedule transformation matters.

Do you have any benchmark results? Intuitively, I think the original TE will be faster because it can do batched bound inference and AST construction. If that is true, how can we fix this performance gap?

@merrymercy I didn’t quite get the point about batched bound inference. Doesn’t Ansor use a pool of threads for massive bound inference?

Er… @junrushao I guess @merrymercy’s opinion is that doing analysis in TE is quicker than using ISL.

ISL is surely a powerful tool for loop analysis, but in my understanding we would have to lower the schedule to C code first before using ISL, which I think is more time-consuming.

Currently, Ansor applies some simple but useful analyses based on TE. Though they may not be as accurate as ISL, they are cheap. Then we count on tuning to try lots of uncertain schedules and find the best one by measurement.