[RFC] TensorIR: A schedulable IR for TVM

Thanks for your reply! @MinminSun

The cache_read/cache_write API accepts a Buffer and a new scope as input, performs some checks to ensure that reading/writing the Buffer through a cache causes no problems, and creates new blocks to do the cache transfer.
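As a minimal sketch (the TensorIR schedule API was still WIP at this point, so the names and signatures below are assumptions rather than the final interface), the usage could look roughly like:

s = tir.create_schedule(original_func)
# Hypothetical: A and C are the Buffer objects of original_func.
# cache_read/cache_write insert new cache-transfer blocks and return handles to them.
AA = s.cache_read(A, "shared")
CC = s.cache_write(C, "local")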

Thanks for this RFC, I think it's a great idea and will help solve a number of issues I've been facing recently. I'm particularly interested in what 'tensorize' will look like for this new IR. Could you give a snippet as an example?

I'm also interested in how this will interact with the loop partition pass. Will it mean that each partitioned loop will then be individually schedulable?

Thank you for your interest.

Tensorize in TensorIR is completely different from the TE one. In TensorIR, we use two functions (desc_func and intrin_func) to define an intrinsic. Here is an example of an intrinsic (note that TensorIR is still WIP, so the API may change).

@tvm.hybrid.script
def desc_func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    A = tir.match_buffer(a, [16, 16])
    B = tir.match_buffer(b, [16, 16])
    C = tir.match_buffer(c, [16, 16])

    with tir.block([16, 16, tir.reduce_axis(0, 16)], "root") as [vi, vj, vk]:
        for i, j, k in tir.grid(16, 16, 16):
            with tir.block([16, 16, tir.reduce_axis(0, 16)], "update") as [vii, vjj, vkk]:
                tir.bind(vii, vi + i)
                tir.bind(vjj, vj + j)
                tir.bind(vkk, vk + k)
                C[vii, vjj] = C[vii, vjj] + A[vii, vkk] * B[vjj, vkk]


@tvm.hybrid.script
def intrin_func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    A = tir.match_buffer(a, [16, 16])
    B = tir.match_buffer(b, [16, 16])
    C = tir.match_buffer(c, [16, 16])

    with tir.block([16, 16, tir.reduce_axis(0, 16)], "root") as [vi, vj, vk]:
        # elem_offset // 256 converts the buffer's element offset into a
        # 16x16 (= 256-element) fragment index expected by tvm_mma_sync.
        tir.evaluate(tir.tvm_mma_sync(C.data, C.elem_offset // 256,
                                      A.data, A.elem_offset // 256,
                                      B.data, B.elem_offset // 256,
                                      C.data, C.elem_offset // 256,
                                      dtype="handle"))

Tensorize will match the sub-AST (usually a block) against the desc_func, and then replace it with the intrin_func.
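As a rough sketch (the registration object and the exact tensorize signature below are assumptions, since the WIP API may differ), applying it to a blocked matmul could look like:

s = tir.create_schedule(matmul_func)
update = s.get_block("update")
i, j, k = s.get_axes(update)
io, ii = s.split(i, 16)
jo, ji = s.split(j, 16)
ko, ki = s.split(k, 16)
s.reorder(io, jo, ko, ii, ji, ki)
# Hypothetical call: match the 16x16x16 inner nest against desc_func
# and replace it with intrin_func.
s.tensorize(ii, tir.TensorIntrin(desc_func, intrin_func))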

TensorIR is at the schedule level and has no coupling with low-level passes. However, you can schedule each loop directly and add whatever primitives you want. :slight_smile:


Thanks for this explanation. I'm interested in whether it might be possible to match tensor intrinsics of variable size. For example, Arm SVE introduces vector instructions of variable size.

Technically, it should be supported. However, due to time constraints, we have not supported it yet.

Thanks for the proposal! Looks quite interesting!

Out of curiosity,

  1. In the concat example you've shown, the original stage is represented as three blocks that seem to assign to the same buffer. I'm curious: if we want to move the concat (using compute_at, if possible?) to a consumer of the concat's output (to some loop of the consumer), how could it be done? Will it create multiple blocks there as well?

  2. Since the proposed TensorIR enables scoping of scheduling transformations in terms of blocks, will there be a prospect of representing a full relay graph in TensorIR?

For concat, we could introduce a reverse inlining primitive that inlines elementwise operations (after the concat) back into the concat, which should be helpful in many cases.
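A minimal sketch of what that could look like (the primitive name reverse_compute_inline and the block name here are hypothetical, just to illustrate the idea):

s = tir.create_schedule(func_with_concat)
# Hypothetical: inline the elementwise consumer back into the concat blocks,
# instead of moving the concat forward with compute_at.
ewise = s.get_block("elementwise_after_concat")
s.reverse_compute_inline(ewise)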

While it is possible to represent a full graph, we would still imagine relay being super useful as a coarse-grained repr for graph-level opt. So that would suggest a continued effort on a multi-level repr (relay and tir).


Thanks for the clarification! I concur that such a primitive should be useful and would allow more flexible compute movements.

Regarding the full graph, I agree that relay (along with its optimizations) is very useful. I was wondering whether there would be a benefit to lowering the full graph to TensorIR after relay optimization rather than lowering each primitive function. I guess this has to do with how AutoTVM/Ansor will allow the exploration of schedules, but I have a feeling the exploration could be scoped via the "blocks", which would otherwise lead to an explosion of the search space. (Looking from an AoT angle here.)

Moreover, maybe that could lay a foundation for inter-primitive-function optimizations later.


This is the right way to go. However, I have two concerns:

  1. How do we fuse ops as much as possible? Basically, fusion is the copy-propagation optimization in compilers, which is based on data-flow analysis, but TVM still lacks such program analysis today.
  2. TE tensorize cannot handle some complex pattern matching, see https://github.com/apache/incubator-tvm/pull/1053. Can we do 100% pattern matching in tir?

@xqdan Thank you for the valuable feedback! Fusion can be done automatically with some analysis provided in Ansor.

Do you have any other kind of analysis in mind that might be potentially useful?


Is fusion in Ansor based on tir? For other transforms, you may check out here; that's what we've done in AKG. I can explain some if you are interested.


@junrushao It's better to know whether loops can be vectorized, permutable, or distributed; isl can provide this information, so we can do loop optimization and tensorization/vectorization automatically.

@xqdan In Ansor, fusion analysis is handled in TE with some straightforward heuristics, which I believe have covered our use cases. CC: @merrymercy @jcf94

Agree that ISL provides effective information about vectorization, and I believe there might be other competitive heuristics too. Tensorization is a more general topic that would be super interesting to explore :slight_smile:

How is the compilation speed compared to the original TE? In Ansor/Autotvm, we have to compile a lot of schedules for feature extraction, so the speed of schedule transformation matters.

Do you have any benchmark results? Intuitively, I think the original TE will be faster because it can do a batched bound inference and AST construction. If it is true, how can we fix this performance gap?

@merrymercy I didn't get the point about batched bound inference. Doesn't Ansor use a pool of threads for massive bound inference?

Er… @junrushao I guess @merrymercy's opinion is that doing analysis in TE is quicker than using ISL.

ISL is surely a powerful tool for loop analysis, but in my understanding we would have to lower the schedule to C code first before using ISL, which I think is more time-consuming.

Currently, Ansor applies some simple but useful analyses based on TE. They may not be as accurate as ISL, but they're cheap. Then we count on the tuning to try lots of uncertain schedules and find the best one by measuring.

@jcf94 @junrushao Sorry, neither of you understood my question correctly.

I mean that the original TE is a declarative language, so it knows all the transformations before it starts to generate the low-level AST, whereas the new schedule primitives are applied imperatively. In the original TE, we can share some analysis results (e.g. dependency analysis), so it is expected to be faster.
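For illustration (using the existing TE API; the workload itself is just an arbitrary matmul example), a TE schedule only records primitives up front, and bound inference plus AST construction happen once, in a batch, at lowering time:

import tvm
from tvm import te

A = te.placeholder((1024, 1024), name="A")
B = te.placeholder((1024, 1024), name="B")
k = te.reduce_axis((0, 1024), name="k")
C = te.compute((1024, 1024),
               lambda i, j: te.sum(A[i, k] * B[j, k], axis=k), name="C")

s = te.create_schedule(C.op)
io, ii = s[C].split(C.op.axis[0], factor=32)   # primitives only record intent here
jo, ji = s[C].split(C.op.axis[1], factor=32)
s[C].reorder(io, jo, ii, ji)
mod = tvm.lower(s, [A, B, C])                  # analysis happens here, in one batch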


@merrymercy Good question! Here's an example of TIR's schedule.

s = tir.create_schedule(original_func)

update = s.get_block("C")          # look up the block named "C"
i, j, k = s.get_axes(update)       # the loops surrounding the block
i_o, i_i = s.split(i, bn)
j_o, j_i = s.split(j, bn)
k_o, k_i = s.split(k, 4)
s.reorder(i_o, j_o, k_o, k_i, i_i, j_i)

TIR's schedule is not totally stateless. Scope info and the dependency graph are actively maintained during the scheduling process in the Schedule class; we don't recompute them each time we apply a new primitive. After lowering to TIR without blocks, we no longer maintain this info, since the IR is no longer schedulable.

All in all, it would be good to run a benchmark to compare them in practice. I hope I understood your question correctly. :smile:


When I read this RFC, I was confused because I read that the compilation flow currently goes from

TF/PyTorch/ONNX -> Relay -> TE (based on TE schedule) -> TIR -> C++/CUDA

And then I read:

TensorIR is a brand new low-level IR with full scheduling support. Here are some […]

And then I see references to TIR throughout the rest of the RFC, where it is unclear to me whether they refer to the new TensorIR or the old TIR.

Can we clarify both here and in the code base moving forward whether we are referring to TensorIR or the old TIR? I think there are two ways of doing this going forward and we should explicitly pick one:

  1. Have TIR refer to the old version of TIR, and always use TensorIR when talking about the new IR.
  2. Be clear that TensorIR moving forward will be replacing the old TIR, so that any reference to TIR could be referring to the new or old version (and should be clarified if not obvious from context).

Someone more knowledgeable than me should pick one of those (or correct me and/or point out other options if I've gotten anything wrong here).


Yes, the ambiguity is something I was struggling with too when having a conversation. May I ask what the "T" in the old TIR stands for? TVM?