Thanks for your reply! @MinminSun
The cache_read/cache_write API accepts a Buffer and a new scope as input, does some checks to ensure it causes no problem to read/write the Buffer through the cache, and creates new blocks to do the cache transfer.
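A minimal usage sketch, assuming a schedule-level API along the lines of the one shown later in this thread (the names s, A, C and the scope strings are illustrative, not the final API):

# Hypothetical sketch; names and signatures are assumptions, not the final API.
s = tir.create_schedule(original_func)   # original_func contains a block named "C"
update = s.get_block("C")
# cache_read/cache_write take the Buffer to stage and the target scope,
# run the safety checks mentioned above, and return the new cache block.
A_shared = s.cache_read(A, "shared")     # stage reads of A through "shared" scope
C_local = s.cache_write(C, "local")      # stage writes of C through "local" scope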
Thanks for this RFC, I think it's a great idea and will help solve a number of issues I've been facing recently. I'm particularly interested in what "tensorize" will look like for this new IR. Could you give a snippet as an example?
I'm also interested in what the interaction of this will be with the loop partition pass. Will this mean that each partitioned loop will then be individually schedulable?
Thank you for your interest.
Tensorize in TensorIR is completely different from the TE one. In TensorIR, we use two functions (desc_func and intrin_func) to define an intrinsic. Here is an example of an intrinsic (note that TensorIR is still WIP, so the API may change).
@tvm.hybrid.script
def desc_func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    A = tir.match_buffer(a, [16, 16])
    B = tir.match_buffer(b, [16, 16])
    C = tir.match_buffer(c, [16, 16])
    with tir.block([16, 16, tir.reduce_axis(0, 16)], "root") as [vi, vj, vk]:
        for i, j, k in tir.grid(16, 16, 16):
            with tir.block([16, 16, tir.reduce_axis(0, 16)], "update") as [vii, vjj, vkk]:
                tir.bind(vii, vi + i)
                tir.bind(vjj, vj + j)
                tir.bind(vkk, vk + k)
                C[vii, vjj] = C[vii, vjj] + A[vii, vkk] * B[vjj, vkk]

@tvm.hybrid.script
def intrin_func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    A = tir.match_buffer(a, [16, 16])
    B = tir.match_buffer(b, [16, 16])
    C = tir.match_buffer(c, [16, 16])
    with tir.block([16, 16, tir.reduce_axis(0, 16)], "root") as [vi, vj, vk]:
        tir.evaluate(tir.tvm_mma_sync(C.data, C.elem_offset // 256,
                                      A.data, A.elem_offset // 256,
                                      B.data, B.elem_offset // 256,
                                      C.data, C.elem_offset // 256,
                                      dtype="handle"))
Tensorize will match the sub-AST (usually a block) against desc_func, and then replace it with intrin_func.
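As a rough sketch of how this might be driven from the schedule (the helper and signature below are assumptions, since the API is still WIP):

# Hypothetical sketch; tir.TensorIntrin and the tensorize signature are assumptions.
intrin = tir.TensorIntrin(desc_func, intrin_func)  # pair the description with its implementation
s = tir.create_schedule(matmul_func)               # matmul_func: the workload being scheduled
update = s.get_block("update")                     # the block to match against desc_func
s.tensorize(update, intrin)                        # replace the matched sub-AST with intrin_func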
TensorIR works at the schedule level and has no coupling with low-level passes. However, we can schedule each loop directly and apply primitives as you want.
Thanks for this explanation. I'm interested in whether it might be possible to match tensor intrinsics with variable size. For example, Arm SVE introduces vector instructions of variable size.
Technically, it should be supported. However, due to time constraints, we have not supported it yet.
Thanks for the proposal! Looks quite interesting!
Out of curiosity,
In the concat example you've shown, the original stage is represented as three blocks that seem to be assigning to the same buffer. I'm curious to know: if we want to move the concat (using compute_at, if possible?) to a consumer of the concat's output (to some loop of the consumer), how could it be done? Will it create multiple blocks there as well?
Since the proposed TensorIR enables scoping of scheduling transformations in terms of blocks, will there be a prospect of representing a full relay graph in TensorIR?
For concat, we could introduce a reverse inlining primitive that inlines elementwise operations (after the concat) back into the concat, which should be helpful in many cases.
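A hypothetical sketch of such a primitive (the name and signature below are assumptions, not an implemented API):

# Hypothetical sketch; the primitive name and signature are assumptions.
s = tir.create_schedule(func)       # func contains the concat blocks and an elementwise consumer
ewise = s.get_block("relu")         # an elementwise block consuming the concat output
s.reverse_compute_inline(ewise)     # inline it back into the blocks that produce the concat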
While it is possible to represent a full graph, we would still imagine relay being super useful as a coarse-grained repr for graph-level opt. So that would suggest continued effort on a multi-level repr (relay and tir).
Thanks for the clarification! I concur that such a primitive should be useful and would allow more flexible compute movements.
Regarding the full graph, I agree that relay (along with its optimizations) is very useful. I was thinking whether there would be a benefit in lowering the full graph to TensorIR post relay optimization rather than lowering each primitive function. I guess this has to do with how AutoTVM/Ansor will allow the exploration of schedules, but I have a feeling that exploration could be scoped via the "blocks", which would otherwise lead to an explosion of the search space. (Looking from an AoT angle here.)
Moreover, maybe that could lay a foundation for inter-primitive-function optimizations later.
This is the right way to go. However, I have two concerns,
@xqdan Thank you for the valuable feedback! Fusion can be done automatically with some analysis provided in Ansor.
Do you have any other kind of analysis in mind that might be potentially useful?
Is the fusion in Ansor based on TIR? For other transforms, you may check out here; that's what we've done in AKG. I can explain some if you are interested.
@junrushao It's better to know whether loops can be vectorized, permutable, or distributed; ISL can provide this information, so we can do loop optimization and tensorization/vectorization automatically.
@xqdan In Ansor, fusion analysis is handled in TE with some straightforward heuristics, which I believe have covered our use cases. CC: @merrymercy @jcf94
Agree that ISL provides effective information about vectorization, and I believe there might be other competitive heuristics too. Tensorization is a more general topic that would be super interesting to explore.
How is the compilation speed compared to the original TE? In Ansor/AutoTVM, we have to compile a lot of schedules for feature extraction, so the speed of schedule transformation matters.
Do you have any benchmark results? Intuitively, I think the original TE will be faster because it can do batched bound inference and AST construction. If that is true, how can we fix this performance gap?
@merrymercy I didn't get the point about batched bound inference; doesn't Ansor use a pool of threads for massive bound inference?
E… @junrushao I guess @merrymercy's opinion is that doing analysis in TE is quicker than using ISL.
ISL is surely a powerful tool for loop analysis, but in my understanding we would have to lower the schedule to C code first before using ISL, which I think is more time-consuming.
Currently, Ansor applies some simple but useful analyses based on TE. Though they may not be as accurate as ISL, they are cheap. Then we count on tuning to try lots of uncertain schedules and find the best one by measuring.
@jcf94 @junrushao Sorry, neither of you understood my question correctly.
I mean the original TE is a declarative language, so it can know all transformations before it starts to generate the low-level AST, while the new schedule primitives are applied imperatively. In the original TE, we can share some analysis results (e.g. dependency analysis), so it is expected to be faster.
@merrymercy Good question! Here's an example of TIR's schedule.
s = tir.create_schedule(original_func)
update = s.get_block("C")                # look up the block that computes C
i, j, k = s.get_axes(update)             # the loops surrounding that block
i_o, i_i = s.split(i, bn)
j_o, j_i = s.split(j, bn)
k_o, k_i = s.split(k, 4)
s.reorder(i_o, j_o, k_o, k_i, i_i, j_i)  # reorder the split loops for tiling
TIR's schedule is not totally stateless. Scope info and the dependency graph are actively maintained during the scheduling process in the Schedule class; we don't recalculate them each time we apply a new primitive. After lowering to TIR without blocks, we no longer maintain this info, since the IR is no longer schedulable.
All in all, it would be good to run benchmarks to compare them in practice. I hope I understood your question correctly.
When I read this RFC, I was confused because I read that the compilation flow currently goes from
TF/PyTorch/ONNX -> Relay -> TE (based on TE schedule) -> TIR -> C++/CUDA
And then I read:
TensorIR is a brand new low-level IR with full scheduling support. Here are some […]
And then I see references to TIR throughout the rest of the RFC, where it is unclear to me whether these references are to the new TensorIR or to the old TIR.
Can we clarify, both here and in the codebase moving forward, whether we are referring to TensorIR or to the old TIR? I think there are two ways of doing this going forward, and we should explicitly pick one:
1. Always have TIR refer to the old version of TIR, and always use TensorIR when talking about the new IR.
2. Let TensorIR become the new TIR, so that any reference to TIR could be referring to the new or old version (and should be clarified if not obvious from context).
Someone more knowledgeable than me should pick one of those (or correct me and/or point out other options if I've gotten anything wrong here).
Yes, the ambiguity is something I was struggling with too when having a conversation. May I ask what the "T" of the old TIR stands for? TVM?