Thanks for this explanation. I’m interested in whether it might be possible to match tensor intrinsics of variable size. For example, Arm SVE introduces vector instructions of variable length.
Technically, it should be supported. However, due to time constraints, we have not supported it yet.
Thanks for the proposal! Looks quite interesting!
Out of curiosity:
- In the concat example you’ve shown, the original stage is represented as three blocks that seem to assign to the same buffer. If we want to move the concat (using compute_at, if possible?) to a consumer of the concat’s output (to some loop of the consumer), how could it be done? Will it create multiple blocks there as well?
- Since the proposed TensorIR enables scoping of scheduling transformations in terms of blocks, will there be a prospect of representing a full Relay graph in TensorIR?
For concat, we could introduce a reverse inlining primitive that inlines elementwise operations (after the concat) back into the concat, which should be helpful in many cases.
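As a minimal sketch of how such a primitive might be used from the schedule side, assuming a hypothetical reverse_compute_inline primitive and a block naming scheme like the one in the RFC examples (none of this is a committed API):

```python
# func: a TensorIR function where an elementwise op "D" consumes the concat output,
# e.g. D[i] = concat_out[i] * 2.0
s = tir.create_schedule(func)     # schedule over the TensorIR function
elemwise = s.get_block("D")       # block computing the elementwise op after the concat

# A hypothetical "reverse inline" primitive would fold the elementwise computation
# back into each of the concat's producer blocks, so the intermediate concat buffer
# does not need to be materialized separately.
s.reverse_compute_inline(elemwise)
```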
While it is possible to represent a full graph, we would still imagine Relay being super useful as a coarse-grained representation for graph-level optimization. So that would suggest a continued effort on a multi-level representation (Relay and TIR).
Thanks for the clarification! I concur that such a primitive should be useful and would allow more flexible compute movements.
Regarding the full graph, I agree that Relay (along with its optimizations) is very useful. I was wondering whether there would be a benefit to lowering the full graph to TensorIR after Relay optimization, rather than lowering each primitive function separately. I guess this has to do with how AutoTVM/Ansor will explore schedules, but I have a feeling the exploration could be scoped via the “blocks”, which would otherwise lead to an explosion of the search space. (Looking from an AoT angle here.)
Moreover, maybe that could lay a foundation for inter-primitive-function optimizations later.
This is the right way to go. However, I have two concerns:
- How do we fuse ops as much as possible? Fusion is basically the analogue of copy propagation in compilers, which relies on data-flow analysis, but TVM still lacks that kind of program analysis today (a toy sketch of this view of fusion follows this list).
- TE tensorize cannot handle some complex pattern matching, see https://github.com/apache/incubator-tvm/pull/1053. Can we do 100% pattern matching in TIR?
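To make the first concern concrete, here is a toy illustration (not TVM code, op names purely made up) of fusion framed as a def-use / copy-propagation-style analysis on an op graph: an elementwise consumer can be folded into its producer when the producer's result has exactly one use.

```python
from collections import defaultdict

# toy op graph: op name -> list of input op names
graph = {
    "conv": [],
    "bias_add": ["conv"],
    "relu": ["bias_add"],
}
elementwise = {"bias_add", "relu"}

# count uses of each op's result (a simple def-use analysis)
uses = defaultdict(int)
for op, inputs in graph.items():
    for inp in inputs:
        uses[inp] += 1

# an elementwise op whose single producer has a single use can be fused into it,
# just like copy propagation folds away a single-use copy
fusable = [op for op, inputs in graph.items()
           if op in elementwise
           and len(inputs) == 1
           and uses[inputs[0]] == 1]
print(fusable)  # ['bias_add', 'relu']
```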
@xqdan Thank you for the valuable feedback! Fusion can be done automatically with some analysis provided in Ansor.
Do you have any other kind of analysis in mind that might be potentially useful?
Is fusion in Ansor based on TIR? For other transforms, you may check out here; that’s what we’ve done in AKG. I can explain some of it if you are interested.
@junrushao It’s better to know whether loops are vectorizable, permutable, or distributable. ISL can provide this information, so we can do loop optimization and tensorization/vectorization automatically.
@xqdan In Ansor, fusion analysis is handled in TE with some straightforward heuristics, which I believe have covered our use cases. CC: @merrymercy @jcf94
Agree that ISL provides effective information about vectorization, and I believe there might be other competitive heuristics too. Tensorization is a more general topic that would be super interesting to explore.
How does the compilation speed compare to the original TE? In Ansor/AutoTVM, we have to compile a lot of schedules for feature extraction, so the speed of schedule transformation matters.
Do you have any benchmark results? Intuitively, I think the original TE will be faster because it can do batched bound inference and AST construction. If that is true, how can we fix this performance gap?
@merrymercy I didn’t quite follow the point about batched bound inference; doesn’t Ansor use a pool of threads for massive bound inference?
@junrushao I guess @merrymercy’s point is that doing the analysis in TE is quicker than using ISL.
ISL is surely a powerful tool for loop analysis, but in my understanding we would have to lower the schedule to C code first before using ISL, which I think is more time consuming.
Currently, Ansor applies some simple but useful analyses based on TE. They may not be as accurate as ISL, but they are cheap. Then we count on the tuning to try lots of uncertain schedules and find the best one by measuring.
@jcf94 @junrushao Sorry, neither of you understood my question correctly.
I mean that the original TE is a declarative language, so it knows all transformations before it starts to generate the low-level AST, whereas the new schedule primitives are applied imperatively. In the original TE, we can share some analysis results (e.g. dependency analysis) across transformations, so it is expected to be faster.
@merrymercy Good question! Here’s an example of TIR’s schedule.
s = tir.create_schedule(original_func)   # create a schedule over the TensorIR function
update = s.get_block("C")                # look up the block named "C"
i, j, k = s.get_axes(update)             # loop axes surrounding the block
i_o, i_i = s.split(i, bn)                # tile i by factor bn
j_o, j_i = s.split(j, bn)                # tile j by factor bn
k_o, k_i = s.split(k, 4)                 # tile the reduction axis by 4
s.reorder(i_o, j_o, k_o, k_i, i_i, j_i)  # reorder the tiled loop nest
TIR’s schedule is not totally stateless. Scope info and dependency-graph info are actively maintained during the scheduling process in the Schedule class; we don’t recompute them each time we apply a new primitive. After lowering to TIR without blocks, we no longer maintain this info, since the IR is not schedulable at that point.
All in all, it would be good to run benchmarks to compare them in practice. I hope I understand your question correctly.
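As a rough, purely illustrative sketch (not TVM’s actual implementation; class and method names are made up) of the idea of keeping analysis state inside the schedule and patching it per primitive instead of recomputing it:

```python
# Toy schedule that caches producer/consumer info once and patches it in place.
class ToySchedule:
    def __init__(self, blocks):
        # blocks: dict of block name -> list of blocks it reads from
        self.reads = dict(blocks)
        # cached reverse map (consumers), built once at schedule creation
        self.consumers = {}
        for blk, deps in self.reads.items():
            for d in deps:
                self.consumers.setdefault(d, []).append(blk)

    def compute_inline(self, blk):
        # remove blk and patch the cached maps, rather than rebuilding them
        deps = self.reads.pop(blk)
        users = self.consumers.pop(blk, [])
        for d in deps:
            self.consumers[d].remove(blk)
            self.consumers[d].extend(users)
        for c in users:
            self.reads[c] = [d for d in self.reads[c] if d != blk] + deps


s = ToySchedule({"A": [], "B": ["A"], "C": ["B"]})
s.compute_inline("B")
print(s.reads)  # {'A': [], 'C': ['A']}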
When I read this RFC, I was confused because I read that the compilation flow currently goes from
TF/PyTorch/ONNX -> Relay -> TE (based on TE schedule) -> TIR -> C++/CUDA
And then I read:
TensorIR is a brand new low-level IR with full scheduling support. Here are some […]
And then I see references to TIR throughout the rest of the RFC, and it is unclear to me whether these references are to the new TensorIR or to the old TIR.
Can we clarify, both here and in the code base moving forward, whether we are referring to TensorIR or the old TIR? I think there are two ways of doing this going forward and we should explicitly pick one:
- Have TIR refer to the old version of TIR, and always use TensorIR when talking about the new IR.
- Be clear that TensorIR will be replacing the old TIR moving forward, so that any reference to TIR could be referring to the new or old version (and should be clarified if not obvious from context).
Someone more knowledgeable than me should pick one of those (or correct me and/or point out other options if I’ve gotten anything wrong here).
Yes, the ambiguity is something I was struggling with too when having a conversation. May I ask what the “T” of the old TIR stands for? TVM?
TensorIR can be viewed as a major feature enhancement (an upgrade) to the TIR in master. That is why TensorIR and TIR are used interchangeably; they are supposed to be the same thing.
Some of the elements, like multidimensional buffer load and the TVM script, are already present as part of the unified IR effort.
The upgrade will happen quite naturally: TIR continues to support current code as it is, and gains new scheduling capabilities through the new block constructs.
Hi,
I was wondering about the status of this RFC. Is there any PR or work in progress available?
Thanks
Hi TVM,
This idea is so cool, I think it is going to make it possible for mortals to use TVM effectively.
I have a couple of questions about the snippet of the scheduling language.
The three issues I have when programming TVM are:
- Too many variables in scope with meaningless names, and forgetting where they came from.
- Losing track of which axes need to be split identically
- Not understanding the semantics of compute_at and how it tells which axes line up.
It seems like maybe this fixes a couple of these. However, the following still bugs me a bit:
s = tir.create_schedule(matmul)
update = s.get_block("C")
i, j, k = s.get_axes(update)
i_o, i_i = s.split(i, bn)
j_o, j_i = s.split(j, bn)
k_o, k_i = s.split(k, 4)
s.reorder(i_o, j_o, k_o, k_i, i_i, j_i)
Curious about a few things:
- Why are there strings for get_block?
- Why not have split return a named outer/inner tuple, to discourage this _-style naming? It gets so messy so quickly. (See the sketch after this list.)
- Does this proposal fix the issue of having to repeat identical splits for things like shared and local buffers that need to be done later in the code (in order for compute_at to work)?
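For the second point, a hypothetical illustration of the ergonomics of a named-tuple split; the Split type and field names are made up for this sketch and are not part of the proposed TensorIR schedule API:

```python
from typing import NamedTuple

# Hypothetical return type for split: a named pair instead of (outer, inner) unpacking.
class Split(NamedTuple):
    outer: object
    inner: object

# Usage sketch, assuming s.split returned such a tuple:
#   i_tiles = s.split(i, bn)
#   j_tiles = s.split(j, bn)
#   k_tiles = s.split(k, 4)
#   s.reorder(i_tiles.outer, j_tiles.outer, k_tiles.outer,
#             k_tiles.inner, i_tiles.inner, j_tiles.inner)
```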
Thanks so much! Love the library, want to get to a point where I can teach it to my students effectively. /Sasha