Different Low-Level IRs?

I have stumbled upon some documentation that suggests there are different low-level IRs. One seems to come with a block statement to group computations, and the other does not seem to include blocks. Is that actually the case, or am I missing something? If they are different, what are the use cases for each of them?

I’m assuming this question is about TIR and not Relax (as Relax also has something called DataflowBlock which is completely different).

When it comes to TIR, the current method of scheduling involves directly modifying the TIR AST through schedule primitives. This TIR (originally introduced in the RFC as TensorIR) has blocks, which eventually get lowered away to a version of TIR without blocks through a combination of passes.

Earlier, scheduling happened directly on Tensor Expressions (TE), and the scheduled TE was then lowered directly to TIR, which did not have any blocks. This lowered TIR is the same as what you get when lowering from TensorIR with blocks.

Essentially, after lowering they are the same; blocks only appear in TIR when we view it as TensorIR and schedule it directly, as opposed to going through TE.
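For example, a quick way to see the difference (a minimal sketch; the exact printed output depends on your TVM version) is to build a tiny te computation, print the PrimFunc you get from te.create_prim_func, and then print what tvm.lower produces from it:

import tvm
from tvm import te

# Toy elementwise computation, just to have something to print
n = 128
A = te.placeholder((n,), dtype="float32", name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")

# TensorIR form: the PrimFunc contains a block named "B"
with_blocks = te.create_prim_func([A, B])
print(with_blocks)

# The standard lowering passes strip the block structure away
# (depending on the TVM version, the printout may still show an opaque root block)
print(tvm.lower(with_blocks))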

Hope I didn’t confuse you further; feel free to ask more questions if anything is still unclear.

So are TensorIR and TIR different things?

My goal is to integrate custom hardware with AutoTVM, so I wanted to start by manually defining a schedule and then turning that into an AutoTVM template.

I started out by defining a computation with te.compute and then used split, reorder, etc. to tile it for my case (just to learn how it all works). But now I am wondering whether that is still the correct flow. I didn’t find any equivalent to te.compute in TIR, so starting with TE just seemed natural.
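Concretely, the flow I have been using looks roughly like this (a toy sketch with made-up sizes and tiling factors, just to illustrate what I mean):

import tvm
from tvm import te

n = 1024
A = te.placeholder((n, n), dtype="float32", name="A")
B = te.compute((n, n), lambda i, j: A[i, j] * 2.0, name="B")

# Classic TE scheduling: create a schedule and tile the two loops
s = te.create_schedule(B.op)
io, ii = s[B].split(B.op.axis[0], factor=32)
jo, ji = s[B].split(B.op.axis[1], factor=32)
s[B].reorder(io, jo, ii, ji)

# Lowering this gives me TIR without any blocks
print(tvm.lower(s, [A, B], simple_mode=True))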

But now I am confused about which level I need to operate on. A few examples I found use InjectCopyIntrin to insert data-movement instructions, but they also seem to operate on the IR with blocks. Coming from TE via the lower function, however, I don’t see any blocks in the lowered IR. How does this all fit together, and what is the best path to use all the different technologies?

TIR and TensorIR are the same thing. Originally, we had TIR that was generated from te after scheduling in te.

Now the same TIR has been extended with BlockNode to represent blocks, which became the primary data structure for scheduling directly on TIR. After scheduling with blocks, it gets lowered to the same TIR without blocks, producing a final TIR similar to what we would get from te.

If you’re working with te.compute and basic schedules on that, I’m guessing you’re using te schedule primitives, in which case you can ignore the TIR block-based scheduling.

Scheduling te is still supported, but scheduling with TIR might be preferable as it contains a lot more scheduling primitives. You could still use te.compute to write the compute, then generate an initial PrimFunc in TIR format with te.create_prim_func. Then you can use TIR scheduling primitives (TIR-based primitives are a superset of te schedule primitives, so you should be able to do anything that is already possible in te with just TIR scheduling).

As for AutoTVM and tuning, there is a bit of history. “AutoTVM” is the original tuning infrastructure added to work with te, which expects users to define manual templates as guidance for tuning.

Then came Ansor (the auto-scheduler), which tries to do more automated tuning based on predefined scheduling rules.

Both AutoTVM and Ansor work mostly on TE, as far as I understand.

After TIR-based scheduling was introduced, MetaSchedule was added. It was designed to work with TIR-based scheduling and blocks, so if you move to TIR-based scheduling, this might be the right thing to explore.
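I haven’t used it much myself, but a minimal tuning sketch could look roughly like the code below. Treat the argument names as assumptions to check against your TVM version, since the tune_tir signature and return type have changed between releases (the target string, trial count, and work directory are placeholders):

import tvm
from tvm import meta_schedule as ms
from tvm import te

# Any TIR PrimFunc works as input; here a toy elementwise compute
n = 1024
A = te.placeholder((n,), dtype="float32", name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")
mod = te.create_prim_func([A, B])

# Search for good schedules; in recent TVM versions this returns a
# database of measured candidates (argument names may differ in yours)
database = ms.tune_tir(
    mod=mod,
    target="llvm -num-cores=4",
    work_dir="./ms_work_dir",
    max_trials_global=64,
)

# Pick the best schedule found for this workload
sch = ms.tir_integration.compile_tir(database, mod, target="llvm -num-cores=4")
if sch is not None:
    print(sch.mod)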


Thank you for the in-depth explanation!

After TIR-based scheduling was introduced, MetaSchedule was added. It was designed to work with TIR-based scheduling and blocks, so if you move to TIR-based scheduling, this might be the right thing to explore.

Is MetaSchedule something that can be extended for custom hardware? That is the part that wasn’t clear to me, also with regard to the auto-scheduler, and why I started with AutoTVM.

I have one more question regarding TIR: how do you begin writing a schedule? For TE, you can define the computation on an abstract level as a reduction, but I can’t find any intro on how to work with TIR scheduling. Everything seems to indicate that you have to write out the loop nests manually.

You can still define the computation with a te-style compute and then generate the TIR to be scheduled. Below is a small example for matmul:

import tvm
from tvm import te

# Matrix dimension; this could also be symbolic, e.g. n = te.size_var("n")
n = 128

# Describe the computation in TE: C[i, j] = sum_k A[k, i] * B[k, j]
A = te.placeholder((n, n), dtype="float32", name="A")
B = te.placeholder((n, n), dtype="float32", name="B")
k = te.reduce_axis((0, n), name="k")
C = te.compute((n, n), lambda i, j: te.sum(A[k, i] * B[k, j], axis=[k]), name="C")

# Create a TIR PrimFunc from the TE computation
tir_func = te.create_prim_func([A, B, C])
print(tir_func)

# Create a TIR schedule
sch = tvm.tir.Schedule(tir_func)

# Schedule with TIR scheduling primitives
c_block = sch.get_block("C")
i, j, k = sch.get_loops(c_block)

sch.parallel(i)    # parallelize the outer spatial loop
sch.reorder(k, j)  # loop order becomes i, k, j
sch.vectorize(j)   # vectorize the now-innermost spatial loop

# Print the scheduled module
print(sch.mod)

I’m not an expert with MetaSchedule, but I’ll try to explain what I understand. I guess you’re asking about being able to tune on custom hardware, and yes, as far as I understand, any hardware can be supported with some basic implementation that defines how to build and run on that hardware.

This can be done by adding support for building for that hardware and by supporting the RPC runner, as far as I understand, but I guess @junrushao might be a better person to answer this question.
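For example, if the device can be reached through a TVM RPC tracker, my understanding is that you can plug an RPC-based runner into the tuning entry point. A rough sketch (the tracker host, port, and key are placeholders, and the constructor arguments may differ slightly between TVM versions):

from tvm import meta_schedule as ms

# Describe how to reach the device registered with the RPC tracker
rpc_config = ms.runner.RPCConfig(
    tracker_host="192.168.0.10",   # placeholder: your tracker address
    tracker_port=9190,             # placeholder: your tracker port
    tracker_key="my-accelerator",  # placeholder: the key your device registered with
    session_timeout_sec=60,
)

# Measurements then run remotely on the device
runner = ms.runner.RPCRunner(rpc_config=rpc_config)

# Builds still happen locally (cross-compiling for the device's target);
# builder and runner can then be handed to the tuning entry point, e.g.
# ms.tune_tir(..., builder=ms.builder.LocalBuilder(), runner=runner)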