Hey @srush,
Thanks for your valuable feedback! Please allow me to try to explain the design rationale below:
Q1. There are strings for get_block?
Yeah. One design principle we hold for TensorIR is that everything needed for scheduling is contained in the TensorIR Python syntax, so that there is no “mysteriously hidden” information anymore. Given a TensorIR script in Python, we can schedule on it.
In this particular case, the string is the name of the block in the text format.
Here is the example:
@tvm.script.tir
def matmul(x: ty.handle, y: ty.handle, z: ty.handle) -> None:
    X = tir.match_buffer(x, [128, 128], "float32")
    Y = tir.match_buffer(y, [128, 128], "float32")
    Z = tir.match_buffer(z, [128, 128], "float32")
    # ⬇️ name of the block is "Z"
    with tir.block([128, 128, tir.reduce_axis(0, 128)], "Z") as [i, j, k]:
        with tir.init():
            Z[i, j] = tir.float32(0)
        Z[i, j] = Z[i, j] + (X[i, k] * Y[k, j])
"""Print TensorIR in python syntax"""
print(tvm.script.asscript(matmul))
"""
Create a schedule
Schedule is TensorIR => TensorIR
so we can print the latest TensorIR at any step of scheduling
"""
sch = tvm.tir.create_schedule(matmul)
ax0, ax1 = sch.split(...)
"""Print TensorIR at any step"""
print(tvm.script.asscript(sch.func))
sch.fuse(...)
print(tvm.script.asscript(sch.func))
"""
We can also print loops and blocks, in syntax like:
for i in range(0, 100)
"""
print(ax0)
Q2. Why not have split go into a named out/in tuple to discourage this _ style naming. It gets so messy so quickly
Agreed. We do find the _ style naming annoying in our experiments, especially when scheduling gets complicated and it generates really horrible names like i0_outer_outer_outer_outer_i1_outer_outer_outer_outer_fused_i2_outer_outer_outer_outer_fused_i3_outer_outer_outer_outer_fused.
We have some internal strawman proposals for syntactic sugar, but have not converged on a perfect solution yet.
Solution 1. Allow splitting by multiple factors + accept customized naming of axes.
a, b, c, d = s.split(axes, factors=[None, 2, 4, 8], names=["a", "b", "c", "d"])
Note that None in factors means that the schedule infers that factor.
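To make the None inference concrete, here is a minimal plain-Python sketch of how one missing factor could be derived from the loop extent. The helper name infer_factors and its rounding-up behaviour are assumptions for illustration; this is not the TVM implementation or API.

```python
import math
from typing import List, Optional


def infer_factors(extent: int, factors: List[Optional[int]]) -> List[int]:
    """Fill in the single None entry so the split loops cover `extent`.

    Hypothetical helper for illustration only; not part of TVM.
    """
    if factors.count(None) > 1:
        raise ValueError("at most one factor may be None")
    # Product of the factors the user did specify.
    known = math.prod(f for f in factors if f is not None)
    if None not in factors:
        if known < extent:
            raise ValueError("factors do not cover the loop extent")
        return list(factors)
    # Round up so the split loops still cover the whole extent.
    inferred = -(-extent // known)
    return [inferred if f is None else f for f in factors]


print(infer_factors(128, [None, 2, 4, 8]))  # [2, 2, 4, 8]
```

With an extent of 128 and factors [None, 2, 4, 8], the known factors multiply to 64, so the inferred outer factor is 2.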
Solution 2. Allow splitting by multiple factors + einsum-like style API + get axes by name
i0, _, _, _ = s.split("i => i0, i1, i2, i3", factors=[None, 2, 4, 8])
# or allow retrieval by name
i0 = s.get_axis(name="i0")
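As a rough illustration of the einsum-like pattern, the sketch below parses a string such as "i => i0, i1, i2, i3" into the source axis and the names of the resulting axes. The function parse_split_pattern and its validation rules are hypothetical; the strawman above does not specify the exact grammar.

```python
from typing import List, Tuple


def parse_split_pattern(pattern: str) -> Tuple[str, List[str]]:
    """Parse 'src => out0, out1, ...' into (src, [out0, out1, ...]).

    Hypothetical helper for illustration only; not part of TVM.
    """
    lhs, sep, rhs = pattern.partition("=>")
    if not sep:
        raise ValueError("pattern must contain '=>'")
    source = lhs.strip()
    names = [name.strip() for name in rhs.split(",")]
    if not source or any(not name for name in names):
        raise ValueError("empty axis name in pattern")
    if len(set(names)) != len(names):
        raise ValueError("output axis names must be unique")
    return source, names


source, names = parse_split_pattern("i => i0, i1, i2, i3")
print(source, names)  # i ['i0', 'i1', 'i2', 'i3']
```

Keeping the names in the pattern means a later lookup by name only needs a dictionary from each output name to its loop, which is the idea behind the get_axis call above.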
We are open to new proposals too.
Q3. Does this propose to fix the issue of having to repeat the identical splits for things like shared and local buffers that need to be done later in the code. (In order for compute at to work)
In short, yes, and it is solved by introducing a new scheduling primitive, reverse_compute_at.
Definition of compute_at. Given a producer and a consumer, compute_at computes part of the producer’s region under one of the consumer’s loops.
Definition of reverse_compute_at. Given a producer and a consumer, reverse_compute_at computes part of the consumer’s region under one of the producer’s loops.
Our typical use case. We have a heavy producer, like conv2d, and a light consumer, such as the block generated by cache_write or a small ReLU, and we want to fuse them for better locality.
Why compute_at requires duplicated splitting. With compute_at, we are moving the producer under a loop of the consumer. First, the user has to split the consumer, otherwise the user doesn’t even know which axis to compute at. Second, the user has to split the producer so that the other tiles are correctly positioned. That is why it is so lengthy and tedious.
Why reverse_compute_at avoids duplicated splitting. In this case, we only need to split the producer and then put the consumer under a specific loop of the producer, so no duplicate splitting is needed.
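The loop structure that results from placing the consumer under a producer loop can be sketched in plain Python. This is not TVM code: the sizes N and TILE, the stand-in producer arithmetic, and both function names are assumptions chosen only to show that the producer is split once and the consumer's matching region then runs under the producer's outer loop.

```python
from typing import List

N, TILE = 16, 4  # illustrative sizes, not taken from the discussion above


def unfused(a: List[float]) -> List[float]:
    """Producer and consumer run as two separate loop nests."""
    b = [0.0] * N
    for i in range(N):
        b[i] = a[i] * 2.0 - 10.0  # stand-in for a heavy producer
    c = [0.0] * N
    for i in range(N):
        c[i] = max(b[i], 0.0)  # light consumer (ReLU)
    return c


def consumer_under_producer_loop(a: List[float]) -> List[float]:
    """Only the producer is split; the consumer follows its outer loop."""
    b = [0.0] * N
    c = [0.0] * N
    for io in range(N // TILE):  # producer's outer loop after one split
        for ii in range(TILE):
            i = io * TILE + ii
            b[i] = a[i] * 2.0 - 10.0
        # Consumer region covering this tile, placed under the producer's
        # outer loop; no separate split of the consumer was written.
        for ii in range(TILE):
            i = io * TILE + ii
            c[i] = max(b[i], 0.0)
    return c


data = [float(i) for i in range(N)]
assert unfused(data) == consumer_under_producer_loop(data)
```

Each tile of b is consumed right after it is produced, which is the locality benefit described above.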
CC: @tqchen @Hzfengsy @spectrometerHBH @vinx13 @masahi