Thanks for the proposal! This definitely opens more opportunities for performance optimization. Two questions for clarification:
- IIUC, based on the proposal and discussion, we will have both TE and TIR, but TE is more like a frontend wrapper over TIR to serve users who prefer to write a high-level DSL. What will we do with the TE schedule primitives, then? Intuitively, we should keep them; otherwise TE writers would have no way to schedule their computes, since they know nothing about TIR and blocks.
- Does this proposal support dynamic shape (i.e., `Any`)? For example, can we have something like:

```python
@tvm.hybrid.script
def matmul(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    C = tir.match_buffer(c, (1024, 1024), "float32")
    A = tir.match_buffer(a, (1024, Any), "float32")
    B = tir.match_buffer(b, (Any, 1024), "float32")
    reducer = tir.comm_reducer(lambda x, y: x + y, tir.float32(0))
    with tir.block([1024, 1024, tir.reduce_axis(0, 1024)], "C") as [vi, vj, vk]:
        reducer.step(C[vi, vj], A[vi, vk] * B[vk, vj])

s = tir.create_schedule(matmul)
update = s.get_block("C")
i, j, k = s.get_axes(update)
i_o, i_i = s.split(i, bn)
j_o, j_i = s.split(j, bn)
k_o, k_i = s.split(k, 4)
```

In this case, the length of `vk` (or `k`) is `Any`. Can we still apply `split` to it with a fixed factor?
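For intuition, here is a plain-Python sketch (an illustration of the loop transformation, not the TVM API) of what a fixed-factor split over a dynamic extent would have to generate: the outer extent becomes `ceil(n / factor)`, and a predicate guards the tail iteration.

```python
def split_loop(n, factor):
    """Enumerate 0..n-1 via (outer, inner) pairs with a tail guard.

    Mirrors the loop structure a schedule `split` with a fixed factor
    implies when the axis extent `n` is only known at runtime (Any).
    """
    visited = []
    outer_extent = (n + factor - 1) // factor  # ceil(n / factor)
    for i_o in range(outer_extent):
        for i_i in range(factor):
            i = i_o * factor + i_i
            if i < n:  # tail predicate: the last outer iteration may be partial
                visited.append(i)
    return visited

# 10 is not divisible by 4, yet every index is visited exactly once.
assert split_loop(10, 4) == list(range(10))
```

So the question is whether the proposed TIR schedule would insert such predicates automatically when the factor does not divide an `Any` extent.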