Is there currently any (even experimental) support for autotvm/auto-scheduler atop TensorIR?
Thanks!
Great! Thank you for your help!
Could you point us to a link with an example of using meta schedule with TensorIR? It doesn't need to be a full tutorial; a unit test or experimental code would also be sufficient.
Hi TVM.
I'm very interested in TensorIR. I'm testing a simple matrix multiplication and want to bind an axis to a GPU thread, but I get an error saying that the child block is neither a complete block nor a reduction block. Would you please help me address this issue?
My code is shown below.
import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [128, 128])
        B = T.match_buffer(b, [128, 128])
        C = T.match_buffer(c, [128, 128])
        for i, j, k in T.grid(128, 128, 128):
            with T.block("update"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

ir_module = MyModule
sch = tvm.tir.Schedule(ir_module)
block_b = sch.get_block("update")
i, j, k = sch.get_loops(block_b)
sch.bind(i, "threadIdx.x")
Hi @RenyuanLIU. A reduction block contains both an init part and an update part. The full block should be:
with T.block():
    vi, vj, vk = T.axis.remap("SSR", [i, j, k])
    with T.init():
        C[vi, vj] = 0.0
    C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]
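For completeness, here is a minimal sketch of the fixed function from the question above (the module name below is just for illustration); with the init added, the block is recognized as a reduction block and the bind succeeds:

import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class MyModuleFixed:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [128, 128])
        B = T.match_buffer(b, [128, 128])
        C = T.match_buffer(c, [128, 128])
        for i, j, k in T.grid(128, 128, 128):
            with T.block("update"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float32(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

sch = tvm.tir.Schedule(MyModuleFixed)
block_b = sch.get_block("update")
i, j, k = sch.get_loops(block_b)
sch.bind(i, "threadIdx.x")  # "update" is now a reduction block, so the bind works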
Those seem to be great lessons that we can use to enhance the tutorials and docs!
Thank you very much!
Hi @Hzfengsy. Thank you for your help last time.
I am now trying to test the matrix multiplication C = A*B with different sizes of A and B. Is it possible to write something like:
A = T.match_buffer(a, [N, N])
and assign the size to N every time I get A and B? I found that you said TensorIR does not support dynamic buffers yet, but in this case N is determined as soon as I get A and B, so I'm wondering whether there is any solution to this problem.
Thanks
Please see an API called specialize
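In case a concrete example helps, here is a rough sketch in the spirit of the specialize unit tests (the function name and the size 128 below are just for illustration): declare the buffer shapes with a symbolic var, then bind a concrete buffer to the handle parameter once the real size is known.

import tvm
from tvm.script import tir as T

@T.prim_func
def matmul(a: T.handle, b: T.handle, c: T.handle) -> None:
    n = T.var("int32")
    A = T.match_buffer(a, [n, n])
    B = T.match_buffer(b, [n, n])
    C = T.match_buffer(c, [n, n])
    for i, j, k in T.grid(n, n, n):
        with T.block("update"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

# Once the concrete size is known (say 128), specialize the handle parameter
# with a buffer of that shape; n is substituted everywhere it appears.
a, _, _ = matmul.params
specialized = matmul.specialize({a: tvm.tir.decl_buffer((128, 128))})
print(specialized.script())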
Thank you very much! I appreciate it.
Hi TVM. I am now confused about cache_read/cache_write. When I want to allocate shared/local memory for CUDA, I cannot control the size of the shared/local memory.
For example:
import tvm
from tvm import tir
from tvm.script import tir as T

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [128, 128])
        B = T.match_buffer(b, [128, 128])
        C = T.match_buffer(c, [128, 128])
        for io in T.thread_binding(0, 16, thread="threadIdx.y"):
            for jo in T.thread_binding(0, 16, thread="threadIdx.x"):
                for i, j, k in T.grid(8, 8, 128):
                    with T.block("C"):
                        vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                        C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

func = MyModule
sch = tir.Schedule(func)
block_c = sch.get_block("C")
C_local = sch.cache_write(block_c, 0, "local")
Here I would like the local copy of C to be 8*8, but it is automatically allocated as 128*128, the size of the entire C.
Would you please help me with this issue?
@RenyuanLIU There is a pass, CompactBufferAllocation, that makes the allocated buffer region smaller.
See the usage here: https://github.com/apache/tvm/blob/174d09ee2cef1ea2caab4c84e0bd58d90c09178f/tests/python/unittest/test_tir_transform_compact_buffer_region.py
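For reference, applying the passes by hand looks roughly like the sketch below (they also run automatically as part of the standard lowering pipeline when you build; `sch` is assumed to be the Schedule from the example above):

import tvm

# `sch` is assumed to be the tir.Schedule built above; sch.mod is the scheduled IRModule.
mod = sch.mod
# Move each allocation into the innermost scope that covers its accesses.
mod = tvm.tir.transform.PlanAndUpdateBufferAllocationLocation()(mod)
# Narrow each allocated buffer to the region that is actually read or written.
mod = tvm.tir.transform.CompactBufferAllocation()(mod)
mod = tvm.tir.transform.Simplify()(mod)
print(mod.script())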
Hi TVM.
I'm still trying to allocate local memory for matrix multiplication. I found that the CUDA code produced for T.init() is an if statement that checks whether the reduction axis is 0. For example:
import tvm
from tvm import tir
from tvm.script import tir as T

@T.prim_func
def matmul(
    A: T.Buffer[(512, 512), "float32"],
    B: T.Buffer[(512, 512), "float32"],
    C: T.Buffer[(512, 512), "float32"],
) -> None:
    # function attr dict
    T.func_attr({"global_symbol": "main", "tir.noalias": True})
    # body
    # with T.block("root")
    for i0, i1, i2 in T.grid(512, 512, 512):
        with T.block("C"):
            i, j, k = T.axis.remap("SSR", [i0, i1, i2])
            # T.reads(C[i, j], A[i, k], B[k, j])
            # T.writes(C[i, j])
            with T.init():
                C[i, j] = T.float32(0)
            C[i, j] = C[i, j] + A[i, k] * B[k, j]

func = matmul
sch = tir.Schedule(func, debug_mask="all")
c = sch.get_block("C")
c_local = sch.cache_write(c, 0, "local")
i, j, k = sch.get_loops(c)
sch.bind(i, "threadIdx.x")
sch.reverse_compute_at(c_local, i)
And the CUDA code is:
extern "C" __global__ void __launch_bounds__(512) main_kernel0(float* __restrict__ A, float* __restrict__ B, float* __restrict__ C) {
float C_local[512];
for (int i1 = 0; i1 < 512; ++i1) {
for (int i2 = 0; i2 < 512; ++i2) {
if (i2 == 0) {
C_local[i1] = 0.000000e+00f;
}
C_local[i1] = (C_local[i1] + (A[((((int)threadIdx.x) * 512) + i2)] * B[((i2 * 512) + i1)]));
}
}
for (int ax0 = 0; ax0 < 512; ++ax0) {
C[((((int)threadIdx.x) * 512) + ax0)] = C_local[ax0];
}
}
I am wondering if it is possible to move the initialization outside of the nested for loop, so the if statement does not have to be evaluated on every iteration. Like this:
extern "C" __global__ void __launch_bounds__(512) main_kernel0(float* __restrict__ A, float* __restrict__ B, float* __restrict__ C) {
float C_local[512];
for (int i1 = 0; i1 < 512; ++i1) {
C_local[i1] = 0.000000e+00f;
}
for (int i1 = 0; i1 < 512; ++i1) {
for (int i2 = 0; i2 < 512; ++i2) {
C_local[i1] = (C_local[i1] + (A[((((int)threadIdx.x) * 512) + i2)] * B[((i2 * 512) + i1)]));
}
}
for (int ax0 = 0; ax0 < 512; ++ax0) {
C[((((int)threadIdx.x) * 512) + ax0)] = C_local[ax0];
}
}
I did not find a TIR API to do this. Would you please help me with this issue? Thank you in advance.
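One schedule primitive that may do this is decompose_reduction, which splits the init statement of a reduction block into a separate block placed above a chosen loop. A rough sketch, assuming the `sch`, `c`, and loops from the schedule above (the exact CUDA it produces is not verified here):

# Hoist the init of block "C" into its own block above loop j, so the zero
# initialization runs in a separate loop instead of behind an if statement.
c_init = sch.decompose_reduction(c, j)
print(sch.mod.script())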
Hi TVM,
I ran into a problem when I tried to move a matrix that is indexed by another matrix into shared/local memory. For example:
import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle, d: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [128, 128], "float32")
        B = T.match_buffer(b, [128, 128], "int32")
        C = T.match_buffer(c, [128, 1], "float32")
        with T.block("test"):
            for i, j in T.grid(128, 128):
                with T.block("C"):
                    vi, vj = T.axis.remap("SR", [i, j])
                    with T.init():
                        C[vi, 0] = 0.0
                    C[vi, 0] += A[B[vi, vj], B[vi, vj]]

ir_module = MyModule
sch = tvm.tir.Schedule(ir_module)
c = sch.get_block("C")
i, j = sch.get_loops(c)
ib, io, ii = sch.split(i, factors=[None, 16, 8])
jo, ji = sch.split(j, factors=[None, 4])
sch.reorder(ib, io, jo, ji, ii)
sch.bind(ib, "blockIdx.x")
sch.bind(io, "threadIdx.x")
C_local = sch.cache_write(c, 0, "local")
sch.reverse_compute_at(C_local, io)
A_local = sch.cache_read(c, 1, "local")
sch.compute_at(A_local, jo)
It returned an error: "Cannot match type handle vs int32".
I'm wondering whether TIR supports splitting a matrix that is indexed by another one and allocating it to shared/local memory. Thank you in advance.