[RFC] TensorIR: A schedulable IR for TVM

Is there currently any (even experimental) support for autotvm/auto-scheduler atop TensorIR?

Thanks!

Yep @jkosaian, and that’s meta schedule :slight_smile:

Great! Thank you for your help!

Could you point us to a link with an example of using meta schedule with TensorIR? It doesn’t need to be a full tutorial; a unit test or experimental code would also be sufficient.
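For anyone landing here later, a rough sketch of what such a flow can look like (this is only illustrative: it assumes a recent TVM build where `tvm.meta_schedule.tune_tir` and `tir_integration.compile_tir` are available, the matmul module is just an example workload, and `./ms_work_dir` is a hypothetical log directory; the exact entry points have changed across releases):

import tvm
from tvm import meta_schedule as ms
from tvm.script import tir as T

@tvm.script.ir_module
class MatmulModule:
    @T.prim_func
    def main(A: T.Buffer[(128, 128), "float32"],
             B: T.Buffer[(128, 128), "float32"],
             C: T.Buffer[(128, 128), "float32"]) -> None:
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i, j, k in T.grid(128, 128, 128):
            with T.block("C"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float32(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

# tune the TensorIR module and retrieve the best schedule found so far
target = tvm.target.Target("llvm -num-cores=4")
database = ms.tune_tir(
    mod=MatmulModule,
    target=target,
    work_dir="./ms_work_dir",   # hypothetical directory for tuning records
    max_trials_global=64,
)
sch = ms.tir_integration.compile_tir(database, MatmulModule, target)
if sch is not None:
    sch.mod.show()              # inspect the tuned TensorIR
    print(sch.trace)            # and the schedule trace that produced it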

Hi TVM.

I’m very interested in TensorIR. I’m now testing a simple matrix multiplication and want to bind an axis to a GPU thread, but I get an error saying that the child block is neither a complete block nor a reduction block. Would you please help me address this issue?

My code is shown below.

import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [128, 128])
        B = T.match_buffer(b, [128, 128])
        C = T.match_buffer(c, [128, 128])

        for i, j, k in T.grid(128, 128, 128):
            with T.block("update"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

ir_module = MyModule
sch = tvm.tir.Schedule(ir_module)
block_b = sch.get_block("update")
i, j, k = sch.get_loops(block_b)
sch.bind(i, "threadIdx.x")

@RenyuanLIU You probably forgot the init block. CC: @Hzfengsy

Hi @RenyuanLIU. A reduction block contains both an init part and an update part. The full block should be:

with T.block("update"):
    vi, vj, vk = T.axis.remap("SSR", [i, j, k])
    with T.init():
        C[vi, vj] = T.float32(0)
    C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]
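Putting it back into the full module from the original post, a sketch of the corrected version; with the init added, the original bind call should go through:

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [128, 128])
        B = T.match_buffer(b, [128, 128])
        C = T.match_buffer(c, [128, 128])

        for i, j, k in T.grid(128, 128, 128):
            with T.block("update"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float32(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

sch = tvm.tir.Schedule(MyModule)
block_b = sch.get_block("update")
i, j, k = sch.get_loops(block_b)
sch.bind(i, "threadIdx.x")  # no longer rejected: "update" is now a reduction block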

Those seem to be great lessons that we can use to enhance the tutorials and docs!


Thank you very much!

Hi @Hzfengsy. Thank you for your help last time.

I am now trying to test the matrix multiplication C = A*B with different sizes of A and B. Is it possible to write something like

A = T.match_buffer(a, [N, N])

and assign the concrete size to N every time I get A and B?

I found that you said TensorIR does not support dynamic buffers yet. But in this case, N is determined every time I get A and B, so I’m wondering whether there is any solution to this problem.

Thanks

Please see the API called specialize.
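For reference, a rough sketch of how specialize can be used for this case (the symbolic-size function `matmul_any` below is only an illustration I made up; see the `PrimFunc.specialize` documentation for the exact semantics):

import tvm
from tvm import tir
from tvm.script import tir as T

@T.prim_func
def matmul_any(a: T.handle, b: T.handle, c: T.handle, n: T.int32) -> None:
    A = T.match_buffer(a, [n, n])
    B = T.match_buffer(b, [n, n])
    C = T.match_buffer(c, [n, n])
    for i, j, k in T.grid(n, n, n):
        with T.block("update"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

# once the concrete size is known, bind the symbolic shape variable
a, _, _, n = matmul_any.params
func_128 = matmul_any.specialize({n: 128})           # all shapes become [128, 128]
# ... or, equivalently, specialize through a concrete buffer
func_128_buf = matmul_any.specialize({a: tir.decl_buffer((128, 128))})
sch = tvm.tir.Schedule(func_128)                     # schedule the specialized PrimFunc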

Thank you very much! I appreciate it.

Hi, TVM. I am now confused about cache_read/cache_write. When I want to allocate shared/local memory for CUDA, I cannot control the size of the shared/local buffer.

For example:

    @tvm.script.ir_module
    class MyModule:
        @T.prim_func
        def main(a: T.handle, b: T.handle, c: T.handle):
            T.func_attr({"global_symbol": "main", "tir.noalias": True})
            A = T.match_buffer(a, [128, 128])
            B = T.match_buffer(b, [128, 128])
            C = T.match_buffer(c, [128, 128])

            for io in T.thread_binding(0, 16, thread="threadIdx.y"):
                for jo in T.thread_binding(0, 16, thread="threadIdx.x"):
                    for i, j, k in T.grid(8, 8, 128):
                        with T.block("C"):
                            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

    func = MyModule
    sch = tvm.tir.Schedule(func)
    c = sch.get_block("C")
    C_local = sch.cache_write(c, 0, "local")

Here I would like the C local memory size to be 8*8, but it was automatically set to 128*128, the size of the entire C.

Would you please help me with this issue?

@RenyuanLIU There is a pass CompactBufferAllocation that makes the allocated buffer region smaller. See the usage here: https://github.com/apache/tvm/blob/174d09ee2cef1ea2caab4c84e0bd58d90c09178f/tests/python/unittest/test_tir_transform_compact_buffer_region.py
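A minimal sketch of applying the pass by hand (as far as I know it also runs as part of the standard lowering pipeline in tvm.build; the key point is that the cache block has to be placed under the thread loops first so that the accessed region becomes small):

import tvm

# assuming `sch` is the tir.Schedule from the snippet above, after the
# C_local cache block has been moved under the thread loops with
# reverse_compute_at / compute_at
compacted = tvm.tir.transform.CompactBufferAllocation()(sch.mod)
print(compacted.script())   # C_local should now be allocated with the tile shape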


@yzh119 Thank you for that! I’ll try.

Hi TVM.

I’m still trying to allocate local memory for matrix multiplication. I found that the CUDA code produced for T.init() is an if statement that checks whether the reduction axis is 0. For example:

@T.prim_func
def matmul(
    A: T.Buffer[(512, 512), "float32"],
    B: T.Buffer[(512, 512), "float32"],
    C: T.Buffer[(512, 512), "float32"],
) -> None:
    # function attr dict
    T.func_attr({"global_symbol": "main", "tir.noalias": True})
    # body
    # with T.block("root")
    for i0, i1, i2 in T.grid(512, 512, 512):
        with T.block("C"):
            i, j, k = T.axis.remap("SSR", [i0, i1, i2])
            # T.reads(C[i, j], A[i, k], B[k, j])
            # T.writes(C[i, j])
            with T.init():
                C[i, j] = T.float32(0)
            C[i, j] = C[i, j] + A[i, k] * B[k, j]

func = matmul
sch = tvm.tir.Schedule(func, debug_mask="all")
c = sch.get_block("C")
c_local = sch.cache_write(c, 0, "local")

i, j, k = sch.get_loops(c)
sch.bind(i, "threadIdx.x")
sch.reverse_compute_at(c_local, i)

And the CUDA code is:

extern "C" __global__ void __launch_bounds__(512) main_kernel0(float* __restrict__ A, float* __restrict__ B, float* __restrict__ C) {
float C_local[512];
for (int i1 = 0; i1 < 512; ++i1) {
    for (int i2 = 0; i2 < 512; ++i2) {
        if (i2 == 0) {
            C_local[i1] = 0.000000e+00f;
    }
        C_local[i1] = (C_local[i1] + (A[((((int)threadIdx.x) * 512) + i2)] * B[((i2 * 512) + i1)]));
    }
}
for (int ax0 = 0; ax0 < 512; ++ax0) {
    C[((((int)threadIdx.x) * 512) + ax0)] = C_local[ax0];
}
}

I am wondering if it is possible to move the initialization outside of the nested for loop, so there is no need to evaluate the if statement in every iteration. Like this:

extern "C" __global__ void __launch_bounds__(512) main_kernel0(float* __restrict__ A, float* __restrict__ B, float* __restrict__ C) {
float C_local[512];


for (int i1 = 0; i1 < 512; ++i1) {
    C_local[i1] = 0.000000e+00f;
}
for (int i1 = 0; i1 < 512; ++i1) {
    for (int i2 = 0; i2 < 512; ++i2) {
        C_local[i1] = (C_local[i1] + (A[((((int)threadIdx.x) * 512) + i2)] * B[((i2 * 512) + i1)]));
    }
}
for (int ax0 = 0; ax0 < 512; ++ax0) {
    C[((((int)threadIdx.x) * 512) + ax0)] = C_local[ax0];
}
}

I did not find a TIR API to do this. Would you please help me with this issue? Thank you in advance.

@RenyuanLIU the decompose_reduction API is what you need.
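Continuing the schedule from the snippet above (a sketch; `c` and `j` are the block and loop handles obtained there), something like this should hoist the init into its own loop nest:

# insert the init statement right before the spatial loop `j`, so it becomes
# a separate initialization loop instead of an `if (i2 == 0)` branch inside
# the reduction loop
sch.decompose_reduction(c, j)
print(sch.mod.script())   # the init now appears as its own block above loop j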

@yzh119 Thank you very much for that!

Hi TVM,

I ran into a problem when I tried to move a matrix that is indexed by another matrix into shared/local memory. For example:

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle, d: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [128, 128], "float32")
        B = T.match_buffer(b, [128, 128], "int32")
        C = T.match_buffer(c, [128, 1], "float32")

        with T.block("test"):
            for i, j in T.grid(128, 128):
                with T.block("C"):
                    vi, vj = T.axis.remap("SR", [i, j])
                    with T.init():
                        C[vi, 0] = 0.0
                    C[vi, 0] += A[B[vi, vj], B[vi, vj]]

ir_module = MyModule
sch = tvm.tir.Schedule(ir_module)

c = sch.get_block("C")
i, j = sch.get_loops(c)
ib, io, ii = sch.split(i, factors=[None, 16, 8])
jo, ji = sch.split(j, factors=[None, 4])

sch.reorder(ib, io, jo, ji, ii)
sch.bind(ib, "blockIdx.x")
sch.bind(io, "threadIdx.x")

C_local = sch.cache_write(c, 0, "local")
sch.reverse_compute_at(C_local, io)

A_local = sch.cache_read(c, 1, "local")
sch.compute_at(A_local, jo)

It returned an error, “Cannot match type handle vs int32”.

I’m wondering whether TIR supports splitting a matrix that is indexed by another one and caching it in local memory. Thank you in advance.