Internal Error: constant_size > 0

Hi, I wrote a simple matmul program and a schedule for it. One dimension (the batch size) is unknown at compile time, and I'd like to pad it so that I can bind the block and thread IDs for parallelism without resorting to conditional statements inside the multiplication.

I don't know what the error means or how to go about resolving it.

The Original TIR Code.

import tvm
from tvm import tir
from tvm.script import ir as I
from tvm.script import tir as T

@I.ir_module
class MatModule:
    @T.prim_func(private=True)
    def main(B_buffer: T.Buffer((T.int64(4096), T.int64(12288)), "float16"), A_handle: T.handle, C_buffer: T.handle):
        T.func_attr({"tir.noalias": T.bool(True)})
        # The batch dimension is a symbolic variable, only known at run time.
        batch_size = T.int64()

        A = T.match_buffer(A_handle, (batch_size, T.int64(4096)), "float16")
        C = T.match_buffer(C_buffer, (batch_size, T.int64(12288)), "float16")

        for i, j, k in T.grid(batch_size, 12288, 4096):
            with T.block("matmul"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float16(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B_buffer[vk, vj]

The Schedule.

sch = tir.Schedule(MatModule)
Threads_X = 32
Threads_Y = 8
VecSize = 8
ChunkSize = 128

matmul_blk = sch.get_block("matmul")
# Pad the (m, n, k) extents to multiples of the tile sizes, so the
# dynamic batch dimension divides evenly into the thread tiling.
sch.pad_einsum(matmul_blk, [Threads_Y, Threads_X * VecSize, ChunkSize])

# The padding blocks that pad_einsum inserted around the matmul.
rmat, wmat = sch.get_producers(matmul_blk)[0], sch.get_consumers(matmul_blk)[0]
m, n, k = sch.get_loops(matmul_blk)
mo, mi = sch.split(m, [None, Threads_Y])
no, ni, nv = sch.split(n, [None, Threads_X, VecSize])
ko, ki = sch.split(k, [None, ChunkSize])
sch.reorder(mo, no, mi, ni, ko, ki, nv)

# Stage the padded input copy at the reduction tile and the padded
# output copy at the thread tile.
sch.compute_at(rmat, ko)
sch.reverse_compute_at(wmat, ni)

# Bind the tiled loops to GPU block/thread indices.
sch.bind(mo, "blockIdx.x")
sch.bind(no, "blockIdx.y")
sch.bind(mi, "threadIdx.x")
sch.bind(ni, "threadIdx.y")

# Vectorize the innermost axes of the copies and of the output.
ro, rv = sch.split(sch.get_loops(rmat)[-1], [None, VecSize])
wo, wv = sch.split(sch.get_loops(wmat)[-1], [None, VecSize])
sch.vectorize(rv)
sch.vectorize(wv)
sch.vectorize(nv)

sch.show()
func = tir.transform.MakePackedAPI()(sch.mod)["main"]
ex = tvm.build(func, target="opencl")

The Error.

InternalError: Traceback (most recent call last):
  8: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::__mk_TVM23::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#1}>(tvm::__mk_TVM23::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  7: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
  6: tvm::codegen::Build(tvm::IRModule, tvm::Target)
  5: _ZN3tvm7runtime13PackedFun
  4: tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::IRModule, tvm::Target)>::AssignTypedLambda<tvm::runtime::Module (*)(tvm::IRModule, tvm::Target)>(tvm::runtime::Module (*)(tvm::IRModule, tvm::Target), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
  3: tvm::codegen::BuildOpenCL(tvm::IRModule, tvm::Target)
  2: tvm::codegen::CodeGenC::AddFunction(tvm::GlobalVar const&, tvm::tir::PrimFunc const&)
  1: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::AllocateNode const*)
  0: _ZN3tvm7runtime6deta
  File "/mlc-llm/3rdparty/tvm/src/target/source/codegen_c.cc", line 984
InternalError: Check failed: constant_size > 0 (0 vs. 0) : Can only handle constant size stack allocation for now

Please try to cache_read A and B into shared memory.

The error is because memory of non-constant size is being allocated: the staging buffers default to global scope and their extents depend on the dynamic batch_size, but the codegen can only handle constant-size allocations.
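
A minimal sketch of that suggestion, applied to the schedule above. The buffer indices assume the matmul block reads A first and B second, and the variable names are illustrative, not from the original post:

# Stage both operands into shared-memory tiles of constant size.
A_sh = sch.cache_read(matmul_blk, read_buffer_index=0, storage_scope="shared")
B_sh = sch.cache_read(matmul_blk, read_buffer_index=1, storage_scope="shared")
# Compute each copy once per reduction tile; the staged regions then have
# shapes determined by Threads_Y, Threads_X * VecSize, and ChunkSize alone.
sch.compute_at(A_sh, ko)
sch.compute_at(B_sh, ko)

Because the shared tiles are sized by compile-time constants, their allocations pass the constant_size > 0 check, whereas anything sized by batch_size does not.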