TIR - Parallel primitive


I am trying to implement multi-core behavior in a TensorIR schedule, but the following error occurs when applying the parallel primitive to a loop dimension.

Expected behavior

The loops should be distributed over multiple compute units (cores) and executed in parallel.
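
For reference, here is a minimal sketch of the pattern I expect to work, using a simple element-wise function whose block axis is spatial (the function and all names are illustrative, not part of the convolution below):

    import tvm
    from tvm.script import tir as T

    @tvm.script.ir_module
    class AddOne:
        @T.prim_func
        def main(a: T.handle, c: T.handle):
            T.func_attr({"global_symbol": "main", "tir.noalias": True})
            A = T.match_buffer(a, (1024,), "float32")
            C = T.match_buffer(c, (1024,), "float32")
            for i in T.grid(1024):
                with T.block("add"):
                    vi = T.axis.spatial(1024, i)  # data-parallel block axis
                    C[vi] = A[vi] + T.float32(1)

    sch = tvm.tir.Schedule(AddOne)
    (i,) = sch.get_loops(sch.get_block("add"))
    sch.parallel(i)  # legal here: the loop carries only a spatial iterator

My understanding is that parallel is only legal on a loop whose block bindings are all data-parallel, which is why I expected the batch loop of the convolution to qualify.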

Actual behavior

Error message: The queried subtree root tir.For#0 in SRef tree does not have compact dataflow, because its child block tir.Block#1 on SRef tree is neither a local complete block nor a local reduction block.


TVM version: 0.9.dev0

Steps to reproduce

    import tvm
    from tvm.script import tir as T

    @tvm.script.ir_module
    class Convolution:
        @T.prim_func
        def main(inpt: T.handle, kernl: T.handle, reslt: T.handle):
            T.func_attr({"global_symbol": "main", "tir.noalias": True})
            input = T.match_buffer(inpt, (10, 3, 128, 128), "float32")
            kernel = T.match_buffer(kernl, (2, 3, 3, 3), "float32")
            result = T.match_buffer(reslt, (10, 2, 124, 124), "float32")
            # intermediate accumulator; matching reslt a second time would alias result
            result_compute = T.alloc_buffer((10, 2, 124, 124), "float32")

            for b, o, h, w in T.grid(10, 2, 124, 124):
                for kc, kh, kw in T.grid(3, 3, 3):
                    with T.block("compute"):
                        # every block axis is bound as a reduction (R) iterator
                        vb, vo, vh, vw, vkc, vkh, vkw = T.axis.remap("RRRRRRR", [b, o, h, w, kc, kh, kw])
                        result_compute[vb, vo, vh, vw] = result_compute[vb, vo, vh, vw] + input[vb, vkc, vh + vkh, vw + vkw] * kernel[vo, vkc, vkh, vkw]

            for b, o, h, w in T.grid(10, 2, 124, 124):
                with T.block("result"):
                    vb = T.axis.reduce(10, b)
                    vc_o = T.axis.reduce(2, o)
                    vh = T.axis.reduce(124, h)
                    vw = T.axis.reduce(124, w)
                    result[vb, vc_o, vh, vw] = result_compute[vb, vc_o, vh, vw]

    sch = tvm.tir.Schedule(Convolution)
    b_i, o_i, h_i, w_i, kc_i, kh_i, kw_i = sch.get_loops(sch.get_block("compute"))
    sch.parallel(b_i)  # parallelizing the batch loop raises the error above
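
For comparison, here is a variant of the compute block that I would expect to be parallelizable, with only the kernel axes bound as reductions (a sketch based on my reading of block axis kinds, not something I have verified):

    for b, o, h, w in T.grid(10, 2, 124, 124):
        for kc, kh, kw in T.grid(3, 3, 3):
            with T.block("compute"):
                # spatial (S) axes for the output dimensions, reduction (R) for the kernel
                vb, vo, vh, vw, vkc, vkh, vkw = T.axis.remap("SSSSRRR", [b, o, h, w, kc, kh, kw])
                with T.init():
                    result_compute[vb, vo, vh, vw] = T.float32(0)
                result_compute[vb, vo, vh, vw] = result_compute[vb, vo, vh, vw] + input[vb, vkc, vh + vkh, vw + vkw] * kernel[vo, vkc, vkh, vkw]

    # the axes of the "result" copy block would likewise become T.axis.spatial,
    # after which sch.parallel(b_i) on the batch loop should be accepted

Is the all-reduction axis binding the cause of the error, or is something else needed to get compact dataflow here?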

Best regards