TIR - Parallel primitive

Hello,

we are trying to implement multi-core behavior in a TensorIR schedule; however, the following error occurs when applying the parallel primitive to a loop.

Expected behavior

The loop is distributed over multiple compute units (cores) and executed in parallel.
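For reference, here is a minimal element-wise example (the `ElementwiseCopy` name and the shapes are only for illustration) where the parallel primitive works for us; this is the kind of behavior we would like to get for the convolution below:

    import tvm
    from tvm.script import tir as T

    @tvm.script.ir_module
    class ElementwiseCopy:
        @T.prim_func
        def main(a: T.handle, b: T.handle):
            T.func_attr({"global_symbol": "main", "tir.noalias": True})
            A = T.match_buffer(a, (1024,), "float32")
            B = T.match_buffer(b, (1024,), "float32")
            for i in T.serial(1024):
                with T.block("copy"):
                    vi = T.axis.spatial(1024, i)
                    B[vi] = A[vi]

    sch = tvm.tir.Schedule(ElementwiseCopy)
    (i,) = sch.get_loops(sch.get_block("copy"))
    sch.parallel(i)  # works: the "copy" block only has spatial axes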

Actual behavior

Error message: The queried subtree root tir.For#0 in SRef tree does not have compact dataflow, because its child block tir.Block#1 on SRef tree is neither a local complete block nor a local reduction block.

Environment

tvm - 0.9.dev0

Steps to reproduce

    import tvm
    from tvm.script import tir as T


    @tvm.script.ir_module
    class Convolution:
        @T.prim_func
        def main(inpt: T.handle, kernl: T.handle, reslt: T.handle):
            T.func_attr({"global_symbol": "main", "tir.noalias": True})
            input = T.match_buffer(inpt, (10, 3, 128, 128), "float32")
            kernel = T.match_buffer(kernl, (2, 3, 3, 3), "float32")
            result = T.match_buffer(reslt, (10, 2, 124, 124), "float32")
            result_compute = T.match_buffer(reslt, (10, 2, 124, 124), "float32")

            # Accumulate the convolution over the kernel window into result_compute
            for b, o, h, w in T.grid(10, 2, 124, 124):
                for kc, kh, kw in T.grid(3, 3, 3):
                    with T.block("compute"):
                        b, o, h, w, kc, kh, kw = T.axis.remap("RRRRRRR", [b, o, h, w, kc, kh, kw])
                        result_compute[b, o, h, w] += input[b, kc, h + kh, w + kw] * kernel[o, kc, kh, kw]

            # Copy the accumulated values into the output buffer
            for b, o, h, w in T.grid(10, 2, 124, 124):
                with T.block("result"):
                    vb = T.axis.reduce(10, b)
                    vc_o = T.axis.reduce(2, o)
                    vh = T.axis.reduce(124, h)
                    vw = T.axis.reduce(124, w)
                    result[vb, vc_o, vh, vw] = result_compute[vb, vc_o, vh, vw]


    written_ir = Convolution
    sch = tvm.tir.Schedule(written_ir)

    b_i, o_i, h_i, w_i, kc_i, kh_i, kw_i = sch.get_loops(sch.get_block("compute"))
    sch.parallel(b_i)
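From the error message, our guess is that the problem is that every block iterator of "compute" is bound as a reduction axis ("RRRRRRR"), so the block is neither a complete block nor a reduction block. Below is a sketch of how we think the compute block would have to look (spatial axes for b/o/h/w, reductions only for kc/kh/kw, and result_compute allocated with T.alloc_buffer instead of aliasing reslt) so that sch.parallel(b_i) becomes legal. This is only our assumption, not a confirmed fix; is this the intended way to write it, or are we missing a schedule primitive?

    # Sketch only: our assumption of how the "compute" block should be written
    result_compute = T.alloc_buffer((10, 2, 124, 124), "float32")

    for b, o, h, w, kc, kh, kw in T.grid(10, 2, 124, 124, 3, 3, 3):
        with T.block("compute"):
            # b/o/h/w are data-parallel (spatial) axes, kc/kh/kw are reductions
            vb, vo, vh, vw, vkc, vkh, vkw = T.axis.remap("SSSSRRR", [b, o, h, w, kc, kh, kw])
            with T.init():
                result_compute[vb, vo, vh, vw] = T.float32(0)
            result_compute[vb, vo, vh, vw] += input[vb, vkc, vh + vkh, vw + vkw] * kernel[vo, vkc, vkh, vkw]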

Best regards