Hello,
we are trying to implement multi-core behavior in a TensorIR schedule; however, the following error occurs when we apply the parallel primitive to a loop dimension.
Expected behavior
Distribute loops over multiple compute units (cores) in parallel.
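For reference, our understanding of the intended usage, shown here as a minimal sketch on a hypothetical elementwise workload (not our actual kernel), is roughly the following; we assume this is the pattern the parallel primitive is designed for:

import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class Elementwise:
    @T.prim_func
    def main(a: T.handle, b: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, (1024,), "float32")
        B = T.match_buffer(b, (1024,), "float32")
        for i in T.serial(1024):
            with T.block("add"):
                vi = T.axis.spatial(1024, i)
                B[vi] = A[vi] + T.float32(1)

sch = tvm.tir.Schedule(Elementwise)
# Expectation: the spatial loop is marked parallel and distributed over cores.
(i,) = sch.get_loops(sch.get_block("add"))
sch.parallel(i)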
Actual behavior
Error message: The queried subtree root tir.For#0 in SRef tree does not have compact dataflow, because its child block tir.Block#1 on SRef tree is neither a local complete block nor a local reduction block.
Environment
tvm - 0.9.dev0
Steps to reproduce
import tvm
from tvm.script import tir as T


@tvm.script.ir_module
class Convolution:
    @T.prim_func
    def main(inpt: T.handle, kernl: T.handle, reslt: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        input = T.match_buffer(inpt, (10, 3, 128, 128), "float32")
        kernel = T.match_buffer(kernl, (2, 3, 3, 3), "float32")
        result = T.match_buffer(reslt, (10, 2, 124, 124), "float32")
        result_compute = T.match_buffer(reslt, (10, 2, 124, 124), "float32")
        # Convolution: accumulate over kernel channels, height, and width.
        for b, o, h, w in T.grid(10, 2, 124, 124):
            for kc, kh, kw in T.grid(3, 3, 3):
                with T.block("compute"):
                    b, o, h, w, kc, kh, kw = T.axis.remap("RRRRRRR", [b, o, h, w, kc, kh, kw])
                    result_compute[b, o, h, w] += input[b, kc, h + kh, w + kw] * kernel[o, kc, kh, kw]
        # Copy the accumulated values into the output buffer.
        for b, o, h, w in T.grid(10, 2, 124, 124):
            with T.block("result"):
                vb = T.axis.reduce(10, b)
                vc_o = T.axis.reduce(2, o)
                vh = T.axis.reduce(124, h)
                vw = T.axis.reduce(124, w)
                result[vb, vc_o, vh, vw] = result_compute[vb, vc_o, vh, vw]


written_ir = Convolution
sch = tvm.tir.Schedule(written_ir)
b_i, o_i, h_i, w_i, kc_i, kh_i, kw_i = sch.get_loops(sch.get_block("compute"))
sch.parallel(b_i)  # fails with the compact dataflow error above
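We suspect the problem may be that every loop, including the batch and output dimensions, is bound as a reduction axis, so the "compute" block is neither a complete block nor a proper reduction block. An untested variant we are considering (spatial axes bound as "S", reduction axes as "R", plus an explicit T.init) would look like this:

# Untested sketch of the "compute" block with spatial and reduction axes
# separated and an explicit init, so that it may qualify as a reduction block.
for b, o, h, w, kc, kh, kw in T.grid(10, 2, 124, 124, 3, 3, 3):
    with T.block("compute"):
        vb, vo, vh, vw, vkc, vkh, vkw = T.axis.remap("SSSSRRR", [b, o, h, w, kc, kh, kw])
        with T.init():
            result_compute[vb, vo, vh, vw] = T.float32(0)
        result_compute[vb, vo, vh, vw] = (
            result_compute[vb, vo, vh, vw]
            + input[vb, vkc, vh + vkh, vw + vkw] * kernel[vo, vkc, vkh, vkw]
        )

Is this the intended way to write the block, and would sch.parallel then be allowed on the outer spatial loop b?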
Best regards