Please try:
with bb.function("main"):
    with bb.dataflow():
        gv = bb.add_func(matmul_fp32, "matmul_fp32")
        C = bb.emit_output(
            relax.call_tir(
                gv,
                [A, B],
                relax.TensorStructInfo((128, 128), "float32"),
            )
        )
    bb.emit_func_output(C, params=[A, B])
Writing the whole IRModule in TVMScript also helps.
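For reference, here is a rough sketch of what the TVMScript version could look like. The body of matmul_fp32 is not shown in this thread, so the plain 128x128 float32 matmul below is an assumption, as are the (128, 128) shapes of A and B. The build_module wrapper is a hypothetical helper, used only so that TVM is imported when the function is called.

```python
def build_module():
    """Return an IRModule holding a TIR matmul and a Relax main that calls it."""
    # Imported lazily so this file can be loaded without TVM installed.
    from tvm.script import ir as I, relax as R, tir as T

    @I.ir_module
    class Module:
        # Assumed kernel body: a plain 128x128 float32 matmul.
        @T.prim_func
        def matmul_fp32(
            A: T.Buffer((128, 128), "float32"),
            B: T.Buffer((128, 128), "float32"),
            C: T.Buffer((128, 128), "float32"),
        ):
            for i, j, k in T.grid(128, 128, 128):
                with T.block("C"):
                    vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                    with T.init():
                        C[vi, vj] = T.float32(0)
                    C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

        @R.function
        def main(
            A: R.Tensor((128, 128), "float32"),
            B: R.Tensor((128, 128), "float32"),
        ) -> R.Tensor((128, 128), "float32"):
            cls = Module
            with R.dataflow():
                # Same call as the BlockBuilder version: call_tir on the PrimFunc.
                C = R.call_tir(
                    cls.matmul_fp32,
                    (A, B),
                    out_sinfo=R.Tensor((128, 128), "float32"),
                )
                R.output(C)
            return C

    return Module
```

Calling build_module().show() should print the module, which makes it easy to compare against what bb.get() produces from the BlockBuilder version.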