Understanding "Output must be attached at root" error

I am getting the following error and I am trying to understand exactly what is going on.

Check failed: stage.GetAttachSpec()->attach_type == kGroupRoot (4 vs. 1) : Output must be attached at root

I have the following schedule:

@main = primfn(placeholder_2: handle, placeholder_3: handle, res_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {res: Buffer(res_2: Pointer(uint8), uint8, [1, 16, 16], []),
             placeholder_1: Buffer(placeholder_4: Pointer(uint8), uint8, [16, 16], []),
             placeholder: Buffer(placeholder_5: Pointer(uint8), uint8, [16, 16], [])}
  buffer_map = {placeholder_2: placeholder, placeholder_3: placeholder_1, res_1: res} {
   {
    for (x_o.c.outer: int32, 0, 2) {
      for (y_o.c.outer: int32, 0, 2) {
        for (x_o.c.inner.init: int32, 0, 8) {
          for (y_o.c.inner.init: int32, 0, 8) {
            res.local.accumulator: Pointer(local.accumulator uint8)[((((x_o.c.outer*128) + (x_o.c.inner.init*16)) + (y_o.c.outer*8)) + y_o.c.inner.init)] = 0u8
          }
        }
        for (k_o.outer: int32, 0, 2) {
          for (ax0: int32, 0, 8) {
            for (ax1: int32, 0, 8) {
              placeholder.local.scratchpad: Pointer(local.scratchpad uint8)[((ax0*8) + ax1)] = (uint8*)placeholder_5[((((x_o.c.outer*128) + (ax0*16)) + (k_o.outer*8)) + ax1)]
            }
          }
          for (ax0_1: int32, 0, 8) {
            for (ax1_1: int32, 0, 8) {
              placeholder.local.scratchpad_weight: Pointer(local.scratchpad_weight uint8)[((ax0_1*8) + ax1_1)] = (uint8*)placeholder_4[((((y_o.c.outer*128) + (ax0_1*16)) + (k_o.outer*8)) + ax1_1)]
            }
          }
          for (x_o.c.inner: int32, 0, 8) {
            for (y_o.c.inner: int32, 0, 8) {
              for (k_o.inner: int32, 0, 8) {
                res.local.accumulator[((((x_o.c.outer*128) + (x_o.c.inner*16)) + (y_o.c.outer*8)) + y_o.c.inner)] = ((uint8*)res.local.accumulator[((((x_o.c.outer*128) + (x_o.c.inner*16)) + (y_o.c.outer*8)) + y_o.c.inner)] + ((uint8*)placeholder.local.scratchpad[((x_o.c.inner*8) + k_o.inner)]*(uint8*)placeholder.local.scratchpad_weight[((y_o.c.inner*8) + k_o.inner)]))
              }
            }
          }
        }
      }
    }
    for (x_o: int32, 0, 16) {
      for (y_o: int32, 0, 16) {
        res_2[((x_o*16) + y_o)] = (uint8*)res.local.accumulator[((x_o*16) + y_o)]
      }
    }
  }
}

I need to move the last loop nest, the one that assigns res_2, inside the y_o.c.outer loop. The problem is that when I try to do this with compute_at, I get the error above. Is there a workaround to achieve this?

The main idea is that this schedule tiles a simple matrix multiplication, and I need to copy the accumulated output from the accelerator's local memory to global memory before the next y_o.c.outer iteration starts.
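
To make the intended loop structure concrete, here is a rough sketch in plain Python (not TVM code). It only illustrates where the write-back should sit relative to the y_o.c.outer loop. The names tiled_matmul_u8, N and TILE are made up for this illustration, and the per-tile 8x8 accumulator is an assumption; the lowered code above uses a full 16x16 accumulator.

```python
N, TILE = 16, 8  # matrix size and tile size, read off the loop extents above


def tiled_matmul_u8(a, b):
    """res[x][y] = sum_k a[x][k] * b[y][k], wrapped to uint8, computed tile by tile."""
    res = [[0] * N for _ in range(N)]  # stands in for global memory (res_2)
    for xo in range(N // TILE):        # x_o.c.outer
        for yo in range(N // TILE):    # y_o.c.outer
            # Local accumulator for one output tile (assumed 8x8 here).
            acc = [[0] * TILE for _ in range(TILE)]
            for ko in range(N // TILE):  # k_o.outer
                # Stage one 8x8 block of each input, like the scratchpad copies.
                a_blk = [[a[xo * TILE + i][ko * TILE + k] for k in range(TILE)]
                         for i in range(TILE)]
                b_blk = [[b[yo * TILE + j][ko * TILE + k] for k in range(TILE)]
                         for j in range(TILE)]
                for xi in range(TILE):
                    for yi in range(TILE):
                        for ki in range(TILE):
                            # Mask to 8 bits to mimic uint8 wrap-around.
                            acc[xi][yi] = (acc[xi][yi]
                                           + a_blk[xi][ki] * b_blk[yi][ki]) & 0xFF
            # Desired placement: write the finished tile back to global memory
            # here, before the next y_o.c.outer iteration begins.
            for xi in range(TILE):
                for yi in range(TILE):
                    res[xo * TILE + xi][yo * TILE + yi] = acc[xi][yi]
    return res
```

In this form the copy to res happens once per (x_o.c.outer, y_o.c.outer) tile instead of once after all tiles have been computed.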

Thanks!

Hi, I am working on getting the same functionality in TE. It looks like it is possible with TIR, but I could not find a workaround in TE.