I am getting the following error and I am trying to understand exactly what is going on.
Check failed: stage.GetAttachSpec()->attach_type == kGroupRoot (4 vs. 1) : Output must be attached at root
This is the lowered TIR for my schedule:
@main = primfn(placeholder_2: handle, placeholder_3: handle, res_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {res: Buffer(res_2: Pointer(uint8), uint8, [1, 16, 16], []),
             placeholder_1: Buffer(placeholder_4: Pointer(uint8), uint8, [16, 16], []),
             placeholder: Buffer(placeholder_5: Pointer(uint8), uint8, [16, 16], [])}
  buffer_map = {placeholder_2: placeholder, placeholder_3: placeholder_1, res_1: res} {
  {
    for (x_o.c.outer: int32, 0, 2) {
      for (y_o.c.outer: int32, 0, 2) {
        for (x_o.c.inner.init: int32, 0, 8) {
          for (y_o.c.inner.init: int32, 0, 8) {
            res.local.accumulator: Pointer(local.accumulator uint8)[((((x_o.c.outer*128) + (x_o.c.inner.init*16)) + (y_o.c.outer*8)) + y_o.c.inner.init)] = 0u8
          }
        }
        for (k_o.outer: int32, 0, 2) {
          for (ax0: int32, 0, 8) {
            for (ax1: int32, 0, 8) {
              placeholder.local.scratchpad: Pointer(local.scratchpad uint8)[((ax0*8) + ax1)] = (uint8*)placeholder_5[((((x_o.c.outer*128) + (ax0*16)) + (k_o.outer*8)) + ax1)]
            }
          }
          for (ax0_1: int32, 0, 8) {
            for (ax1_1: int32, 0, 8) {
              placeholder.local.scratchpad_weight: Pointer(local.scratchpad_weight uint8)[((ax0_1*8) + ax1_1)] = (uint8*)placeholder_4[((((y_o.c.outer*128) + (ax0_1*16)) + (k_o.outer*8)) + ax1_1)]
            }
          }
          for (x_o.c.inner: int32, 0, 8) {
            for (y_o.c.inner: int32, 0, 8) {
              for (k_o.inner: int32, 0, 8) {
                res.local.accumulator[((((x_o.c.outer*128) + (x_o.c.inner*16)) + (y_o.c.outer*8)) + y_o.c.inner)] = ((uint8*)res.local.accumulator[((((x_o.c.outer*128) + (x_o.c.inner*16)) + (y_o.c.outer*8)) + y_o.c.inner)] + ((uint8*)placeholder.local.scratchpad[((x_o.c.inner*8) + k_o.inner)]*(uint8*)placeholder.local.scratchpad_weight[((y_o.c.inner*8) + k_o.inner)]))
              }
            }
          }
        }
      }
    }
    for (x_o: int32, 0, 16) {
      for (y_o: int32, 0, 16) {
        res_2[((x_o*16) + y_o)] = (uint8*)res.local.accumulator[((x_o*16) + y_o)]
      }
    }
  }
}
I need to move the last loop nest, the one that writes to res_2, inside the y_o.c.outer loop. The problem is that when I try to do this with compute_at, I get the error above. Is there a workaround to achieve this?
The main idea is that this schedule tiles a simple matrix multiplication, and I need to copy the accumulated output from the accelerator's local memory to global memory before the next iteration of the y_o.c.outer loop starts.
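To make the loop structure I am after concrete, here is a plain-Python model of it. This is only a sketch, not TVM code: N, TILE, and tiled_matmul are names I made up for illustration, the scratchpad loads are left out, and uint8 wraparound is modelled with a modulo.

```python
# Plain-Python model of the desired loop nest (illustrative only, not TVM API).
N, TILE = 16, 8  # 16x16 matrices, 8x8 tiles, matching the loop extents above


def tiled_matmul(a, b):
    """Return res[x][y] = sum_k a[x][k] * b[y][k], wrapped to uint8 range."""
    res = [[0] * N for _ in range(N)]
    for xo in range(N // TILE):          # x_o.c.outer
        for yo in range(N // TILE):      # y_o.c.outer
            # Stand-in for res.local.accumulator, holding a single tile.
            acc = [[0] * TILE for _ in range(TILE)]
            for ko in range(N // TILE):  # k_o.outer
                for xi in range(TILE):
                    for yi in range(TILE):
                        for ki in range(TILE):
                            x = xo * TILE + xi
                            y = yo * TILE + yi
                            k = ko * TILE + ki
                            acc[xi][yi] = (acc[xi][yi] + a[x][k] * b[y][k]) % 256
            # The write-back to global memory sits here, inside the yo loop,
            # so the accumulator is drained before the next tile starts.
            for xi in range(TILE):
                for yi in range(TILE):
                    res[xo * TILE + xi][yo * TILE + yi] = acc[xi][yi]
    return res
```

The only difference from the lowered code above is where the final copy lives: per (x_o.c.outer, y_o.c.outer) tile instead of once at the end over the full 16x16 output.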
Thanks!