Hi, I’m trying to use the recently merged async pipeline([https://github.com/apache/tvm/pull/12171]). Thanks for the excellent work of @masahi.
I have a question of three_stage_compute in test_tir_transform_inject_software_pipeline.py:
@T.prim_func
def three_stage_compute(A: T.Buffer[(16, 16), "float32"], D: T.Buffer[(16, 16), "float32"]):
for tx in T.thread_binding(0, 16, thread="threadIdx.x"):
for i in T.serial(
0,
16,
annotations={
"software_pipeline_stage": [0, 1, 2],
"software_pipeline_order": [0, 1, 2],
},
):
with T.block("compute"):
T.reads(A[tx, i])
T.writes(D[tx, i])
B = T.alloc_buffer((16, 1), dtype="float32", scope="shared")
C = T.alloc_buffer((16, 1), dtype="float32", scope="shared")
with T.block():
T.reads(A[tx, i])
T.writes(B[tx, 0])
B[tx, 0] = A[tx, i] * T.float32(2)
with T.block():
T.reads(B[tx, 0])
T.writes(C[tx, 0])
C[tx, 0] = B[tx, 0] + T.float32(2)
with T.block():
T.reads(C[tx, 0])
T.writes(D[tx, i])
D[tx, i] = C[tx, 0] + T.float32(1)
If I set the software_pipeline_async_stages = [0, 1, 2] which means 3 stages are all async stage. In this case stage 2 “C[tx, 0] = B[tx, 0] + T.float32(2)” wait for stage 1 “B[tx, 0] = A[tx, i] * T.float32(2)”. Stage 1 is a async producer and stage 2 reads from asynchronously written buffers.
I think there is another wait/commit relations between stage 2 and stage 3 which async stage did not consider. The software pipelined tir is like this:
with T.block("_1"):
T.reads(A[tx, 2 : 16], B[0 : 2, tx, 0], C[0 : 2, tx, 0])
T.writes(B[0 : 2, tx, 0], C[0 : 2, tx, 0], D[tx, 0 : 14])
for i in T.serial(14):
with T.block("_2"):
T.where(i + 2 < 16)
T.reads(A[tx, i + 2])
T.writes(B[i % 2, tx, 0])
B[(i + 2) % 2, tx, 0] = A[tx, i + 2] * T.float32(2)
with T.block("_3"):
T.where(i + 2 - 1 < 16)
T.reads(B[(i + 1) % 2, tx, 0])
T.writes(C[(i + 1) % 2, tx, 0])
C[(i - 1 + 2) % 2, tx, 0] = B[(i - 1 + 2) % 2, tx, 0] + T.float32(2)
with T.block("_4"):
T.where(i + 2 - 2 < 16)
T.reads(C[i % 2, tx, 0])
T.writes(D[tx, i])
D[tx, i - 2 + 2] = C[(i - 2 + 2) % 2, tx, 0] + T.float32(1)
Suppose i = 1, “C[0,tx,0] = B[1,tx,0] + T.float32(2)” should wait for the “D[tx, 1] = C[0, tx, 0] + T.float32(1)” from the i-1 iteration. Because stage 2 can use C[0, tx, 0] only after “D[tx, 1] = C[0, tx, 0] + T.float32(1)” is completed.
I don’t know if this requirement is HW dependent. DMA read / compute / DMA write are the three stages in my case. So the compute stage should wait for both the DMA read and DMA write in previous iteration.
Correct me if I misunderstood asynchronous stage in software pipeline. Thanks.