Thanks for asking!
To be clear, compute-at does not have to break block isolation. For example, after compute-at the IR may become:
    for i in tir.range(0, 128):
        for j in tir.range(0, 128):
            with tir.block([128, 128], "A_Block") as [vi, vj]:
                A[vi, vj] = tir.float32(0)
            with tir.block([128, 128], "B_Block") as [vi, vj]:
                B[vi, vj] = A[vi, vj]
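
To make the point concrete, here is a rough plain-Python sketch (not TVM code) that mimics the loop nest above. The helper names `a_block`, `b_block`, `run_separate`, and `run_fused` are made up for illustration. Each block body is a function of its own `(vi, vj)` only, so moving the two blocks under the same loops changes where they are called but not what they compute:

```python
N = 128  # extent of both axes, matching the 128 x 128 blocks above


def a_block(A, vi, vj):
    # Stand-in for the body of "A_Block": writes only A[vi][vj].
    A[vi][vj] = 0.0


def b_block(A, B, vi, vj):
    # Stand-in for the body of "B_Block": reads A[vi][vj], writes B[vi][vj].
    B[vi][vj] = A[vi][vj]


def new_buffer():
    # None marks elements that have not been written yet.
    return [[None] * N for _ in range(N)]


def run_separate():
    # Before compute-at (assumed starting point): each block has its own loop nest.
    A, B = new_buffer(), new_buffer()
    for i in range(N):
        for j in range(N):
            a_block(A, i, j)
    for i in range(N):
        for j in range(N):
            b_block(A, B, i, j)
    return B


def run_fused():
    # After compute-at: both blocks sit under the same i, j loops.
    # The block bodies are untouched; only the surrounding loops moved.
    A, B = new_buffer(), new_buffer()
    for i in range(N):
        for j in range(N):
            a_block(A, i, j)
            b_block(A, B, i, j)
    return B


# Both orderings fill B with the same values.
assert run_separate() == run_fused()
```

The block functions never look at the outer loop variables directly, which is roughly what block isolation means here: the schedule decides how `(vi, vj)` are bound, and the block body stays the same.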