I’ve tracked it down to the behaviour of preserve_unit_iters in the DeriveBlockBinding function in blockize_tensorize.cc. The unit iterator is preserved, but outside the generated block. If I then try to tensorize the block, it contains one fewer iterator than I would expect, which causes the issue.
I tried a naive fix for this behaviour: adding a unit iterator inside the block as well, so that there is one unit iterator outside the block and one inside. This doesn’t seem to be the correct way to handle it, though, as it also results in an error:
Error message: The stmt tir.Block#0 doesn't match the tensor intrin
The pattern attempting to be matched:
with T.block("a_in_local.spad", no_realize=True):
    v0_i = T.axis.spatial(1)
    v1_i = T.axis.spatial(16)
    a_in = T.Buffer((1, 1024), "int8")
    v1_o = T.int32()
    T.reads(a_in[0, v1_o * 16 + v1_i])
    a_in_local_spad = T.Buffer((1, 1024), "int8", scope="local.spad")
    T.writes(a_in_local_spad[0, v1_o * 16 + v1_i])
    a_in_local_spad[0, v1_o * 16 + v1_i] = a_in[0, v1_o * 16 + v1_i]
Does not match the tensorize description:
with T.block("", no_realize=True):
    vr = T.axis.spatial(1)
    vc = T.axis.spatial(16)
    Src = T.Buffer((1, 16), "int8", offset_factor=1)
    T.reads(Src[vr, vc])
    Dest = T.Buffer((1, 16), "int8", scope="local.spad", offset_factor=1)
    T.writes(Dest[vr, vc])
    Dest[vr, vc] = Src[vr, vc]
CompareBufferRegion buffer region min mismatch. lhs->region[i + offset]=I.Range(0, 1) vs rhs->region[i]=range(min=vr, ext=1)Range(0x5586bbd36b70)
BlockNode write buffers do not match: op->writes=[a_in_local_spad[0, v1_o * 16 + v1_i]] vs rhs->writes=[Dest[vr, vc]]
Interestingly, this also seems to be the error I ran into when trying to tensorize a conv2d operator.
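As a rough illustration of the "buffer region min mismatch" reported in the log above, here is a toy model in plain Python. It is not TVM's actual comparator: the names Var and region_min_matches are hypothetical, and the positional pairing of vr with v0_i is an assumption made only for this sketch. It shows why a region whose minimum is the literal 0 fails to line up with a description whose minimum is the variable vr, even though both have extent 1.

```python
# Toy model only: this does NOT reproduce TVM's tensorize comparator.
# Var and region_min_matches are hypothetical names for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A symbolic iterator variable, identified by name."""
    name: str


def region_min_matches(block_min, desc_min, var_map):
    """Return True if the description's region minimum maps onto the block's."""
    if isinstance(desc_min, Var):
        # A description variable must correspond to the block iterator
        # it was paired with; a literal such as 0 does not qualify.
        return block_min == var_map.get(desc_min)
    return block_min == desc_min


v0_i, vr = Var("v0_i"), Var("vr")
var_map = {vr: v0_i}  # assumed pairing of description and block iterators

# Pattern from the error: the first index is the literal 0, not v0_i.
literal_case = region_min_matches(0, vr, var_map)
# In this toy model, indexing with the unit iterator itself would match.
iterator_case = region_min_matches(v0_i, vr, var_map)

print(literal_case, iterator_case)
```

In this simplified view, adding a unit iterator to the block is not enough on its own: the buffer accesses would also have to use that iterator in place of the constant 0 for the first dimension to compare equal.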