Edge cases when using TIR intrinsics

Necrotos · December 5, 2023, 2:53pm

Hi, I have encountered two “edge” cases when working with TIR intrinsics and would like to ask for guidance on how to address them:

1. In my compute intrinsic, when I use sample_perfect_tile, the innermost factor sometimes becomes 1. This causes problems with the intrinsic descriptions, because they use a variable to describe the number of elements they access, e.g. T.reads(A[0:dim_i]), which doesn’t work when dim_i is 1. Can I address this somehow?
2. When tensorizing the data movement instructions, I tried to set the tiling factors manually. However, that sometimes leads to imperfect splitting and introduces a T.where in my schedule. Is there a way to handle this when describing the intrinsics, or is it best to switch to sample_perfect_tile here?
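
To make the second case concrete, here is a minimal sketch in plain Python (no TVM calls) of when a manually chosen factor splits a loop imperfectly. It relies only on the behaviour described above, namely that a factor which does not divide the loop extent leaves a predicate (the T.where) in the schedule. The helper names split_report and divisors_up_to are hypothetical and exist only for this illustration.

```python
def split_report(extent, inner_factor):
    """Describe what splitting a loop of `extent` by `inner_factor` implies.

    Hypothetical helper: it only does the arithmetic, it does not touch TVM.
    """
    outer = -(-extent // inner_factor)  # ceiling division
    covered = outer * inner_factor      # iterations spanned by outer * inner
    return {
        "outer": outer,
        "inner": inner_factor,
        # If the tiles overshoot the extent, the tail has to be guarded by a predicate.
        "needs_predicate": covered != extent,
        "padding": covered - extent,
    }


def divisors_up_to(extent, limit):
    """Inner factors up to `limit` that split `extent` without a remainder."""
    return [f for f in range(1, min(extent, limit) + 1) if extent % f == 0]


if __name__ == "__main__":
    print(split_report(1024, 16))    # divides evenly, no predicate
    print(split_report(1000, 16))    # leaves a guarded tail
    print(divisors_up_to(1000, 16))  # candidate factors that avoid the predicate
```

Checking the factor like this before splitting at least shows which loops will end up with a predicate, and how much padding would be needed to make the split perfect instead.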

Thanks in advance!

Necrotos · January 20, 2024, 11:17pm

I still haven’t been able to figure out how to use tensorization when one of the loops has an extent of one. Does anyone have an idea?
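
One way to sidestep the unit-extent case, rather than solve it, would be to only accept tilings whose innermost factor is larger than 1. The sketch below is plain Python and enumerates such perfect tilings. The function name perfect_tiles and its parameters are hypothetical. Whether a tiling chosen this way can be forced through the decision argument of sample_perfect_tile is an assumption on my part and not something confirmed in this thread.

```python
def perfect_tiles(extent, n, max_innermost=16, min_innermost=2):
    """List every n-way perfect factorization of `extent` whose innermost
    factor lies in [min_innermost, max_innermost].

    Hypothetical helper for illustration; it does not call into TVM.
    """
    def rec(remaining, parts):
        if parts == 1:
            # The last factor is forced: it must use up what is left.
            yield [remaining]
            return
        for f in range(1, remaining + 1):
            if remaining % f == 0:
                for rest in rec(remaining // f, parts - 1):
                    yield [f] + rest

    return [t for t in rec(extent, n) if min_innermost <= t[-1] <= max_innermost]


if __name__ == "__main__":
    print(perfect_tiles(16, 2))                  # tilings with a non-unit innermost factor
    print(perfect_tiles(7, 2, max_innermost=4))  # prime extent: nothing qualifies
```

Note that for some extents (for example primes larger than the allowed innermost factor) the list is empty, so this only avoids the problem where a non-unit tiling exists at all.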

Necrotos · January 24, 2024, 3:33pm

I’ve tracked it down to the behaviour of preserve_unit_iters in the DeriveBlockBinding function in blockize_tensorize.cc. The unit iterator is preserved, but outside the generated block. If I then try to tensorize the block, it contains one fewer iterator than I would expect, which causes the issue.

As a naive workaround, I tried simply adding a unit iterator inside the block as well. This results in one unit iterator outside the block and one inside. But this doesn’t seem to be the correct way to handle it, as it also results in an error:

Error message: The stmt tir.Block#0 doesn't match the tensor intrin
The pattern attempting to be matched:
with T.block("a_in_local.spad", no_realize=True):
    v0_i = T.axis.spatial(1)
    v1_i = T.axis.spatial(16)
    a_in = T.Buffer((1, 1024), "int8")
    v1_o = T.int32()
    T.reads(a_in[0, v1_o * 16 + v1_i])
    a_in_local_spad = T.Buffer((1, 1024), "int8", scope="local.spad")
    T.writes(a_in_local_spad[0, v1_o * 16 + v1_i])
    a_in_local_spad[0, v1_o * 16 + v1_i] = a_in[0, v1_o * 16 + v1_i]
Does not match the tensorize description:
with T.block("", no_realize=True):
    vr = T.axis.spatial(1)
    vc = T.axis.spatial(16)
    Src = T.Buffer((1, 16), "int8", offset_factor=1)
    T.reads(Src[vr, vc])
    Dest = T.Buffer((1, 16), "int8", scope="local.spad", offset_factor=1)
    T.writes(Dest[vr, vc])
    Dest[vr, vc] = Src[vr, vc]
CompareBufferRegion buffer region min mismatch. lhs->region[i + offset]=I.Range(0, 1) vs rhs->region[i]=range(min=vr, ext=1)Range(0x5586bbd36b70)
BlockNode write buffers do not match: op->writes=[a_in_local_spad[0, v1_o * 16 + v1_i]] vs rhs->writes=[Dest[vr, vc]]

Interestingly, this also seems to be the error I ran into when trying to tensorize a conv2d operator.