Loop partitioning, padding and tensorization

Hi all,

While working on accelerating AArch64 through tensorization, I ran into an issue.

Basically, I am padding my input tensor so that tensorize can work (I need the rows to be a multiple of 4 and the columns to be a multiple of 16).

However, bound inference removes the padding (since it is never read) and, when I tile the computation, tir.likely statements appear. As a result, tensorize produces the error:

TVMError: Tensorize failed, split condition tir.likely(((...)) relies on var defined inside tensorize scope
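For reference, here is a minimal sketch of the pattern that triggers it (made-up shapes, and without my actual intrinsic): splitting an axis whose extent is not a multiple of the split factor introduces the tir.likely guards, and tensorizing the guarded inner loop then fails.

```python
import tvm
from tvm import te

# Illustrative sizes: M and N are deliberately not multiples of the tile sizes.
M, N, K = 62, 62, 64
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
io, ii = s[C].split(C.op.axis[0], factor=4)    # 62 % 4 != 0 -> tir.likely guard
jo, ji = s[C].split(C.op.axis[1], factor=16)   # 62 % 16 != 0 -> tir.likely guard
s[C].reorder(io, jo, ii, ji)
# s[C].tensorize(ii, gemm_4x16_intrin)  # hypothetical intrinsic; this is where the
#                                       # "split condition tir.likely(...)" error appears
print(tvm.lower(s, [A, B, C], simple_mode=True))
```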

One workaround is to add a (sufficiently complex) zero term built from a padding element, to trick bound inference into keeping the padded region (see for example here).
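For anyone not familiar with that trick, this is roughly the idea. It is a hedged sketch with illustrative shapes and names, not the exact upstream code: a "complex zero" built from a padded element is added to the consumer, so the padded region counts as used and bound inference cannot shrink it away.

```python
import tvm
from tvm import te, tir, topi

M, K, N = 62, 60, 64                   # unpadded sizes (illustrative)
M_pad, K_pad = 64, 64                  # rows -> multiple of 4, cols -> multiple of 16
A = te.placeholder((M, K), name="A", dtype="int8")
B = te.placeholder((K_pad, N), name="B", dtype="int8")

# Pad the data tensor up to the tile-friendly shape.
A_pad = topi.nn.pad(A, [0, 0], [M_pad - M, K_pad - K], name="A_pad")

k = te.reduce_axis((0, K_pad), name="k")
C_pad = te.compute(
    (M_pad, N),
    lambda i, j: te.sum(
        A_pad[i, k].astype("int32") * B[k, j].astype("int32"), axis=k
    ),
    name="C_pad",
)

# "Sufficiently complex" zero built from a padded element: it evaluates to 0 but
# is not folded away, so the padded rows of C_pad are considered used.
zero = (
    tir.const(1, "int32") * C_pad[M_pad - 1, N - 1]
    - tir.const(1, "int32") * C_pad[M_pad - 1, N - 1]
)

# Unpadded output that still "touches" the padded region through `zero`.
C = te.compute((M, N), lambda i, j: C_pad[i, j] + zero, name="C")
```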

However, this is very hacky and not meant to last: as bound inference gets smarter, it might detect that the added term is zero and remove the padding again.

The question is: should we try to come up with a “good” solution for this?

One idea might be to let tensorize accept tir.likely statements and replace them with a "variable-size" tensorization provided by the developer.

For instance, we might add an _intrin_func_variable private function that gets called only when a variable tensorization (i.e., a tensorization over a tir.likely-guarded loop) is needed; a sketch of what this could look like is below.
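To make the proposal concrete, something along these lines (purely illustrative pseudocode; neither intrin_func_variable nor the extra decl_tensor_intrin argument exists in TVM today):

```python
def _intrin_func(ins, outs):
    # existing fixed-size body: emits the full 4x16 GEMM intrinsic
    pass

def _intrin_func_variable(ins, outs, rows, cols):
    # hypothetical variable-size body: `rows`/`cols` would come from the
    # tir.likely condition, so the intrinsic could emit predicated or scalar
    # code for the partial tile at the borders
    pass

# hypothetical extension of the existing API (not real TVM code):
# gemm_4x16 = te.decl_tensor_intrin(
#     C.op, _intrin_func, intrin_func_variable=_intrin_func_variable
# )
```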

I have also read through this post, but it doesn't seem to arrive at a concrete solution.

Any ideas?

@anijain2305, @FrozenGene, @mbaret, @tqchen , @ramana-arm

How about using the Relay Legalize pass to add explicit padding at the graph level?

Hi Animesh, the problem is that I need the padding to be added in the middle of TIR, on my (transformed) data tensor.

I.e., something like

A1 = im2col(A)
A2 = pad(A1)
C_padded = te.compute([M, N], lambda i, j: te.sum(A2[i, k] * B[k, j], axis=k))
C = unpad(C_padded) + requantization

Then I tile on C and tensorize the inner tile (which is where the problem started). Note that I cannot fuse the requantization into the main computation because of the unpadding step.

Also, it would be nice not to pad A at all, and instead work on a solution that automatically detects the borders and invokes a different kind of tensorization for them (if provided), or falls back to scalar computation for the borders (if it is not).

This way I would not need unpadding, and the computation could become:

A1 = im2col(A)
C = te.compute([M, N], lambda i, j: te.sum(A1[i, k] * B[k, j], axis=k)) + requantization  # tensorization handles the borders automatically

What do you think?


Bumping this: I am currently facing the same issue… Have you found a solution?