Does fusion conflict with linear detection?

In affine loops, linear detection is an important analysis. Consider a simple index expression such as a*i + b*j + c*k + d: the linear detection analysis can extract the coefficient of each loop variable, giving [a, b, c, d].
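As a small sketch of what I mean (using the old flat tvm namespace, the same style as the reproducer below; the symbolic coefficients a, b, c, d are only for illustration):

import tvm

i, j, k = tvm.var("i"), tvm.var("j"), tvm.var("k")
a, b, c, d = tvm.var("a"), tvm.var("b"), tvm.var("c"), tvm.var("d")

# A purely affine index expression in the loop variables i, j, k.
expr = a * i + b * j + c * k + d

# One coefficient per loop variable plus the constant term: expect [a, b, c, d].
print(tvm.arith.DetectLinearEquation(expr, [i, j, k]))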

On the other hand, fusion is an important prerequisite for parallelizing loops. For example, in a Conv operator the H and W loops are often fused, and the fused loop is annotated as parallel.
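The pattern I mean looks roughly like this; a toy elementwise op stands in for the real Conv, and the shapes and names are only illustrative:

import tvm

data = tvm.placeholder((1, 64, 56, 56), name="data")
out = tvm.compute((1, 64, 56, 56), lambda n, co, h, w: data[n, co, h, w] * 2.0, name="out")

s = tvm.create_schedule(out.op)
n, co, h, w = out.op.axis
# Fuse the spatial loops and mark the fused loop as parallel.
hw = s[out].fuse(h, w)
s[out].parallel(hw)
print(tvm.lower(s, [data, out], simple_mode=True))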

However, when these two come together, fusion completely kills linear detection. Say we fuse i and j: the index becomes (ij/jext)*a + (ij%jext)*b + c*k + d, where ij is the fused induction variable and jext is the old trip count of loop j. Because of the / and %, the expression is no longer linear, so any optimization that relies on this property is disabled.
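Sketching the same index symbolically (this assumes tvm.floordiv / tvm.floormod are exposed at the top level, as in the build that produced the output below):

import tvm

ij, k = tvm.var("ij"), tvm.var("k")
a, b, c, d, jext = tvm.var("a"), tvm.var("b"), tvm.var("c"), tvm.var("d"), tvm.var("jext")

# After fusing i and j, the original indices are recovered via div/mod.
fused_index = tvm.floordiv(ij, jext) * a + tvm.floormod(ij, jext) * b + c * k + d

# The div/mod terms are not linear in ij, so detection fails: expect [].
print(tvm.arith.DetectLinearEquation(fused_index, [ij, k]))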

Is it possible to have something like postponed fusion: when building the vanilla loop nest, the loops to be fused are still kept separate, and only after the transformations that require linear detection are done do we fuse those loop levels?

Hi, your issue is interesting and I would like to analyze it. Could you please provide some small reproducible code and a pointer to the exact issue you faced? Thanks!

import tvm

a = tvm.placeholder((16, 16))
b = tvm.placeholder((16, 16))
c = tvm.compute((16, 16), lambda x, y: a[x, y] + b[x, y])

x, y = c.op.axis

# Vanilla schedule: the two loops stay separate, so the store index stays linear.
vanilla = tvm.create_schedule(c.op)
ir = tvm.lower(vanilla, [a, b, c], simple_mode=True)
print(ir)
x_var, y_var = ir.body.loop_var, ir.body.body.loop_var
print(tvm.arith.DetectLinearEquation(ir.body.body.body.index, [x_var, y_var]))

# Split x, reorder, then fuse the outer x loop with y: recovering the original
# indices now needs floordiv/floormod, so the store index is no longer linear.
fused = tvm.create_schedule(c.op)
xo, xi = fused[c].split(x, 4)
fused[c].reorder(xo, y, xi)
xy = fused[c].fuse(xo, y)
ir = tvm.lower(fused, [a, b, c], simple_mode=True)
print(ir)
fused_var, inner_var = ir.body.loop_var, ir.body.body.loop_var
print(tvm.arith.DetectLinearEquation(ir.body.body.body.index, [fused_var, inner_var]))
produce compute {
  for (x, 0, 16) {
    for (y, 0, 16) {
      compute[((x*16) + y)] = (placeholder[((x*16) + y)] + placeholder[((x*16) + y)])
    }
  }
}

[16, 1, 0]
produce compute {
  for (x.outer.y.fused, 0, 64) {
    for (x.inner, 0, 4) {
      compute[(((floordiv(x.outer.y.fused, 16)*64) + (x.inner*16)) + floormod(x.outer.y.fused, 16))] = (placeholder[(((floordiv(x.outer.y.fused, 16)*64) + (x.inner*16)) + floormod(x.outer.y.fused, 16))] + placeholder[(((floordiv(x.outer.y.fused, 16)*64) + (x.inner*16)) + floormod(x.outer.y.fused, 16))])
    }
  }
}

[]

Can you see the example above?

Thanks a lot! I am analyzing the issue. Will post my findings.

Hi Jian Weng, I have checked the code you shared, and I understand your concern now. But I wonder whether such a scenario arises in real workloads.

The complex index access logic appears only because you reordered the split loop variables here:

fused[c].reorder(xo, y, xi)

I experimented with a few combinations in your code and also with a Conv layer. My observation is that if the fusion is between immediately adjacent loop variables (in their original order), then tvm.arith.DetectLinearEquation() detects the coefficients perfectly well, and I believe that is the normal case; see the sketch below.
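For example, a minimal check along these lines (it reuses the IR access pattern from your code; the expected result assumes the simplifier folds the div/mod pair back into the fused variable):

import tvm

a = tvm.placeholder((16, 16))
b = tvm.placeholder((16, 16))
c = tvm.compute((16, 16), lambda x, y: a[x, y] + b[x, y])

s = tvm.create_schedule(c.op)
x, y = c.op.axis
# Fuse the two adjacent axes directly, with no split or reorder in between.
xy = s[c].fuse(x, y)
ir = tvm.lower(s, [a, b, c], simple_mode=True)
print(ir)
fused_var = ir.body.loop_var
# floordiv(xy, 16)*16 + floormod(xy, 16) simplifies back to xy, so the index
# stays linear and detection should report something like [1, 0].
print(tvm.arith.DetectLinearEquation(ir.body.body.index, [fused_var]))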

If you want, I can share my full test code as well.

However, your scenario looks quite special: loop variables that are not immediate neighbours get fused, so the indexing logic becomes complex.

Maybe because of my limited knowledge, I am unable to picture real user scenarios for such cases.

If you don't mind, could you please enlighten me on this point? Thanks a lot!