Accelerators such as VTA, TPU, and many mobile NN accelerators lack control flow support, for example handling of if conditions. It therefore becomes the compiler's responsibility to resolve these ‘if’ conditions at compile time and produce appropriately sized chunks that the hardware can work on. These chunks can then be mapped to the hardware ISA using TVM’s tensorize feature.
A simple motivating example:
Input variables - A[60], B[60]
Output variable - C[60]
Computation - C[i] = A[i] + B[i]
realize compute([0, 60]) {
  produce compute {
    for (i, 0, 60) {
      compute(i) =(A(i) + B(i))
    }
  }
}
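Suppose the hardware can only work on 16 elements at a time. We can use the split primitive with a factor of 16. Because 60 is not a multiple of 16, the split introduces ‘likely’ bounds checks, and we get the following AST: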
// attr [compute(compute, 0x1873900)] realize_scope = ""
realize compute([0, 60]) {
  produce compute {
    for (i.outer, 0, 4) {
      for (i.inner, 0, 16) {
        if (likely(((i.inner + (i.outer*16)) < 60))) {
          if (likely(((i.inner + (i.outer*16)) < (60 - 0)))) {
            compute((i.inner + (i.outer*16))) =(A((i.inner + (i.outer*16))) + B((i.inner + (i.outer*16))))
          }
        }
      }
    }
  }
}
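To make the cost of the guard concrete, here is a minimal plain-Python emulation of the split loop nest above. It is only a sketch of the loop structure, not TVM code; the names `split_add` and `guarded_off` are made up for illustration.

```python
# Plain-Python emulation of the split loop nest (illustrative, not TVM code).
N, FACTOR = 60, 16

def split_add(a, b):
    out = [0] * N
    guarded_off = 0                          # iterations rejected by the guard
    n_outer = (N + FACTOR - 1) // FACTOR     # ceil(60 / 16) = 4
    for i_outer in range(n_outer):
        for i_inner in range(FACTOR):
            i = i_inner + i_outer * FACTOR
            if i < N:                        # plays the role of the 'likely' condition
                out[i] = a[i] + b[i]
            else:
                guarded_off += 1
    return out, n_outer, guarded_off

a = list(range(N))
b = [2 * x for x in range(N)]
c, n_outer, guarded_off = split_add(a, b)
print(n_outer, guarded_off)  # 4 outer iterations, 4 iterations guarded off
```

The guard is only ever false in the last outer iteration (indices 60 to 63), which is exactly the part that hardware without control flow cannot handle.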
Finally, we need to perform loop partitioning, which resolves these ‘likely’ if conditions at compile time. The final output looks like this:
// attr [compute(compute, 0x1873900)] realize_scope = ""
realize compute([0, 60]) {
  produce compute {
    for (i.outer, 0, 3) {
      for (i.inner, 0, 16) {
        compute(((i.outer*16) + i.inner)) =(A(((i.outer*16) + i.inner)) + B(((i.outer*16) + i.inner)))
      }
    }
    for (i.inner, 0, 12) {
      compute((48 + i.inner)) =(A((48 + i.inner)) + B((48 + i.inner)))
    }
  }
}
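The arithmetic behind this partitioned form is just integer division: 60 = 3 * 16 + 12. A small plain-Python sketch of the same structure follows; `partition` and `partitioned_add` are hypothetical helper names and this is not how the TVM pass is implemented.

```python
def partition(extent, factor):
    """Return (full_tiles, tail) with extent == full_tiles * factor + tail."""
    return extent // factor, extent % factor

def partitioned_add(a, b, factor=16):
    n = len(a)
    out = [0] * n
    full_tiles, tail = partition(n, factor)
    # Main loop nest: every index is in bounds, so no guard is needed.
    for i_outer in range(full_tiles):
        for i_inner in range(factor):
            i = i_outer * factor + i_inner
            out[i] = a[i] + b[i]
    # Tail loop: covers the remaining `tail` elements, again without a guard.
    base = full_tiles * factor
    for i_inner in range(tail):
        out[base + i_inner] = a[base + i_inner] + b[base + i_inner]
    return out

print(partition(60, 16))  # (3, 12)
```

Both loops now have constant extents (16 and 12) and no branches, which is the shape we would like to hand to the hardware.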
Current support - There is a loop partitioning pass that supports this kind of partitioning to some extent, and there is tensorization, which maps a sub-compute loop nest to a HW instruction.
Problem - Here, we need to perform loop partitioning first and then tensorization. However, loop partitioning runs after the IR has already been lowered to the AST, whereas tensorization works at a much higher IR level, on the Schedule data structures. So by the time the loops are partitioned, it is too late to tensorize them.
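For concreteness, this is roughly the end result I would like after partitioning and tensorization, sketched in plain Python. `VADD` is a made-up instruction name, and I am assuming the hardware (or a second intrinsic) can accept a shorter length for the tail chunk; neither comes from a real ISA.

```python
def emit_vadd_stream(extent, width=16):
    """Cover [0, extent) with ("VADD", offset, length) tuples, length <= width.

    "VADD" is a hypothetical vector-add instruction used only for illustration.
    """
    stream = []
    offset = 0
    while offset < extent:
        length = min(width, extent - offset)
        stream.append(("VADD", offset, length))
        offset += length
    return stream

def run_stream(stream, a, b):
    """Software model of the instruction stream: each VADD adds one chunk."""
    out = [0] * len(a)
    for op, offset, length in stream:
        assert op == "VADD"
        end = offset + length
        out[offset:end] = [x + y for x, y in zip(a[offset:end], b[offset:end])]
    return out

stream = emit_vadd_stream(60)
for instr in stream:
    print(instr)
```

For the 60-element example this yields three full-width instructions and one 12-element tail instruction, with no control flow left for the hardware to evaluate.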
@tqchen @ziheng @thierry Do you have any ideas on how to solve this issue? I suspect VTA has already encountered this type of problem.