Some architectures can use the LLVM backend with tensorize for code generation, but require extra operations before or after the generated loop program to ensure correct results. One example is the Gemmini matrix multiplication accelerator, whose execution flow can be embedded into normal LLVM code generation for RISC-V, but which requires explicit fence instructions to block execution until data has been fully flushed back to DRAM.
It would be helpful to be able to insert void(void) function calls before and after the generated nested loop program. The pair of calls should be able to surround any level of the loop nest for fine-grained control.
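To make the request concrete, here is a minimal hand-written sketch of the shape the generated code might take when the call pair surrounds the inner loop level. The names example_prologue, example_epilogue and row_sums are hypothetical, and the counters merely stand in for real side effects such as a fence; this is not output from TVM.

```c
#include <stddef.h>

/* Hypothetical void(void) hooks; the counters stand in for real side effects. */
static int prologue_calls = 0;
static int epilogue_calls = 0;

static void example_prologue(void) { prologue_calls++; }
static void example_epilogue(void) { epilogue_calls++; }

/* Sum each row of a rows x cols matrix stored in row-major order.
 * The hook pair surrounds the inner loop, so it runs once per outer
 * iteration. */
static void row_sums(const int *a, int *out, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; ++i) {
        example_prologue();
        int acc = 0;
        for (size_t j = 0; j < cols; ++j) {
            acc += a[i * cols + j];
        }
        out[i] = acc;
        example_epilogue();
    }
}
```

Moving the pair outside the outer loop would make the hooks run once per call to row_sums instead, which is the kind of per-level choice the pragmas are meant to expose.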
Proposed change
A pair of pragmas, prologue and epilogue, is proposed to support this pattern. The use case would look like the following:
Thanks for the RFC. I think the proposed pragma is reasonable. However, in terms of implementation, it would be great if we used a rewriting pass (like in lower_tvm_intrin) to lower it to a call node before codegen, so that we do not need to handle these pragmas in the codegen phase.
It would also be great if you could propose a few alternative API names, so others can pick among the choices.
For alternative names, I’ve been thinking about some, but chose prologue/epilogue because of the consistency between the two (pro/epi -logue). Some alternatives I’ve considered:
preamble/conclusion
before_body/after_body
pre/post
Regarding the implementation, I’m not familiar with the lowering passes (yet), so I just coded a quick one in the LLVM codegen. In fact, I’ve recently hacked the implementation into CodeGenC as well (I’m playing with MicroTVM). I agree that this should probably be done at a higher level, but I’d need some assistance.
Couldn’t this be implemented as a custom IR pass (in Python or C++) instead of as a new scheduling primitive? This is essentially taking the body b of a For and replacing it with Block(prologue, b, epilogue) right?
That should be doable as well. I think that should be a Block(prologue, For, epilogue) though, as we still want the loop, not just the body.
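The difference between wrapping the loop and wrapping its body can be sketched with a toy statement tree. Everything below (Stmt, make_call, make_for, wrap_stmt, count_calls) is a hypothetical illustration and not TVM's IR or pass API; nodes are allocated with calloc and never freed, for brevity.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { STMT_CALL, STMT_FOR, STMT_BLOCK } StmtKind;

typedef struct Stmt {
    StmtKind kind;
    const char *callee;     /* STMT_CALL: name of the called function */
    int extent;             /* STMT_FOR: trip count */
    struct Stmt *body;      /* STMT_FOR: loop body */
    struct Stmt *items[3];  /* STMT_BLOCK: statements in execution order */
} Stmt;

static Stmt *new_stmt(StmtKind kind) {
    Stmt *s = calloc(1, sizeof *s);
    if (s == NULL) abort();
    s->kind = kind;
    return s;
}

static Stmt *make_call(const char *callee) {
    Stmt *s = new_stmt(STMT_CALL);
    s->callee = callee;
    return s;
}

static Stmt *make_for(int extent, Stmt *body) {
    Stmt *s = new_stmt(STMT_FOR);
    s->extent = extent;
    s->body = body;
    return s;
}

/* Build Block(prologue, inner, epilogue) around any statement. */
static Stmt *wrap_stmt(Stmt *inner, const char *prologue, const char *epilogue) {
    Stmt *b = new_stmt(STMT_BLOCK);
    b->items[0] = make_call(prologue);
    b->items[1] = inner;
    b->items[2] = make_call(epilogue);
    return b;
}

/* Count how many times `callee` would run if `s` were executed. */
static int count_calls(const Stmt *s, const char *callee) {
    if (s == NULL) return 0;
    switch (s->kind) {
    case STMT_CALL:
        return strcmp(s->callee, callee) == 0 ? 1 : 0;
    case STMT_FOR:
        return s->extent * count_calls(s->body, callee);
    case STMT_BLOCK: {
        int total = 0;
        for (int i = 0; i < 3; ++i) total += count_calls(s->items[i], callee);
        return total;
    }
    }
    return 0;
}
```

Wrapping a For of extent 4 with wrap_stmt gives one prologue and one epilogue call, whereas wrapping only its body and rebuilding the For gives four of each. A fence after every iteration is usually more than is needed, which is why keeping the loop inside the block matters here.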
However, I’m wondering if this would become a common pattern used for many targets. So far, non-trivial accelerators with the RoCC interface would require this pattern to ensure memory consistency, and I’m anticipating that more heterogeneous SoCs may benefit from this pattern, not just for memory consistency (e.g. enabling a power-hungry device prior to computation and disabling it afterwards).
I agree with @ajtulloch that perhaps it would be helpful to explore if we can do that automatically.
For example, we could write a custom pass that inserts the necessary memory fences when it detects RW dependencies between the scratchpad and the data (when they correspond to different storage scopes).
If I understand correctly, you suggested that fence instructions are required to block the execution flow until data has been fully flushed back to DRAM. Therefore, I’m not quite sure whether we really need prologue.
If what we really need is just the epilogue pragma, I think barrier might be a better name for it. A typical implementation of a barrier looks like this:
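static void __attribute__((noinline)) barrier(int ncores)
{
    static volatile int sense;        /* global sense, flipped once per barrier round */
    static volatile int count;        /* number of threads that have arrived */
    static __thread int threadsense;  /* sense this thread is waiting for */

    __sync_synchronize();
    threadsense = !threadsense;
    if (__sync_fetch_and_add(&count, 1) == ncores-1)
    {
        /* Last thread to arrive: reset the counter and release the others. */
        count = 0;
        sense = threadsense;
    }
    else while(sense != threadsense)
        ;  /* spin until the last thread flips the global sense */
    __sync_synchronize();
}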
For my current use case (RoCC accelerators), yes. Actually, I do not even need a full-scale barrier between the cores: just an __asm__ volatile("fence"); would be sufficient. As I mentioned in my previous reply, I’m not sure whether there are real use cases for the prologue part; it was simply tempting to add it for symmetry with epilogue. If there are none, I do think something like barrier would be a better name. We might need to look into the semantics though, since barrier specifically implies enforcing memory consistency between cores, whereas epilogue can accept arbitrary void(void) functions.
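As a rough sketch of the fence-only case, the epilogue could be a plain void(void) function like the one below. The names fence_epilogue, run_with_epilogue, mark_body and the counters are hypothetical. The "memory" clobber is an addition not present in the one-liner above, intended to stop the compiler from reordering memory accesses across the fence, and the non-RISC-V branch is only a stand-in so the sketch builds on other hosts with GCC or Clang.

```c
/* Hypothetical bookkeeping so the effect of the sketch can be observed. */
static int fences_issued = 0;
static int body_ran = 0;

/* Hypothetical epilogue: a void(void) function that issues a fence. */
static void fence_epilogue(void) {
#if defined(__riscv)
    __asm__ volatile("fence" ::: "memory");
#else
    __sync_synchronize();  /* stand-in on non-RISC-V hosts (GCC/Clang builtin) */
#endif
    fences_issued++;
}

/* Stand-in for the generated loop program. */
static void mark_body(void) { body_ran = 1; }

/* Run a body, then an arbitrary void(void) epilogue. */
static void run_with_epilogue(void (*body)(void), void (*epilogue)(void)) {
    body();
    epilogue();
}
```

Calling run_with_epilogue(mark_body, fence_epilogue) runs the body and then issues one fence. Because the second argument is an arbitrary void(void) function, the same shape would accept hooks that have nothing to do with memory ordering, which is the semantic gap between epilogue and barrier.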