Add support for extern prologue/epilogue functions

KireinaHoro · March 21, 2020, 3:42am

Motivation

Some architectures can use the LLVM backend together with tensorize for code generation, but require extra operations before or after the generated loop nest to ensure correct results. One example is the Gemmini matrix multiplication accelerator, whose execution flow can be embedded into normal LLVM code generation for RISC-V, but which requires explicit fence instructions to block the execution flow until data has been fully flushed back to DRAM.

It would be helpful to be able to insert void(void) function calls before and after the generated nested loop program. The pair of calls should be able to surround any level of the loop nest for fine-grained control.

Proposed change

A pair of pragmas, prologue and epilogue, is proposed to support this pattern. Usage would look like the following:

s[C].pragma(yo, "prologue", "test_prologue")
s[C].pragma(yo, "epilogue", "test_epilogue")

with the corresponding C++ code (the functions are exported with C linkage):

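#include <stdio.h>

// Exported with C linkage so the generated code can resolve the symbols by name.
extern "C" int test_prologue() {
	printf("%s invoked\n", __func__);
	return 0;
}

extern "C" int test_epilogue() {
	printf("%s invoked\n", __func__);
	return 0;
}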

A more complete example can be found in this gist.

A quick implementation is in https://github.com/apache/incubator-tvm/pull/5050, which directly emits call nodes during LLVM codegen when the pragma is detected.

Discussion

Is the form of pragma suitable for expressing this kind of operation?
Are the names prologue and epilogue expressive?
The call nodes are directly emitted in LLVM codegen backend in the pull request. Is this the correct / preferred way to do this?

tqchen · March 25, 2020, 12:00am

Thanks for the RFC. I think the proposed pragma is reasonable. In terms of implementation, however, it would be great to do a rewriting pass (as in lower_tvm_intrin) that lowers the pragma to a call node before codegen, so we do not need to handle these pragmas in the codegen phase.
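To make the idea of such a rewriting pass concrete, here is a minimal sketch over a toy statement IR. Everything in it is an assumption for illustration: the node classes (Compute, Call, Seq, For, Attr), the function lower_pragmas, and the attribute keys pragma_prologue / pragma_epilogue are hypothetical stand-ins, not TVM's actual IR or pass API.

```python
from dataclasses import dataclass
from typing import List, Union


# Toy statement IR. These classes are illustrative, not TVM's node types.
@dataclass
class Compute:
    name: str


@dataclass
class Call:
    func: str  # name of the extern function to call


@dataclass
class Seq:
    stmts: List["Stmt"]


@dataclass
class For:
    var: str
    extent: int
    body: "Stmt"


@dataclass
class Attr:
    key: str    # assumed keys: "pragma_prologue" / "pragma_epilogue"
    value: str  # extern function name carried by the pragma
    body: "Stmt"


Stmt = Union[Compute, Call, Seq, For, Attr]


def lower_pragmas(stmt):
    """Replace prologue/epilogue attribute nodes with explicit call nodes."""
    if isinstance(stmt, Attr):
        body = lower_pragmas(stmt.body)
        if stmt.key == "pragma_prologue":
            return Seq([Call(stmt.value), body])
        if stmt.key == "pragma_epilogue":
            return Seq([body, Call(stmt.value)])
        return Attr(stmt.key, stmt.value, body)  # unrelated attribute: keep it
    if isinstance(stmt, For):
        return For(stmt.var, stmt.extent, lower_pragmas(stmt.body))
    if isinstance(stmt, Seq):
        return Seq([lower_pragmas(s) for s in stmt.stmts])
    return stmt  # leaf nodes are returned unchanged


loop = For("yo", 4, Compute("C"))
annotated = Attr("pragma_prologue", "test_prologue",
                 Attr("pragma_epilogue", "test_epilogue", loop))
lowered = lower_pragmas(annotated)
```

After this toy pass runs, no pragma attributes remain in the tree, so a codegen backend would only ever see ordinary call nodes around the loop.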

It would also be great if you could propose a few alternative API names, so others can choose among them.

tqchen · March 25, 2020, 12:01am

cc @liangfu @yzhliu @ajtulloch @vinx13 @thierry would love to know your thoughts

KireinaHoro · March 25, 2020, 1:53am

For alternative names, I’ve been thinking about some, but chose prologue/epilogue because of the consistency between the two (pro/epi -logue). Some alternatives I’ve considered:

preamble/conclusion
before_body/after_body
pre/post

Regarding the implementation, I’m not familiar with the lowering passes (yet), so I just coded a quick one in LLVM codegen. In fact, I’ve recently hacked the implementation into CodeGenC as well (I’m playing with MicroTVM). I agree that this should probably be done at a higher level, but I’d need some assistance.

ajtulloch · March 25, 2020, 2:50am

Couldn’t this be implemented as a custom IR pass (in Python or C++) instead of as a new scheduling primitive? This is essentially taking the body b of a For and replacing it with Block(prologue, b, epilogue) right?

KireinaHoro · March 25, 2020, 3:00am

That should be doable as well. I think that should be a Block(prologue, For, epilogue) though, as we still want the loop, not just the body.
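The difference between the two placements can be seen in a small simulation. This is only a sketch: the two functions are hypothetical and record a trace of what would execute, rather than building real IR.

```python
def run_wrapping_body(extent, trace):
    # Block(prologue, b, epilogue) as the loop body: hooks run on every iteration.
    for i in range(extent):
        trace.append("prologue")
        trace.append(f"body[{i}]")
        trace.append("epilogue")


def run_wrapping_loop(extent, trace):
    # Block(prologue, For, epilogue): hooks run once around the whole loop.
    trace.append("prologue")
    for i in range(extent):
        trace.append(f"body[{i}]")
    trace.append("epilogue")


body_trace, loop_trace = [], []
run_wrapping_body(3, body_trace)
run_wrapping_loop(3, loop_trace)

print(body_trace.count("epilogue"))  # 3
print(loop_trace.count("epilogue"))  # 1
```

For a fence that only needs to run after all results are written, the second form avoids paying for the call on every iteration.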

However, I’m wondering if this would become a common pattern used for many targets. So far, non-trivial accelerators with the RoCC interface would require this pattern to ensure memory consistency, and I’m anticipating that more heterogeneous SoCs may benefit from this pattern, not just for memory consistency (e.g. enabling a power-hungry device prior to computation and disabling it afterwards).

tqchen · March 25, 2020, 3:47am

I agree with @ajtulloch that perhaps it would be helpful to explore if we can do that automatically.

For example, we could write a custom pass that inserts the necessary memory fences when it detects read/write dependencies between the scratchpad and the data (when they correspond to different storage scopes).
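A rough sketch of what such a pass might do, on a flat list of toy operations. The Op class, the scope names, and the rule itself are assumptions for illustration; the rule is deliberately conservative (it fences after any cross-scope write that is later read) and does not model any particular accelerator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    reads: tuple   # buffers read by this op
    writes: tuple  # buffers written by this op


FENCE = Op("fence", (), ())


def insert_fences(ops, scope):
    """Insert a fence before any op that reads a buffer whose last write
    came from an op that moved data across storage scopes."""
    pending = set()  # buffers written by a cross-scope op and not yet fenced
    out = []
    for op in ops:
        if any(b in pending for b in op.reads):
            out.append(FENCE)
            pending.clear()  # one fence orders all earlier writes
        out.append(op)
        crosses = any(scope[r] != scope[w] for r in op.reads for w in op.writes)
        for w in op.writes:
            if crosses:
                pending.add(w)
            else:
                pending.discard(w)
    return out


# Hypothetical buffers and their storage scopes.
scope = {"A": "global", "A_sp": "scratchpad",
         "C_sp": "scratchpad", "C": "global", "D": "global"}
ops = [
    Op("load", ("A",), ("A_sp",)),     # global -> scratchpad
    Op("gemm", ("A_sp",), ("C_sp",)),  # stays within the scratchpad
    Op("store", ("C_sp",), ("C",)),    # scratchpad -> global
    Op("relu", ("C",), ("D",)),        # consumer of the flushed result
]
result = [op.name for op in insert_fences(ops, scope)]
print(result)  # ['load', 'fence', 'gemm', 'store', 'fence', 'relu']
```

A real pass would need to know which cross-scope transfers the hardware already orders by itself, so that only the fences that are actually required remain.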

liangfu · March 25, 2020, 4:33am

If I understand correctly, you suggested that we might require fence instructions to block the execution flow until data has been fully flushed back to DRAM. In that case, I’m not sure we really need prologue.

If what we really need is just the epilogue pragma, I think barrier might be a better name for it. A typical implementation of a barrier looks like

static void __attribute__((noinline)) barrier(int ncores)
{
  static volatile int sense;
  static volatile int count;
  static __thread int threadsense;

  __sync_synchronize();

  threadsense = !threadsense;
  if (__sync_fetch_and_add(&count, 1) == ncores-1)
  {
    count = 0;
    sense = threadsense;
  }
  else while(sense != threadsense)
    ;

  __sync_synchronize();
}

which can be found at https://github.com/riscv/riscv-tests/blob/master/benchmarks/common/util.h#L44.

KireinaHoro · March 25, 2020, 4:41am

For my current use case (RoCC accelerators), yes. Actually, I do not even need a full-scale barrier between the cores: just an __asm__ volatile("fence"); would be sufficient. As I said in my previous reply, I’m wondering whether there are actual use cases for the prologue part, since I mostly added it for symmetry with epilogue. If there are none, I do think something like barrier would be a better name. We might need to look into the semantics though: barrier has the specific meaning of enforcing memory consistency between cores, whereas epilogue can accept arbitrary void(void) functions.