Hi,
This is a very interesting thread, thank you very much for posting. I am trying to do very similar work, but using a DSP that is targeted through C and also uses intrinsics written in assembly.
The AoT extension that you linked will come in very handy, since for now I’m only looking at the C code for the functions and compiling it in a separate program that I wrote by hand.
Do your CPU and accelerator share memory, or do you need to use DMA? Since your accelerated operators are quite “macro” (conv2d), maybe you do the DMA inside the HWlib? In my case I need to handle the DMA in the operator compute strategy. Do you need to handle DMA at all?
What about multi-core? Is this of any concern to you, or is that also handled in the HWlib?
Regards,
Andrei