Feedback on TVM port to custom accelerator

Hi @UCASHurui, welcome to the TVM discuss forums!

Please do not hesitate to ask questions here. Creating a compiler stack for such a platform turned out to be very challenging and I’ve received excellent help on these forums quite a few times.

This post came out of the research for my master's thesis, which I finished in June. I can send you the entire thesis over email (send me your address in a DM); it gives a detailed overview of our flow and findings, if you'd like that.

I think TVM is a very good fit for these kinds of projects, but as I've stated above already, it still requires a lot of vertical work along the stack, which can be daunting at times. We also noticed in our research that some inner components of TVM (like the tensorization intrinsic support for nested functions mentioned above) still need improvements before we can proceed with our research.

So far we have not been able to use the accelerator hardware itself (we only compiled for the RISC-V core), because our accelerator is coarse-grained (it offloads an entire convolution), and using TVM's tensorization primitive for that purpose was hard due to some constraints I already mentioned before. But basically, what we tried to do is the following:

  • Load in a network with TVM's front end. (I only tested the network provided by @areusch in his microTVM blog post fork, as I did not have time to figure out TVM's quantization framework, which is currently being overhauled if I'm correct.) Note that we only simulated for cycle counts on the processor; we did not feed in any camera input, nor did we do anything with the actual network results. A sketch of this step follows after the list.
  • Perform high-level optimizations (operator fusion, quantization, etc.). These are already provided by TVM and were not altered by us.
  • Offload to the accelerator where possible using a Relay op strategy (Relay Operator Strategy — tvm 0.8.dev0 documentation). We did create a strategy (a sketch follows after the list), but in the end we were not able to offload any operations to the accelerator, since the hwlib was not finished and tensorization was difficult. This step also lowers the Relay graph to TensorIR.
  • Lower TIR to C code with the C runtime. We used the C runtime to generate a kernel library for each (possibly fused) operator in the network graph. This TVM library was called from a C main file generated by our custom AOT tool, which reads in the updated (fused) network graph. I cannot share that tool; however, a similar tool was shared here.
  • Compile the C code with GCC and flash the binary onto the device.
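To make steps 1 and 4 more concrete, here is a minimal sketch of what the Python driver for such a flow can look like. Everything specific in it is a placeholder: the model file, the input name/shape/dtype, and the exact target string (this is the 0.8.dev0-era API we used; newer TVM versions configure the C runtime through Executor/Runtime objects instead):

```python
import tvm
import tflite  # pip package "tflite"
from tvm import relay

# Step 1: load a (pre-quantized) network with TVM's TFLite front end.
# "model.tflite" and the input name/shape/dtype are placeholders.
with open("model.tflite", "rb") as f:
    tflite_model = tflite.Model.GetRootAsModel(f.read(), 0)
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 96, 96, 1)},
    dtype_dict={"input": "int8"},
)

# Step 4: lower to C sources with the C runtime; vectorization is
# disabled so the generated C stays portable to a bare-metal core.
target = tvm.target.Target("c --runtime=c --system-lib")
with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}):
    factory = relay.build(mod, target=target, params=params)

# The kernel library is plain C source; depending on your TVM version you
# may need tvm.micro.export_model_library_format instead of get_source().
with open("kernels.c", "w") as f:
    f.write(factory.get_lib().get_source())
```

The resulting C sources are what GCC compiles together with the AOT-generated main file in step 5.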
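Step 3 (the Relay op strategy) roughly follows the pattern below. This is a hedged sketch, not our actual code: `accel` is a hypothetical target key (it would have to appear in your target, e.g. `c -keys=accel,cpu ...`), and `accel_supports` and `my_conv2d_schedule` are hypothetical stand-ins for the offloadability check and the custom schedule that applies tensorization:

```python
from tvm import topi
from tvm.relay.op import op as _op
from tvm.relay.op.strategy.generic import (
    conv2d_strategy,
    wrap_compute_conv2d,
    wrap_topi_schedule,
)

@conv2d_strategy.register("accel")  # hypothetical target key
def conv2d_strategy_accel(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    if attrs.data_layout != "NHWC" or attrs.groups != 1:
        raise RuntimeError("this sketch only handles simple NHWC convolutions")
    if accel_supports(attrs, inputs):  # hypothetical offloadability check
        # Preferred implementation: a schedule that tensorizes the whole
        # convolution into one hwlib call (the higher plevel wins).
        strategy.add_implementation(
            wrap_compute_conv2d(topi.nn.conv2d_nhwc),
            wrap_topi_schedule(my_conv2d_schedule),  # hypothetical schedule
            name="conv2d_nhwc.accel",
            plevel=15,
        )
    # Fallback: plain C code running on the RISC-V core itself.
    strategy.add_implementation(
        wrap_compute_conv2d(topi.nn.conv2d_nhwc),
        wrap_topi_schedule(topi.generic.schedule_conv2d_nhwc),
        name="conv2d_nhwc.generic",
    )
    return strategy
```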

To elaborate a bit more on the hwlib and hardware resources: the Relay op strategy tests whether an operation can be offloaded to the accelerator. TVM's tensorization primitive then alters the TensorIR such that the operation is lowered to a single function call; in our case this was a call to a C function in the hwlib. You can regard the hwlib as a kind of assembly layer that manages accelerator memory and triggers operations. In our case a lot of different C calls were needed to prepare the accelerator for a single operation, so we made an abstraction (= the hwlib) that performs those C calls inside a single function call, as sketched below. I think that if you use an ISA extension on your RISC-V core to trigger your accelerator, you could also tell TVM to lower to your own RISC-V assembly instead of C code, but I have no experience with this.
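For reference, such a tensor intrinsic looks roughly like the sketch below, modeled on TVM's tensorize tutorial. To keep it short it uses a small GEMV-style computation rather than a full convolution, and `hwlib_gemv_update` is a hypothetical hwlib entry point; the pattern is the same for a coarse convolution call:

```python
import tvm
from tvm import te

def intrin_hwlib_gemv(m, k):
    """Tensor intrinsic whose lowered body is a single extern call into
    the hwlib; hwlib_gemv_update is a hypothetical hwlib C function."""
    a = te.placeholder((k,), dtype="int8", name="a")
    b = te.placeholder((m, k), dtype="int8", name="b")
    rk = te.reduce_axis((0, k), name="rk")
    c = te.compute(
        (m,),
        lambda i: te.sum(a[rk].astype("int32") * b[i, rk].astype("int32"), axis=rk),
        name="c",
    )
    # Buffer declarations tell TVM how the intrinsic expects its data laid out.
    Ab = tvm.tir.decl_buffer(a.shape, a.dtype, name="A", offset_factor=1, strides=[1])
    Bb = tvm.tir.decl_buffer(b.shape, b.dtype, name="B", offset_factor=1,
                             strides=[te.var("s"), 1])
    Cb = tvm.tir.decl_buffer(c.shape, c.dtype, name="C", offset_factor=1, strides=[1])

    def intrin_func(ins, outs):
        aa, bb = ins
        cc = outs[0]
        ib = tvm.tir.ir_builder.create()
        # The whole tensorized region becomes this one call; internally the
        # hwlib moves data to accelerator memory and triggers the operation.
        ib.emit(tvm.tir.call_extern(
            "int32", "hwlib_gemv_update",
            cc.access_ptr("w"), aa.access_ptr("r"), bb.access_ptr("r"), m, k,
        ))
        return ib.get()

    return te.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb})
```

A schedule then replaces the matching loop nest with this intrinsic via `s[C].tensorize(...)`, so the generated C contains a single call to `hwlib_gemv_update`, which the hwlib implementation satisfies at link time.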

A big constraint we had to deal with was that we did not have a physical device available, so we had to run everything in RTL simulations, which is very inconvenient and time-consuming.

Please let me know if this is clear/helpful! I’d be happy to help you out further!