Feedback on TVM port to custom accelerator

JosseVanDelm · March 27, 2021, 12:33pm

Hi everyone,

In this post I would like to ask some feedback on a research project currently going on for retargetting TVM for our custom developed RISC-V microcontroller with accelerator. I can not share too many details about the hardware since it is still in development. The goal is to use the platform in a ultra-low power embedded setting.

The hardware is quite specialized and constrained and it is not always obvious how to optimally map neural network graphs onto it. We found it quite challenging to find which parts of the TVM stack were already useful/repurposable for our purposes, that is why we are reusing bits and pieces on various places.

Along with the development of porting the TVM-stack to our platform, we are also making an OpenCL-like library (the HWlib) to provide a programming interface to the accelerator. In TVM we registered our own code generation backend based off of the C backend. We also made a custom AOT compiler program since this is not currently available in (micro)TVM and the overhead of a more dynamic runtime seemed unnecessary complex for our purposes. It basically deserializes the JSON graph that comes out of the Relay compilation step and puts all of the compiled TVM function library calls in a big C main file and does some tensor (de)allocation as well. It also dumps the weights in a C header file. Our microcontroller does not run an operating system, also because we think this provides unnecessary overhead for the applications we target.

Right now our current development flow looks like this:

We test a single operator from Relay (e.g. conv2d)
We try to adapt a relay op strategy from the generic strategy
In the relay strategy we try to tensorize as much of the operator as possible, by allocating as much of the computation as possible to a HWlib function call
We put the JSON, TVM function library and weights that come out of compilation in our own AoT compiler program.
We put the C files in the adapted GCC compiler for the RISC-V microcontroller we are using. Our own AoT compiler program also makes sure the C code compiles with this GCC compiler.

In this way the outputted C code is run as much as possible on the accelerator, and parts that the accelerator does not support are compiled to the RISC-V core, provided by the generic relay strategy. Currently we are facing some challenges though, for which we would like your comments/opinions/recommendations:

Right now a lot of calls to the HWlib are very inefficient, as they require a lot of data reformatting on the RISC-V before being accessible to the accelerator. It is weird/annoying that the data layout already gets specified from Relay, we would probably need to insert a data layout (TIR?) optimization pass along the computation graph at some point there. It’s also not always clear to us which parts of those calls are best offloaded to the hardware library, or which parts (like input padding, data layout transformation) can also be known/provided by TVM.
Our accelerator supports int8, but also int4 and int2. At some point we will probably need to look into the Bring your own datatype framework, but we also still need to look into quantization support in TVM. Any recommended reference work would be very useful here! It seems quite challenging that quantization is something that lives across the entire compilation framework, even already from before deployment.
We have looked into using BYOC, but we felt like this was a very direct mapping of Relay to instructions, which bypasses a lot of scheduling/optimization magic (Tensor Expressions, AutoTVM) from the rest of the TVM stack. It also did not seem like a very scalable solution to us, since it seems like we would have to map a lot of Relay instructions directly to a HWLib function call, which we also have to develop ourselves.
We have looked into VTA, but VTA is quite different from our platform. We don’t have a fully fledged workstation host device at hand, apart from the bare metal microcontroller. Also we would like to compile as much as possible statically and AoT, and not in a JIT-fashion. Maybe there are some accelerator specific parts we can reuse though. If someone can share their experience on reusing some of this work that would be very insightful!
Some functions of the HWlib require parameters that have to be set during compilation based on the weights. It is not clear to us how this fits in with the rest of the compilation stack. Could this be implemented in a TIR pass for example?

We really want this to work so our hardware can be used in other, more applied projects. Nevertheless I should admit that the development currently has been anything but straightforward, it is for example not always clear what part of an optimization step we should implement where (Input network graph, Relay, HWLib, TE,…?). This means that currently a lot of development happens very vertically across the entire stack, which makes it difficult to divide work in the project group or to get started with development on new issues in the first place. This issue seems to be a true for most (embedded) DL compilation stacks though.

We hope that fellow developers in the community can share their thoughts and experiences on these issues and what they think about our approach. If you need more details from my part to understand our flow, please let me know.

Thank you all very much!

Best regards,

Josse

areusch · March 30, 2021, 3:57am

Hi @JosseVanDelm ,

Thanks for the post! Some thoughts:

Right now a lot of calls to the HWlib are very inefficient, as they require a lot of data reformatting on the RISC-V before being accessible to the accelerator. It is weird/annoying that the data layout already gets specified from Relay, we would probably need to insert a data layout (TIR?) optimization pass along the computation graph at some point there.

and

Our accelerator supports int8, but also int4 and int2. At some point we will probably need to look into the Bring your own datatype framework, but we also still need to look into quantization support in TVM. Any recommended reference work would be very useful here!

Tagging @jwfromm in case he knows more here.

We have looked into using BYOC, but we felt like this was a very direct mapping of Relay to instructions, which bypasses a lot of scheduling/optimization magic (Tensor Expressions, AutoTVM) from the rest of the TVM stack. It also did not seem like a very scalable solution to us, since it seems like we would have to map a lot of Relay instructions directly to a HWLib function call, which we also have to develop ourselves.

Is tensorization an option here, or do you need to do more with the TIR after schedule generation?

We have looked into VTA, but VTA is quite different from our platform. We don’t have a fully fledged workstation host device at hand, apart from the bare metal microcontroller. Also we would like to compile as much as possible statically and AoT, and not in a JIT-fashion. Maybe there are some accelerator specific parts we can reuse though. If someone can share their experience on reusing some of this work that would be very insightful!

This is an area I’m quite interested in, but we haven’t done anything on this I know of.

Some functions of the HWlib require parameters that have to be set during compilation based on the weights. It is not clear to us how this fits in with the rest of the compilation stack. Could this be implemented in a TIR pass for example?

It seems like you could have a TIR pass that replaces free variables with constants after doing that computation.

Also tagging @tqchen who may have some more ideas of related work here.

Andrew

JosseVanDelm · March 30, 2021, 12:28pm

Thanks for your reply @areusch !

Yes, i’m currently trying to use tensorization to map entire convolutions and data preparation steps (data layout, padding) to a HWLib function call, but the process hasn’t been particularly smooth for such coarse computations i’m afraid. Getting data to be transformed from TVM seems suboptimal. Also creating large tensorization intrinsics is tricky; Right now for example it looks like I would have to generate a separate TIR pass, because I can not merge e.g.Relu(Conv(Pad(ChgDataLayout(input)),filter)) into one intrinsic; tensorize/tir does not allow for creating an intrinsic with nested computations The TIR pass i’m envisioning could detect those sequential operations and maybe merge them into one as a workaround for this problem.

I’m not sure how to write a TIR pass yet, but what I would like to do in the future is to maybe skip some data layout transformations automatically. Now the data has to be transformed every time it is sent from and to the accelerator just because most standard convolutions expect NCHW to work in relay for example. We should not be doing data layout transformations if two consecutive operations are performed on the accelerator. I’m not sure if it would be best to implement this as a Relay pass or a TIR pass. If anyone can confirm that this is possible or can send me some work on this that would be very great, as i’ve not had time to look into creating my own pass.

At some point I’d also like to include some autotuning in the grand scheme of things (probably not with actual timing measurements, but rather with a representative model). But I haven’t had time to look into this, and how much effort it would take me to implement this. I’m also afraid the gains of autotuning with coarse tensorization might be quite minimal. But maybe there might be some gains possible for the RISC-V scheduling, i’m not sure.

Okay I’ll be sure to look into this!

Also thank you very much for including other people in the discussion!

UCASHurui · July 15, 2021, 8:52am

Hi @JosseVanDelm, Thanks for your sharing! You can not imagine how glad I am to see this post! The goal of our research team is very similay to yours which is to develop RISC-V microcontroller with accelerator in a ultra-low power IoT scenario despite that we are still at a very begining stage. I’m still investigating the software stack to establish a proper workflow for our project.

It is very exciting to find out that (micro)TVM is a very suitable tool and there are experienced reseachers like you. Would you elaborate more on your flow, especially the HWlib you created and how does the compiler make use of the hardware resources on the accelerator? Any advices or pointers to external materials will be appreciated. Forgive me if I’m asking naive questions but I’m really looking forward to your reply .

Thank you very much!

Best regards,

Rui

JosseVanDelm · July 15, 2021, 12:24pm

Hi @UCASHurui welcome to the TVM discuss forums!

Please do not hesitate to ask questions here. Creating a compiler stack for such a platform turned out to be very challenging and I’ve received excellent help on these forums quite a few times.

This post was due to research for my master’s thesis which I finished in June. I can send you the entire thesis over email (send me your address in DM) which gives a detailed overview of our flow and findings if you’d like that.

I think TVM is a very good fit for these kind of projects, but as I’ve stated above already, it still requires a lot of vertical work along the stack which can be daunting sometimes. We also noticed in our research that some inner components of TVM (like tensorization intrinsic support for nested functions mentioned above) are still required to proceed with our research.

As of yet we were not able to use the accelerator hardware yet (we only compiled to the RISC-V core), because our accelerator is coarse (offload an entire convolution), and using TVM’s tensorization primitive was hard for that purpose due to some constraints I already mentioned before. But basically what we tried to do is the following:

Load in a network with TVM’s front end (I only tested out the network provided by @areusch in his microtvm blogpost fork as I did not have time to figure out TVMs quantization framework, which is currently being overhauled if i’m correct) however, we only simulated for cycle counts on the processor; we did not specify any camera input nor did we do anything with the actual network results
Perform high level optimizations (operator fusion and quantization etc). Already provided by TVM, not altered by us.
Offload to the accelerator when possible using a relay op strategy (Relay Operator Strategy — tvm 0.8.dev0 documentation) (We did create a strategy, but finally were not able to offload any operations to the accelerator since the hwlib was not finished, and also since tensorization was difficult. This step will also lower the Relay to TensorIR.
lower TIR to C code with C runtime We used the C runtime to generate kernel library for each (possibly fused) operator in the network graph. This TVM-library was called by a C main file generated by our custom AOT tool which reads in the updated (fused) network graph. I cannot share that tool, however a similar tool was shared here.
Compile C code with GCC compiler and flash the binary on the device

To elaborate more about the hwlib and hardware resources: The Relay op strategy tests when an operation is offloadable to the accelerator. The Tensorization primitive in TVM alters tensorIR such that operations are offloaded to a single function call. In our case this function call was a C function to the hwlib. You can regard the hwlib as some kind of assembly that manages accelerator memory and triggers operations. In our case a lot of different C calls where needed to prepare the accelerator for a single operation, so we made an abstraction (= the hwlib) that did those C calls in a single function call. I think if you use an ISA extension for your RISC-V core to trigger your accelerator you could also tell TVM to lower to your own RISC-V assembly instead of C code, but I have no experience with this.

A big constraint we had to deal with was that we don’t have a physical device available, and as such had to run everything in RTL simulations, which is very inconvenient and time consuming.

Please let me know if this is clear/helpful! I’d be happy to help you out further!

andrew_sto · July 16, 2021, 9:30am

Hi,

This is a very interesting thread, thank you very much for posting. I am trying to do a very similar type of work but using a DSP which is targeted in C but also uses intrinsics written in assembly.

The AoT extension that you linked will come very handy since for now I’m only looking at function C code and compiling that in a separate program that I wrote by hand.

Does your CPU and accelerator share memory or do you need to use DMA? Maybe since your accelerated operators are quite “macro” (conv2d) you do the DMA inside the HWlib? In my case I need to handle the DMA in the operator compute strategy. Do you need to handle DMA at all?

What about multi-core? Is this of any concern to you / is that also handled in the HWlib?

Regards,

Andrei

JosseVanDelm · July 19, 2021, 10:26am

Hi Andrei,

To be honest I did more of the high-level work on our software stack, so i’m not super familiar with the actual HWlib and I also have no experience myself in programming a DSP core. But yes indeed I guess the initial plan was to perform everything from memory accesses with a HWlib call so memory access was only performed at the start of every coarse operation (which is something I think we could have still improved/optimized, but haven’t had the time to do so in my thesis).

Our accelerator is based on PULPissimo and features a single-core RISC-V core. The coarse-grained accelerator is configured as a HWPE (hardware processing engine) in this platform. Both the RISC-V core and accelerator have their own L1 memory, but they share an L2 memory bank. Each coarse operation would then load weights and inputs onto the accelerator from the L2 memory at the start of each hwlib call. So we do DMA accesses in some way right now, but probably not in the way that you need for your platform.

Since our platform uses a single core, I have no experience with multicore and the hwlib currently is not targetted towards multicore devices. To be fair I also have no experience in programming multicore devices in general .

Either way, the HWlib is meant to be as lightweight as possible, only trying to abstract what is specific to our device, and try to keep as much code generated by TVM or other general tools.

I hope this answers your question,

Regards,

Josse

MichaelLee · July 20, 2021, 11:40am

Hi, @JosseVanDelm thank you for sharing this post, it’s really helpful! Could you also share a copy of your master’s thesis with me? My email is ltp0709@sina.com, thanks in advance!

areusch · July 20, 2021, 6:34pm

hi @JosseVanDelm @andrew_sto @MichaelLee @UCASHurui

thanks for these great discussions!

@jossevandelm I’m curious if you tried the BYOC flow as well with your accelerator? It’s a blunter tool than you’d like, but it’s another option beyond the tensorization intrinsic.

with regards to DMA: we currently don’t have a way to easily map DMA copies into microTVM now. There is an analogue in the C++ runtime: we map each DLDevice device_type to an implementation of DeviceAPI, and this implementation handles copying memory between the executor memory space and accelerator memory space. it’s likely we’ll need to implement something similar in microTVM/C runtime. we’re currently doing some initial exploration along these lines.

especially for topics like these, where development in tvm/main is at an early stage, I’d like to note we have a bi-weekly microTVM meetup. these are announced on the forum and held over Zoom. feel free to join as you like for higher-bandwidth discussion on these topics.

-Andrew

JosseVanDelm · July 20, 2021, 7:48pm

Thanks for coming back to this post Andrew!

To answer your question; I did not try out the BYOC flow, mainly because we did not have a “library” available to program the accelerator. The way (mainly) the accelerator + (to a lesser extent) RISC-V core still is/was programmed prior to the TVM suggested flow is through a set of ad-hoc python scripts which programs the system’s memories directly. I also picked tensorization over BYOC since my own C++ knowledge is rather limited.

I can see why it would make sense to use BYOC in our case for using the accelerator since it is already coarse-grained. Although in my understanding it would remain an issue to support layers which are not supported by the accelerator (and which should be offloaded to the RISC-V core), since currently no NN layer library is available for PULPissimo’s RISC-V cores, though I believe the people behind PULP are currently working on this. When such a library would be available I think BYOC would be a far more interesting choice than it was a few months ago. But maybe I’m mistaken and waiting for a library is not necessary with BYOC?

And thanks for the reminder for the meetups, I haven’t had time yet to join any, but I would definitely be interested to join one sometime!

Best regards,

Josse

areusch · July 20, 2021, 7:51pm

But maybe I’m mistaken and waiting for a library is not necessary with BYOC?

I’m not sure what performance you’d get without one, but the generic schedule should produce working operator implementations using the c or llvm backends.

andrew_sto · August 13, 2021, 2:17pm

Hi Andrew,

Thanks for the information about DMA. I have however seen that the VTA module includes some DMA functionality by substituting copy semantics that are tagged with a pragma. This seems easier for me to use and seems to be adapted to combining DMA with padding, since this approach detects memory copies at the tir level and then uses InjectCopyIntrin.

My initial approach was to use tensorize for copy te computations, but I’m now adapting the approach in VTA to my needs.

I need to figure out how to make it work to store data directly to the output buffer. Using my tensorize approach for store did not work since the output tensor needs to be attached at the root of the schedule. Thus, a temporary buffer with the same size as the output buffer was created which is undesirable in my case.

Regards,