Feedback on TVM port to custom accelerator

Hi everyone,

In this post I would like to ask for some feedback on an ongoing research project on retargeting TVM to our custom-developed RISC-V microcontroller with accelerator. I cannot share too many details about the hardware since it is still in development. The goal is to use the platform in an ultra-low-power embedded setting.

The hardware is quite specialized and constrained, and it is not always obvious how to optimally map neural network graphs onto it. We found it quite challenging to find out which parts of the TVM stack were already useful/repurposable for our purposes, which is why we are reusing bits and pieces in various places.

Alongside porting the TVM stack to our platform, we are also making an OpenCL-like library (the HWlib) to provide a programming interface to the accelerator. In TVM we registered our own code generation backend based on the C backend. We also made a custom AOT compiler program, since this is not currently available in (micro)TVM and the overhead of a more dynamic runtime seemed unnecessarily complex for our purposes. It basically deserializes the JSON graph that comes out of the Relay compilation step, puts all of the compiled TVM function library calls in a big C main file, and does some tensor (de)allocation as well. It also dumps the weights in a C header file. Our microcontroller does not run an operating system, also because we think this provides unnecessary overhead for the applications we target.
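To make the AOT step described above more concrete, here is a minimal sketch in Python. It assumes a simplified graph JSON in which `nodes` are listed in execution order, `"op": "null"` marks inputs and weights, and each operator carries its compiled function name in `attrs["func_name"]`. The buffer naming (`buf_<index>`), the `weights.h` header name and the example function names are hypothetical, and tensor (de)allocation is left out entirely.

```python
import json


def emit_c_main(graph_json: str) -> str:
    """Turn a simplified graph JSON into a C main() that calls each operator in order."""
    graph = json.loads(graph_json)
    lines = [
        '#include "weights.h"  /* hypothetical header holding the dumped weights */',
        "",
        "int main(void) {",
    ]
    for idx, node in enumerate(graph["nodes"]):
        if node["op"] == "null":
            continue  # inputs and weights are plain buffers, not function calls
        # Each input entry starts with the index of the node that produces it.
        args = [f"buf_{src[0]}" for src in node["inputs"]]
        args.append(f"buf_{idx}")  # output buffer of this operator (assumed convention)
        lines.append(f"    {node['attrs']['func_name']}({', '.join(args)});")
    lines.append("    return 0;")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical two-operator graph: conv2d followed by relu.
EXAMPLE_GRAPH = json.dumps({
    "nodes": [
        {"op": "null", "name": "data", "inputs": []},
        {"op": "null", "name": "weight", "inputs": []},
        {"op": "tvm_op", "name": "conv", "attrs": {"func_name": "fused_nn_conv2d"},
         "inputs": [[0, 0, 0], [1, 0, 0]]},
        {"op": "tvm_op", "name": "relu", "attrs": {"func_name": "fused_nn_relu"},
         "inputs": [[2, 0, 0]]},
    ]
})

if __name__ == "__main__":
    print(emit_c_main(EXAMPLE_GRAPH))
```

A real generator would also have to declare and size the buffers and handle operators with several outputs; the sketch only shows how the call sequence falls out of the graph.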

Right now our current development flow looks like this:

  1. We test a single operator from Relay (e.g. conv2d)
  2. We try to adapt a Relay op strategy from the generic strategy
  3. In the Relay strategy we try to tensorize as much of the operator as possible, by assigning as much of the computation as possible to a HWlib function call
  4. We put the JSON, TVM function library and weights that come out of compilation in our own AoT compiler program.
  5. We put the C files in the adapted GCC compiler for the RISC-V microcontroller we are using. Our own AoT compiler program also makes sure the C code compiles with this GCC compiler.

In this way the generated C code runs on the accelerator as much as possible, and the parts that the accelerator does not support are compiled for the RISC-V core using the generic Relay strategy. We are currently facing some challenges, though, for which we would like your comments/opinions/recommendations:

  • Right now a lot of calls to the HWlib are very inefficient, as they require a lot of data reformatting on the RISC-V before the data is accessible to the accelerator. It is weird/annoying that the data layout already gets specified in Relay, so we would probably need to insert a data layout (TIR?) optimization pass somewhere along the computation graph. It’s also not always clear to us which parts of those calls are best offloaded to the hardware library, and which parts (like input padding, data layout transformation) can also be known/provided by TVM.

  • Our accelerator supports int8, but also int4 and int2. At some point we will probably need to look into the Bring Your Own Datatypes framework, but we also still need to look into quantization support in TVM. Any recommended reference work would be very useful here! It seems quite challenging that quantization lives across the entire compilation framework, starting even before deployment.

  • We have looked into using BYOC, but we felt like this was a very direct mapping of Relay to instructions, which bypasses a lot of scheduling/optimization magic (Tensor Expressions, AutoTVM) from the rest of the TVM stack. It also did not seem like a very scalable solution to us, since it seems like we would have to map a lot of Relay instructions directly to a HWLib function call, which we also have to develop ourselves.

  • We have looked into VTA, but VTA is quite different from our platform. We don’t have a fully fledged workstation host device at hand, apart from the bare metal microcontroller. Also we would like to compile as much as possible statically and AoT, and not in a JIT-fashion. Maybe there are some accelerator specific parts we can reuse though. If someone can share their experience on reusing some of this work that would be very insightful!

  • Some functions of the HWlib require parameters that have to be set during compilation based on the weights. It is not clear to us how this fits in with the rest of the compilation stack. Could this be implemented in a TIR pass for example?
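On the sub-byte datatypes mentioned in the list above: one piece that is independent of TVM is how int4 values are packed into bytes before they reach the accelerator. The sketch below is only an illustration in plain Python. The nibble order (low nibble first) and the two's-complement encoding are assumptions, not a description of our hardware format, and `pack_int4`/`unpack_int4` are hypothetical helper names.

```python
def pack_int4(values):
    """Pack signed 4-bit values (-8..7) two per byte, low nibble first (assumed order)."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count with a zero
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed 4-bit value")
        # Masking with 0xF keeps the two's-complement bit pattern of each nibble.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Inverse of pack_int4: return the first `count` signed 4-bit values."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


if __name__ == "__main__":
    data = [-8, 7, 1]
    blob = pack_int4(data)
    print(blob.hex(), unpack_int4(blob, len(data)))
```

The same idea extends to int2 with four values per byte. Whether such packing belongs in the compiler, the weight dump or the HWlib is exactly the kind of question raised above.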

We really want this to work so our hardware can be used in other, more applied projects. Nevertheless, I should admit that the development so far has been anything but straightforward. For example, it is not always clear which part of an optimization step we should implement where (input network graph, Relay, HWLib, TE, …?). This means that currently a lot of development happens very vertically across the entire stack, which makes it difficult to divide work in the project group or to get started with development on new issues in the first place. This issue seems to be true for most (embedded) DL compilation stacks though.

We hope that fellow developers in the community can share their thoughts and experiences on these issues and what they think about our approach. If you need more details on my part to understand our flow, please let me know.

Thank you all very much!

Best regards,

Josse

Hi @JosseVanDelm ,

Thanks for the post! Some thoughts:

Right now a lot of calls to the HWlib are very inefficient, as they require a lot of data reformatting on the RISC-V before being accessible to the accelerator. It is weird/annoying that the data layout already gets specified from Relay, we would probably need to insert a data layout (TIR?) optimization pass along the computation graph at some point there.

and

Our accelerator supports int8, but also int4 and int2. At some point we will probably need to look into the Bring your own datatype framework, but we also still need to look into quantization support in TVM. Any recommended reference work would be very useful here!

Tagging @jwfromm in case he knows more here.

We have looked into using BYOC, but we felt like this was a very direct mapping of Relay to instructions, which bypasses a lot of scheduling/optimization magic (Tensor Expressions, AutoTVM) from the rest of the TVM stack. It also did not seem like a very scalable solution to us, since it seems like we would have to map a lot of Relay instructions directly to a HWLib function call, which we also have to develop ourselves.

Is tensorization an option here, or do you need to do more with the TIR after schedule generation?

We have looked into VTA, but VTA is quite different from our platform. We don’t have a fully fledged workstation host device at hand, apart from the bare metal microcontroller. Also we would like to compile as much as possible statically and AoT, and not in a JIT-fashion. Maybe there are some accelerator specific parts we can reuse though. If someone can share their experience on reusing some of this work that would be very insightful!

This is an area I’m quite interested in, but we haven’t done anything on this that I know of.

Some functions of the HWlib require parameters that have to be set during compilation based on the weights. It is not clear to us how this fits in with the rest of the compilation stack. Could this be implemented in a TIR pass for example?

It seems like you could have a TIR pass that replaces free variables with constants after doing that computation.
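As a rough illustration of that idea, here is a toy sketch in plain Python. It does not use TVM's TIR classes or pass infrastructure: `Var`, `Call` and `substitute` are hypothetical stand-ins for an expression tree, `hwlib_conv2d` is a made-up call name, and `derive_shift` is an invented example of a parameter computed from the weights at compile time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A free variable in the toy expression tree."""
    name: str


@dataclass(frozen=True)
class Call:
    """A call to an external (e.g. HWlib) function with a tuple of arguments."""
    func: str
    args: tuple


def substitute(node, bindings):
    """Return a copy of `node` with every Var named in `bindings` replaced by its constant."""
    if isinstance(node, Var):
        return bindings.get(node.name, node)
    if isinstance(node, Call):
        return Call(node.func, tuple(substitute(a, bindings) for a in node.args))
    return node  # anything else is already a constant


def derive_shift(weights):
    """Hypothetical weight-dependent parameter: bits needed for the largest magnitude."""
    return max(abs(w) for w in weights).bit_length()


if __name__ == "__main__":
    call = Call("hwlib_conv2d", (Var("data"), Var("weights"), Var("shift")))
    print(substitute(call, {"shift": derive_shift([3, -12, 7])}))
```

In an actual pass, the weights would have to be available as constants at the point where the pass runs, which is an assumption about where in the pipeline this is placed.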

Also tagging @tqchen who may have some more ideas of related work here.

Andrew

Thanks for your reply @areusch !

Yes, I’m currently trying to use tensorization to map entire convolutions and data preparation steps (data layout, padding) to a HWLib function call, but the process hasn’t been particularly smooth for such coarse computations, I’m afraid. Having TVM transform the data seems suboptimal. Creating large tensorization intrinsics is also tricky; right now, for example, it looks like I would have to write a separate TIR pass, because I cannot merge e.g. Relu(Conv(Pad(ChgDataLayout(input)),filter)) into one intrinsic: tensorize/TIR does not allow for creating an intrinsic with nested computations. The TIR pass I’m envisioning could detect those sequential operations and maybe merge them into one as a workaround for this problem.
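To sketch what such a merging step could look like in principle, here is a toy pattern rewrite in plain Python. It works on nested tuples rather than on real TIR, so it says nothing about how this would be written against TVM's APIs. The op names (`relu`, `conv`, `pad`, `layout`) and the fused call name `hwlib_conv_block` are hypothetical.

```python
def _is(node, op):
    """True if `node` is an operator node (a tuple) whose operator name is `op`."""
    return isinstance(node, tuple) and node[0] == op


def fuse_conv_block(expr):
    """Rewrite relu(conv(pad(layout(x)), w)) into one hwlib_conv_block(x, w) node.

    Expressions are nested tuples of the form (op_name, arg0, arg1, ...);
    anything that is not a tuple is treated as a leaf (a tensor name).
    """
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [fuse_conv_block(a) for a in args]  # rewrite children first
    if op == "relu" and _is(args[0], "conv"):
        _, conv_input, weights = args[0]
        if _is(conv_input, "pad") and _is(conv_input[1], "layout"):
            return ("hwlib_conv_block", conv_input[1][1], weights)
    return (op, *args)


if __name__ == "__main__":
    chain = ("relu", ("conv", ("pad", ("layout", "input")), "filter"))
    print(fuse_conv_block(chain))
```

Expressions that do not match the full chain are returned unchanged, so unsupported combinations would still fall back to the generic path.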

I’m not sure how to write a TIR pass yet, but what I would like to do in the future is to maybe skip some data layout transformations automatically. Right now the data has to be transformed every time it is sent to and from the accelerator, just because most standard convolutions in Relay expect NCHW, for example. We should not be doing data layout transformations if two consecutive operations are performed on the accelerator. I’m not sure if it would be best to implement this as a Relay pass or a TIR pass. If anyone can confirm that this is possible or can point me to some work on this, that would be great, as I’ve not had time to look into creating my own pass.
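The effect I have in mind can be shown with a toy example in plain Python, independent of whether it ends up as a Relay or TIR pass. The operator list, the `ACCEL` layout name and `accel_conv2d` are hypothetical; the sketch only removes a layout transform that is immediately undone by the next one. TVM's Relay also has a `ConvertLayout` pass, which may be worth checking before writing something custom.

```python
def drop_redundant_layout_transforms(ops):
    """Remove adjacent layout_transform pairs where the second undoes the first.

    Each op is a tuple: ("layout_transform", src, dst) or (op_name,) for anything else.
    """
    result = []
    for op in ops:
        prev = result[-1] if result else None
        if (
            prev is not None
            and op[0] == "layout_transform"
            and prev[0] == "layout_transform"
            and prev[1] == op[2]  # previous source is this destination
            and prev[2] == op[1]  # previous destination is this source
        ):
            result.pop()  # the pair cancels out, so drop both
        else:
            result.append(op)
    return result


if __name__ == "__main__":
    ops = [
        ("layout_transform", "NCHW", "ACCEL"),
        ("accel_conv2d",),
        ("layout_transform", "ACCEL", "NCHW"),
        ("layout_transform", "NCHW", "ACCEL"),
        ("accel_conv2d",),
        ("layout_transform", "ACCEL", "NCHW"),
    ]
    for op in drop_redundant_layout_transforms(ops):
        print(op)
```

With two consecutive accelerator convolutions, only the transform into the accelerator layout at the start and the transform back at the end remain.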

At some point I’d also like to include some autotuning in the grand scheme of things (probably not with actual timing measurements, but rather with a representative model). But I haven’t had time to look into this or into how much effort it would take to implement. I’m also afraid the gains of autotuning with coarse tensorization might be quite minimal. There might be some gains possible for the RISC-V scheduling, but I’m not sure.

Okay I’ll be sure to look into this!

Also thank you very much for including other people in the discussion!