Feedback on TVM port to custom accelerator

areusch · March 30, 2021, 3:57am

Thanks for the post! Some thoughts:

Right now a lot of calls to the HWlib are very inefficient, as they require a lot of data reformatting on the RISC-V before the data is accessible to the accelerator. It is weird/annoying that the data layout is already specified in Relay; we would probably need to insert a data layout (TIR?) optimization pass somewhere along the computation graph.

and

Our accelerator supports int8, but also int4 and int2. At some point we will probably need to look into the Bring Your Own Datatypes framework, but we also still need to look into quantization support in TVM. Any recommended reference work would be very useful here!

Tagging @jwfromm in case he knows more here.
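On the sub-byte types: whatever route ends up being used in TVM, the int4 values will probably have to be stored packed into bytes at some point. Here is a rough sketch in plain Python of what that packing could look like. The nibble order (low nibble first) and the helper names `pack_int4` / `unpack_int4` are my assumptions for illustration, not anything from TVM or your HWLib.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count (assumed convention)
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in int4")
        # Masking with 0xF keeps the two's-complement nibble of each value.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit integers."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 represent -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]
```

If the accelerator expects a different nibble order or padding rule, only the two lines that build and split the byte would need to change.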

We have looked into using BYOC, but we felt like this was a very direct mapping of Relay to instructions, which bypasses a lot of scheduling/optimization magic (Tensor Expressions, AutoTVM) from the rest of the TVM stack. It also did not seem like a very scalable solution to us, since it seems like we would have to map a lot of Relay instructions directly to a HWLib function call, which we also have to develop ourselves.

Is tensorization an option here, or do you need to do more with the TIR after schedule generation?
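To illustrate what I mean by tensorization: the schedule keeps the outer tiling loops, and the innermost tile computation is swapped for a single call into the hardware library. The sketch below shows only the idea in plain Python; `hwlib_gemm_tile` and the tile size are hypothetical stand-ins, and in TVM itself this would go through the schedule's `tensorize` primitive with a declared tensor intrinsic rather than hand-written loops.

```python
TILE = 2  # hypothetical tile size handled by one accelerator call


def hwlib_gemm_tile(a_tile, b_tile, acc):
    """Stand-in for a hypothetical HWLib intrinsic: acc += a_tile @ b_tile."""
    for i in range(TILE):
        for j in range(TILE):
            for k in range(TILE):
                acc[i][j] += a_tile[i][k] * b_tile[k][j]


def matmul_tensorized(a, b, n):
    """n x n matmul (n divisible by TILE); each inner tile is one intrinsic call."""
    c = [[0] * n for _ in range(n)]
    for io in range(0, n, TILE):
        for jo in range(0, n, TILE):
            acc = [[0] * TILE for _ in range(TILE)]
            for ko in range(0, n, TILE):
                # Slice out the operand tiles the intrinsic works on.
                a_tile = [row[ko:ko + TILE] for row in a[io:io + TILE]]
                b_tile = [row[jo:jo + TILE] for row in b[ko:ko + TILE]]
                hwlib_gemm_tile(a_tile, b_tile, acc)
            # Write the accumulated tile back into the output.
            for i in range(TILE):
                c[io + i][jo:jo + TILE] = acc[i]
    return c
```

The outer loops stay visible to the scheduler (and so to tuning), while the HWLib only has to provide the one tile-sized kernel.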

We have looked into VTA, but VTA is quite different from our platform. We don’t have a fully fledged workstation host device at hand, apart from the bare metal microcontroller. Also we would like to compile as much as possible statically and AoT, and not in a JIT-fashion. Maybe there are some accelerator specific parts we can reuse though. If someone can share their experience on reusing some of this work that would be very insightful!

This is an area I’m quite interested in, but as far as I know we haven’t done any work on it yet.

Some functions of the HWlib require parameters that have to be set during compilation based on the weights. It is not clear to us how this fits in with the rest of the compilation stack. Could this be implemented in a TIR pass for example?

It seems like you could have a TIR pass that performs that weight-dependent computation at compile time and then replaces the corresponding free variables with constants.
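Roughly, the shape of such a pass would be: compute the parameters from the weights, then walk the IR and substitute constants for the matching free variables. The sketch below uses a toy IR in plain Python, not TVM's TIR classes; `Var`, `Const`, `Call`, `derive_params`, and the `shift` parameter are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Call:
    func: str
    args: tuple


def bind_free_vars(node, bindings):
    """Return a copy of `node` with every Var named in `bindings` replaced by a Const."""
    if isinstance(node, Var) and node.name in bindings:
        return Const(bindings[node.name])
    if isinstance(node, Call):
        return Call(node.func, tuple(bind_free_vars(a, bindings) for a in node.args))
    return node  # unbound Vars and existing Consts are left untouched


def derive_params(weights):
    """Hypothetical compile-time step: pick a shift amount from the largest |weight|."""
    return {"shift": max(abs(w) for w in weights).bit_length()}


# A call into the (hypothetical) HWLib with one parameter still free.
call = Call("hwlib_conv", (Var("input"), Var("shift")))
bound = bind_free_vars(call, derive_params([3, -17, 9]))
```

Here `bound` ends up with `shift` fixed to a constant while `input` stays a variable, which is the property the generated code would need.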

Also tagging @tqchen who may have some more ideas of related work here.

Andrew