[nnvm] Is there any guide to add a new backend?

Is there any document which describes the structure of a backend and the most necessary part needed?

Thanks a lot

Is there a specific backend you have in mind: CPU, GPUs, custom accelerator, FPGA?

I am thinking of custom accelerator and I have a few questions in mind.

  1. How to add backend-specific graph optimizations?

  2. How to translate NNVM IR to a backend-specific IR?

  3. I guess it is up to the backend runtime to generate the backend ISA from the backend IR, is that right?

I am trying to find some clues in VTA, which was just released. Could you please give me some pointers?

Thanks

Thank you for the clarification.

  1. What kind of graph optimization are we talking about? Operator fusion, data layout transformations, etc.?
  2. This will require an operator library implemented with TVM. You’ll need to build the layers of your software stack down from TVM to your hardware programming interface. You can read our tech report (https://arxiv.org/abs/1807.04188) to have an idea of how we did this with VTA. It’s all open-sourced so you can dig into the code to see how we did it.
  3. This is one option, unless you have a compiler. We decided to go with a JIT approach since it was easy to put together.

Thierry

Thanks for your reply,

I am looking into layer fusion research (like this paper: https://ieeexplore.ieee.org/document/7783725/). This method depends heavily on the hardware's capabilities and on-chip memory, so we need to take backend information into account when doing graph optimization.

That’s an excellent question, it’s definitely extremely relevant to accelerators.

There are two levels at which you'd want to perform optimizations to make sure that your computation fits within SRAM constraints. First, at the tensor level, blocking/tiling is essential to keep the working set small enough to fit in SRAM.
You can find an example on blocking here: https://docs.tvm.ai/vta/tutorials/matrix_multiply_opt.html#sphx-glr-vta-tutorials-matrix-multiply-opt-py
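To make the blocking idea concrete, here is a minimal plain-Python sketch (not the TVM API): a matrix multiply computed tile by tile, with a helper that checks whether a given tiling's working set fits an assumed SRAM budget. The `SRAM_BYTES` and `ELEM_BYTES` values are assumptions for illustration only.

```python
# Hypothetical sketch: blocked matrix multiply where tile sizes are chosen
# so that one tile's working set fits an assumed on-chip SRAM budget.

SRAM_BYTES = 32 * 1024  # assumed on-chip buffer size (illustrative)
ELEM_BYTES = 4          # assumed element size in bytes (illustrative)

def tile_fits(bm, bn, bk):
    """Working set of one tile: an A block, a B block, and a C block."""
    return (bm * bk + bk * bn + bm * bn) * ELEM_BYTES <= SRAM_BYTES

def blocked_matmul(A, B, bm, bn, bk):
    """C = A @ B computed tile by tile, so only one tile's data
    would need to be resident on-chip at a time."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, bm):          # iterate over C row blocks
        for j0 in range(0, N, bn):      # iterate over C column blocks
            for k0 in range(0, K, bk):  # iterate over reduction blocks
                for i in range(i0, min(i0 + bm, M)):
                    for j in range(j0, min(j0 + bn, N)):
                        for k in range(k0, min(k0 + bk, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

In a real TVM schedule the same effect comes from split/tile primitives; this just shows the loop structure those transformations produce.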

Next, graph-level optimization is a little trickier, since it's architecture dependent. For instance, if you were to fuse a fully connected layer with its activation in order to minimize data movement to DRAM, the question is: do you apply the matrix multiplication first, store the temporary results in SRAM, then perform the activation on that temporary data? Or do it all at once, because you have an activation functional unit that can perform a post-processing step right after the GEMM computation, therefore requiring no SRAM buffering?
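The two options above can be sketched in plain Python (a conceptual sketch, not real hardware or TVM code): a "staged" version that buffers the GEMM result before applying the activation, standing in for SRAM buffering, and a "fused" version that applies the activation per element as results are produced, as an activation functional unit would.

```python
# Hypothetical sketch of the two fusion strategies for GEMM + activation.

def relu(x):
    return x if x > 0 else 0

def staged(A, B):
    """GEMM first, buffer the result (stand-in for SRAM), then activate."""
    M, K, N = len(A), len(A[0]), len(B[0])
    tmp = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
           for i in range(M)]                  # intermediate buffer
    return [[relu(v) for v in row] for row in tmp]

def fused(A, B):
    """Activation applied immediately per element; no intermediate buffer."""
    M, K, N = len(A), len(A[0]), len(B[0])
    return [[relu(sum(A[i][k] * B[k][j] for k in range(K))) for j in range(N)]
            for i in range(M)]
```

Both produce identical results; the difference is purely in intermediate storage, which is exactly why the right choice depends on the accelerator's functional units and buffer sizes.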

So the answer to your graph question is: it depends. As of now, we implement optimizations at the tensor level to make sure that everything fits on chip. We are actively working on NNVM-level graph support; it's not 100% automated right now, but we are considering exactly the question you just asked.

I hope this helped!

Also to answer your original question, here’s the blog post on the new VTA hardware backend: https://tvm.ai/2018/07/12/vta-release-announcement.html

And our technical report is here: https://arxiv.org/abs/1807.04188

Let us know if you have questions/suggestions.

Yes, we do need to consider backend information to do the fusion, but most of the heavy lifting will still be in the code generation part after you fuse the ops, which is what the current VTA stack does.

You can find an example here of how we do schedule automation for 2d convolution to make sure that tiling fits SRAM resource restrictions: https://github.com/dmlc/tvm/blob/master/vta/python/vta/top/vta_conv2d.py#L19

This is a simple approach: explore candidate schedules, filter out those that don't fit in SRAM, and take the best remaining schedule based on overall data movement to DRAM. For more complex memory hierarchies, I recommend a learning-based approach, like the one presented in AutoTVM: https://arxiv.org/pdf/1805.08166.pdf
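The explore/filter/pick loop described above can be sketched in a few lines of plain Python (a toy model, not the actual VTA scheduler): enumerate tilings for C = A(M×K) · B(K×N), discard those whose working set exceeds an assumed SRAM budget, and pick the one minimizing a rough DRAM-traffic estimate. All constants and the traffic model are illustrative assumptions.

```python
# Hypothetical sketch: brute-force schedule search with an SRAM-fit filter
# and a DRAM-traffic cost model, in the spirit of vta_conv2d.py's tiling search.

from itertools import product

SRAM_BYTES = 32 * 1024  # assumed on-chip buffer size (illustrative)
ELEM_BYTES = 4          # assumed element size in bytes (illustrative)

def dram_traffic(M, N, K, bm, bn, bk):
    """Rough model: each A tile is re-read once per column block of C,
    each B tile once per row block of C; C is written out once."""
    loads_a = M * K * (N // bn)
    loads_b = K * N * (M // bm)
    stores_c = M * N
    return (loads_a + loads_b + stores_c) * ELEM_BYTES

def best_schedule(M, N, K, candidates=(8, 16, 32, 64)):
    """Explore tilings, filter out those that overflow SRAM,
    and return (tiling, traffic) minimizing estimated DRAM traffic."""
    feasible = []
    for bm, bn, bk in product(candidates, repeat=3):
        working_set = (bm * bk + bk * bn + bm * bn) * ELEM_BYTES
        if working_set <= SRAM_BYTES:
            feasible.append(((bm, bn, bk), dram_traffic(M, N, K, bm, bn, bk)))
    return min(feasible, key=lambda s: s[1])
```

Under this model the search naturally prefers the largest output tiles that still fit, since bigger `bm`/`bn` amortize more input reloads, which is the same pressure that makes a learned cost model attractive once the hierarchy gets more complicated.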