BYOC: Add VeriSilicon's NPU support

Summary

Support VeriSilicon’s NPU through the BYOC framework, bringing the TVM ecosystem to our customers.

Motivation

VeriSilicon’s NPU is widely deployed at the edge by many SoC vendors. Meanwhile, more and more customers want to build their applications with TVM, so we want to bring this capability to them.

Guide-level explanation

NA

Reference-level explanation

Our device implementation has two major components. The first is the code generator (code-gen), which we modeled on the Arm Ethos-N support. In this part, we visit all the IR nodes and gather tensor information. Once all the information has been extracted from the TVM framework, we create our own model with the TIM-VX APIs, then apply layout inference if the original model comes from TFLite. Layout inference converts the layout from NHWC to NCHW, since our low-level software requires the NCHW format. Once the graph is constructed and converted to the correct layout, we compile it into a binary format in memory, which we call NBG (network binary graph). The TVM framework can then package this in-memory NBG into a dynamic library (.so) file.
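To make the layout step concrete, here is a minimal sketch of an NHWC to NCHW conversion on a flat, row-major buffer. It is written in plain Python for illustration only; the function names are hypothetical and this is not the TIM-VX layout inference code, which works on a whole graph rather than a single tensor.

```python
from typing import List, Sequence, Tuple

# Output axis i takes input axis NHWC_TO_NCHW[i].
NHWC_TO_NCHW: Tuple[int, ...] = (0, 3, 1, 2)


def permute_shape(shape: Sequence[int], perm: Sequence[int]) -> List[int]:
    """Return the shape after reordering its axes by `perm`."""
    return [shape[axis] for axis in perm]


def nhwc_to_nchw(data: Sequence[float], shape: Sequence[int]) -> List[float]:
    """Reorder a flat row-major NHWC buffer into a flat row-major NCHW buffer."""
    n, h, w, c = shape
    if len(data) != n * h * w * c:
        raise ValueError("buffer length does not match shape")
    out = [0.0] * len(data)
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    src = ((ni * h + hi) * w + wi) * c + ci  # NHWC offset
                    dst = ((ni * c + ci) * h + hi) * w + wi  # NCHW offset
                    out[dst] = data[src]
    return out
```

For a 1x2x2x3 NHWC tensor, `permute_shape` yields `[1, 3, 2, 2]`, and the values of each channel become contiguous in the output buffer.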

Class TensorMakerImpl is responsible for gathering tensor information (shape and data type) for later tensor creation. Class GraphMakerImpl creates the graph, tensors, and nodes with the TIM-VX APIs.
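The gathering pass can be pictured roughly as follows. This is a sketch over a toy IR in plain Python, not the actual TensorMakerImpl (which is C++ and walks Relay nodes); `Node`, `TensorSpec`, and `collect_tensor_specs` are hypothetical names, and the shapes in the example are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TensorSpec:
    shape: Tuple[int, ...]
    dtype: str


@dataclass
class Node:
    """Toy stand-in for an IR node; not a TVM class."""
    name: str
    op: str            # "input", "constant", or an operator name
    spec: TensorSpec   # type of the value this node produces
    args: List["Node"] = field(default_factory=list)


def collect_tensor_specs(root: Node) -> Dict[str, TensorSpec]:
    """Post-order walk that records each node's output spec exactly once."""
    specs: Dict[str, TensorSpec] = {}

    def visit(node: Node) -> None:
        if node.name in specs:
            return  # shared node already recorded
        for arg in node.args:
            visit(arg)
        specs[node.name] = node.spec

    visit(root)
    return specs


# Example graph: conv(data, weight)
data = Node("data", "input", TensorSpec((1, 224, 224, 3), "uint8"))
weight = Node("weight", "constant", TensorSpec((32, 3, 3, 3), "uint8"))
conv = Node("conv", "conv2d", TensorSpec((1, 112, 112, 32), "uint8"), [data, weight])
specs = collect_tensor_specs(conv)
```

A second pass (the role GraphMakerImpl plays in the description above) could then create backend tensors from `specs` before wiring up the operations.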

The second part runs this NBG at runtime. This part is quite simple: we only need to take care of the order of the inputs and outputs.
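As an illustration of the ordering concern, the sketch below arranges named input buffers in the order recorded at compile time and labels the results on the way out. The helper names are hypothetical and this is not the actual runtime code; it only assumes that the compiled artifact carries an ordered list of input and output names.

```python
from typing import Dict, List, Sequence


def order_inputs(expected_names: Sequence[str], feeds: Dict[str, object]) -> List[object]:
    """Arrange named input buffers in the order recorded at compile time."""
    missing = [name for name in expected_names if name not in feeds]
    extra = [name for name in feeds if name not in expected_names]
    if missing or extra:
        raise KeyError(f"missing inputs: {missing}, unexpected inputs: {extra}")
    return [feeds[name] for name in expected_names]


def name_outputs(output_names: Sequence[str], results: Sequence[object]) -> Dict[str, object]:
    """Attach the compile-time output names to positional results."""
    if len(output_names) != len(results):
        raise ValueError("output count does not match the compiled graph")
    return dict(zip(output_names, results))
```

Validating names up front means a mismatch surfaces as an error instead of as silently swapped tensors.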

Drawbacks

Rationale and alternatives

With the model precompiled into NBG format, it’s easy to deploy in a production environment. If we need to add support for a new operation, we only need to update the code-gen part; no update is required for the runtime libraries.

Prior art

NA

Unresolved questions

  • Need to add more operation/pattern support in the future.

Future possibilities

Maybe auto-search techniques can be applied, for example when we have a multi-core device and a multi-batch application.

hey @sven can you add some more details here? some suggested topics:

  • how do you expect users should use your accelerator with TVM (can you provide a small code snippet showing the build flow?)
  • can you explain a bit more about the way you model TIM-VX APIs? perhaps show a small code snippet.
  • are there additional build time requirements added to TVM, and what are they and how should someone obtain them?
  • what challenges did you face in integrating with TVM’s BYOC flow? are there things we should improve?
  • what is the test strategy you would like to take?
  • as mentioned earlier the initial PR is quite large. it would be great if you can split it into pieces for review. could you sketch out the pieces you propose to submit (e.g. graph partitioner/codegen, runtime, then one-operator-at-a-time, or maybe include one op in the initial submission and then the rest as follow-on PR?)

cc @comaniac if there are other details he thinks are important for BYOC RFC

you may also want to have a look at https://github.com/apache/tvm-rfcs/blob/main/0000-template.md for a template/suggested organization.

Thanks @areusch I think you have covered pretty much everything. One more word on the testing strategy: this is the part I would focus on more, because testing is always a big issue for BYOC backends due to the lack of a compilation environment and real devices in CI. Please be specific about how you plan to test and what you need to add to the current CI.

usage of our accelerator

I have a description of the build here. I can review it and try to make it cleaner.

For production, end users can compile their model on the host and distribute execution across the devices.

TIM-VX APIs

We have a programming guide; please check it. Thanks

Build time requirement

Do you mean how much compilation overhead is introduced when our backend is enabled?

Challenges with BYOC

  • Documentation for this kind of code-gen: the public docs previously only described exporting to JSON or C source format. At the very beginning, we could only learn it from the Ethos-N implementation.
  • It’s difficult to extract information from a CallNode; we had to work it out by debugging.
  • It’s difficult to learn which parameters/attributes are defined for an operation. We learned them from the Python API. If there were documentation, it would be easier.

Test strategy

We can provide an online GitHub runner instance; actually, we planned to do this with our fork. We’d like to plug a dev board into the GitHub runner so that it can be used by GitHub Actions.

Reduce patch size

Sure, I will find some internal resources to handle it. The first patch may bring up support for only a single operation, as you suggested.

@sven it would be great to provide a small overview of how to use your accelerator in “Guide-level explanation.” For instance, based on a cursory review of test_vsi_tflite_model_all.py, the flow in using your accelerator is quite similar to other TVM flows. have you tried using tvmc with it? it may make for a very simple Guide-level explanation :slight_smile:

Some things we are looking for in guide-level explanation:

  • what target string do you use?
  • we assume you compile with tvm.relay.build but please spell this out in addition to any extra passes you may need
  • how does a user compile and start the remote RPC server on the target hardware?
  • which target hardware is used as the demo?

regarding TIM-VX APIs:

We have a programming guide; please check it. Thanks

it would be great to link this from the RFC or from README.md. my original question is about src/relay/backend/contrib/vsi_npu/op_map/attribute.h. you define a lot of struct there. can you please explain the approach you’re using to model TIM operators in TVM? it would be great to avoid duplication of this or learn about the different motivations behind these structs so that we can combine the infrastructure in the future. e.g. the Ethos-U BYOC generator also needs to do some modeling of architecture specific calls as well.

regarding build-time requirement:

  • What additional libraries may need to be linked with TVM? where can I download them? how can we add them to the CI docker images, if you want to go that route?

regarding the test strategy:

  • do you have any kind of simulation? currently, we don’t test things in CI which can’t be run on public clouds. the reason is that it’s hard for others who don’t have a physical copy of your hardware to debug a problem they’ve caused with your unit test
  • we prefer to use the ci- docker containers to test anything which can run on linux.
  • we do recognize that this is limiting and want to pursue a strategy to allow us to test against real hardware. However, this strategy wouldn’t strictly stop a PR from being submitted (it is hard for us to tell if a problem occurred with your hardware-in-the-loop test due to the PR or a problem with the test-bench). It could be used as an advisory vote (if your test-bench is fast enough to run every PR) or as a nightly signal. Does this work for you? feel free to provide feedback on this strategy or suggest a better one–this is not yet implemented.

I’ll check internally and add more detail to “Guide-level explanation”. I haven’t tried tvmc yet.

We use attribute.h to make the conversion from TVM to TIM-VX clearer and more explicit. If TVM or BYOC could provide some native conversion, it would save us this work. For example, a conversion from tvm::relay::Shape to the native C++ type std::vector<uint32_t> would be helpful, because our API is a pure C++ interface with native C++ data types. Going further, TVM could also provide template specializations for backends that have their own data types for attributes.
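The kind of explicit conversion described here can be sketched as below. It is plain Python rather than the C++ in attribute.h, and `Conv2dAttrs`, `to_uint32_tuple`, and `convert_conv2d_attrs` are hypothetical names used only to show the idea: framework-side attribute values are validated and copied into a backend-side struct that holds native values only.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping, Tuple

UINT32_MAX = 2**32 - 1


def to_uint32_tuple(values: Iterable[object]) -> Tuple[int, ...]:
    """Convert shape-like values to plain ints, rejecting anything outside uint32 range."""
    result = []
    for value in values:
        number = int(value)
        if not 0 <= number <= UINT32_MAX:
            raise ValueError(f"{number} does not fit in uint32")
        result.append(number)
    return tuple(result)


@dataclass(frozen=True)
class Conv2dAttrs:
    """Hypothetical backend-side struct holding only native values."""
    strides: Tuple[int, ...]
    dilation: Tuple[int, ...]
    groups: int


def convert_conv2d_attrs(attrs: Mapping[str, object]) -> Conv2dAttrs:
    """Copy the attributes the backend needs out of a framework-side mapping."""
    return Conv2dAttrs(
        strides=to_uint32_tuple(attrs["strides"]),
        dilation=to_uint32_tuple(attrs["dilation"]),
        groups=int(attrs["groups"]),
    )
```

Keeping one small struct and one converter per operator makes each conversion explicit, at the cost of some duplication across operators, which is the trade-off raised in the question above.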

We need additional libraries, which are our prebuilt libraries. Let me add them to the README later.

The simulation is not ready yet. Our tests take around 10~20 minutes for several models (MobileNet v1/2/3, Inception v1/2/3/4). Does TVM have operator-level tests to make sure every backend implementation is compatible with TVM itself, something similar to the Android NNAPI CTS/VTS?

@sven thanks for your reply.

Could you assemble all of these changes we’ve discussed into an RFC and propose it to apache/tvm-rfcs repository? See RFC Workflow. Once accepted there, we will create a GH tracking issue and can begin merging code. In general, we’ve had a good discussion here so the focus now is just compiling the changes into a coherent RFC that can be used to maintain the code moving forward.

In general our main concern in merging code to begin with is ensuring we can maintain it and not break you guys as development proceeds in TVM. To that end I think it would be good to describe a small test strategy (we can iterate on this as you guys continue to develop test cases) and check-in test cases along with the code.

We unfortunately do not have an operator-level test at this time. It’s something we want to improve. I’d suggest you commit the simulator and add tests for each operator for your backend to tests/python/contrib. If we test mostly at the operator-level with the simulator, do you anticipate the tests will take as long (e.g. which takes longer, setup or simulation)? We will likely move most model-level end-to-end tests to a nightly in the medium term (no RFC has been posted about this yet).

To commit the simulator, you’ll likely need to file a new GH issue against apache/tvm following the “Update Docker CI Image” template. Please include the prebuilt libraries with this update and document the addition in the RFC.

Thanks, I’ll refine our work ASAP.