[pre-RFC][BYOC] RISC-V CSI-NN2 Compute Library integration

Summary

Introduce the CSI-NN2 Compute Library into TVM to accelerate inference performance on RISC-V CPUs with the Vector Extension.

Motivation

Recently, Alibaba's T-Head XuanTie RISC-V C906 processor took first place in all 4 indicators on the latest Tiny v0.7 list released by the AI benchmark MLPerf. So, it is a good time to support RISC-V CPUs with the vector extension in TVM.

CSI-NN2 Compute Library (CSINN2) is an open-source project that provides hand-crafted assembly routines for RISC-V CPUs with the vector extension. It is compatible with both the v0.7.1 and v1.0 RISC-V vector extension instruction standards. This integration looks at how we can accelerate CPU performance in TVM for RISC-V devices such as the XuanTie C906 using CSINN2. The idea is that by converting operators from a Relay graph to CSINN2 we can achieve faster inference times thanks to these routines. The initial intention is to improve performance for FP32 models; with further improvements to the integration, this will extend to quantized models and support for a wider range of operators.

PS: If you are interested in the XuanTie C906 processor, the D1 development board is a good choice.

Proposal

We have been working on integrating CSINN2 using the BYOC infrastructure. Our current implementation is based on CSourceModule. Here is an overview of the flow from compilation to runtime that we aim to achieve (a sketch of the intended user-facing compile step follows the lists below):

• Front-end graph.

• Lower to a Relay graph.

• Use the codegen to convert Relay operators to CSINN2. In this stage, we generate a model.c that records the layer attributes and the CSINN2 graph representation, and a model.params that saves the constant tensors (and quantization information).

• Compile model.c into model.so.

CSINN2 runtime module

• Load model.so and model.params.

• Supply input and output buffers.
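
To make the flow concrete, here is a minimal sketch of the intended user-facing compile step. It assumes a partitioning helper named partition_for_csinn2 under tvm.relay.op.contrib.csinn2; these names are illustrative, not a finalized API.

```python
# Illustrative sketch only: the csinn2 contrib module and partition_for_csinn2
# are assumed names, not part of a finalized API.
import tvm
from tvm import relay
from tvm.relay.op.contrib import csinn2  # hypothetical module

# mod, params come from any front-end importer (e.g. relay.frontend.from_onnx).
mod, params = get_network()  # placeholder for the front-end graph

# Annotate and partition the Relay graph so supported operators are offloaded.
mod = csinn2.partition_for_csinn2(mod, params)

# Build for a RISC-V cross-compilation target; the CSINN2 codegen emits
# model.c / model.params for the offloaded subgraphs.
target = "llvm -mtriple=riscv64-unknown-linux-gnu"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

lib.export_library("model.so", cc="riscv64-unknown-linux-gnu-gcc")
```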

Building with CSINN2 support

We plan to use QEMU to simulate everything on an x86 machine, so we can conveniently test and reuse the same code on both x86 and RISC-V devices. The current implementation adds one CMake build option, USE_CSINN2, which controls whether the CSINN2 runtime is enabled.

We include a script under docker/ubuntu_install_csinn2.sh which pulls CSINN2 from its GitHub repository and makes cross-compiling CSINN2 for RISC-V easy within TVM.

Codegen and compilation

Before codegen, we pre-process the graph that the codegen receives. We use some of TVM's passes, such as FoldConstant, together with some custom passes to optimize the graph. We also provide a checker here to determine which operators are supported by CSINN2. CSINN2 has its own graph representation, so in this stage we convert every operator from Relay to CSINN2 and generate the corresponding graph representation; both are written into model.c. All constant tensors (and quantization information) are saved in model.params. At the end of codegen, we obtain a model.params and a model.so compiled from model.c.
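
As an illustration of what the checker and pre-processing could look like, the sketch below uses TVM's standard BYOC op-attribute registration and partitioning passes. The "target.csinn2" annotation key and the predicates are assumptions for illustration, not the exact implementation:

```python
# Illustrative sketch: the "target.csinn2" annotation key and these predicates
# are assumptions; the real checker also inspects dtypes, layouts and
# operator attributes.
import tvm
import tvm.ir
from tvm.relay import transform

@tvm.ir.register_op_attr("nn.conv2d", "target.csinn2")
def _conv2d(expr):
    # Offer only FP32 conv2d to CSINN2 in the initial integration.
    return expr.args[0].checked_type.dtype == "float32"

@tvm.ir.register_op_attr("nn.relu", "target.csinn2")
def _relu(expr):
    return True

def preprocess(mod):
    """Pre-processing run before codegen: constant folding plus the standard
    BYOC annotation / partitioning passes."""
    seq = tvm.transform.Sequential([
        transform.InferType(),
        transform.FoldConstant(),
        transform.AnnotateTarget("csinn2"),
        transform.MergeCompilerRegions(),
        transform.PartitionGraph(),
    ])
    return seq(mod)
```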

Runtime Support

We implement a simple runtime. It just loads model.so and model.params and supplies the input and output buffers. In this way, the runtime can easily be replaced by other AIoT applications.
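
As a usage illustration only, the interface is intended to be roughly this simple; the module and function names below (csinn2_runtime, load_model, set_input, run, get_output) are hypothetical, not a finalized API:

```python
# Hypothetical usage sketch: csinn2_runtime and its methods are illustrative
# names only, not a finalized API.
import numpy as np
import csinn2_runtime  # hypothetical wrapper around the simple runtime

rt = csinn2_runtime.load_model("model.so", "model.params")
rt.set_input(0, np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
rt.run()
output = rt.get_output(0)
```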

Operator Support

Currently, we have implemented about 50 operators (FP32/FP16/quantized). In this RFC, the integration provides support for the following operators using FP32 precision:

• conv2d

• relu

• maxpool2d

• softmax

Further support for a wider range of operators will follow.
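
For reference, a Relay graph built only from the operators above, like the sketch below, would be offloaded to CSINN2 in its entirety:

```python
# A small Relay function composed only of the operators listed above
# (conv2d, relu, maxpool2d, softmax).
from tvm import relay

def make_example():
    data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
    weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
    out = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
    out = relay.nn.relu(out)
    out = relay.nn.max_pool2d(out, pool_size=(2, 2), strides=(2, 2))
    out = relay.nn.softmax(out, axis=1)
    return relay.Function(relay.analysis.free_vars(out), out)
```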

Testing

We currently have 2 different types of tests, all of which reside under python/contrib/test_csinn2.

test_network.py: The network tests run a network end to end under QEMU and compare against known-good results (currently MobileNet v1).

test_operatorname.py: The unit tests exercise the individual operator sequences that can be offloaded to CSINN2. These tests also run inference under QEMU with random data, which allows end-to-end testing of the TVM integration for each supported operator.
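
A unit test for a single operator could look roughly like the sketch below, which compares the offloaded result against TVM's native result on random data. partition_for_csinn2 is the same assumed helper as above, and in CI the offloaded build would run under QEMU rather than locally:

```python
# Illustrative unit-test sketch; partition_for_csinn2 is an assumed helper, and
# in CI the offloaded build would run under QEMU via TVM's RPC infrastructure.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def test_relu():
    shape = (1, 16, 56, 56)
    x = relay.var("x", shape=shape, dtype="float32")
    mod = tvm.IRModule.from_expr(relay.nn.relu(x))
    data = np.random.uniform(-1, 1, size=shape).astype("float32")

    def run(mod, use_csinn2):
        if use_csinn2:
            from tvm.relay.op.contrib import csinn2  # hypothetical module
            mod = csinn2.partition_for_csinn2(mod)
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(mod, target="llvm")
        dev = tvm.cpu()
        rt = graph_executor.GraphModule(lib["default"](dev))
        rt.set_input("x", data)
        rt.run()
        return rt.get_output(0).numpy()

    ref = run(mod, use_csinn2=False)
    out = run(mod, use_csinn2=True)
    np.testing.assert_allclose(out, ref, rtol=1e-5, atol=1e-5)
```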

Future improvements

The integration in its current form doesn't add support for most of the operators in CSINN2; it is mostly a proof of concept. Below is a series of items we hope to add or improve upon soon.

• Support a wider range of operators for FP32 (and FP16).

• Support for quantized operators.

Thanks, any thoughts are appreciated.
