Motivation and Scope
The Arm Ethos-N series is a family of high-throughput, low-area neural network processors for ML inference from cloud to edge to endpoint. The processors and their software driver stack support a variety of popular neural networks, including CNNs and RNNs, for classification, object detection, image enhancement, speech recognition and natural language understanding. Arm has recently open-sourced the ethos-n-driver-stack. The intention of this RFC is to integrate the driver stack into TVM so that operations supported by the stack can be offloaded to Ethos-N neural network processors.
Over the past several months, work has been ongoing in the area of graph partitioning. We propose to build on top of this work by defining merge-composite patterns that partition the Relay graph into sections that can be offloaded by the bring-your-own-codegen (BYOC) infrastructure. The Ethos-N driver stack provides a compiler front-end, the Support Library (SL), that accepts a graph structure similar to the Relay graph structure. During the “compile” phase of BYOC, the Relay operators are passed to the SL, which builds an internal graph. This graph is then compiled into a command stream: a description of the processing steps required to execute the inference on the Ethos-N processor. The command stream is included in the generated module as a blob.
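Conceptually, merge-composite partitioning groups runs of Relay operators that match a pattern into a single composite function that BYOC can hand to the SL. The plain-Python sketch below models this on a linear list of operator names. It is not TVM's pattern-matching API (which matches dataflow patterns on the Relay graph itself), and the composite names and operator sequences in `PATTERN_TABLE` are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple, Union

# Hypothetical pattern table: composite name -> operator sequence to match.
# Names and sequences are illustrative placeholders, not the final set.
PATTERN_TABLE: List[Tuple[str, Tuple[str, ...]]] = [
    ("ethos-n.qnn_conv2d", ("qnn.conv2d", "nn.bias_add", "qnn.requantize")),
    ("ethos-n.qnn_avg_pool2d", ("cast", "nn.avg_pool2d", "cast")),
]


@dataclass
class Composite:
    """A matched run of operators that would be handed to the BYOC codegen."""
    name: str
    ops: List[str] = field(default_factory=list)


def merge_composite(ops: Sequence[str]) -> List[Union[str, Composite]]:
    """Greedily replace matching operator runs with Composite nodes.

    Patterns are tried in table order at each position; unmatched operators
    are left as-is and would stay on the default (CPU) code path.
    """
    out: List[Union[str, Composite]] = []
    i = 0
    while i < len(ops):
        for name, pattern in PATTERN_TABLE:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                out.append(Composite(name, list(pattern)))
                i += len(pattern)
                break
        else:
            out.append(ops[i])
            i += 1
    return out
```

For example, `merge_composite(["qnn.conv2d", "nn.bias_add", "qnn.requantize", "nn.softmax"])` yields one `ethos-n.qnn_conv2d` composite followed by the unmatched `nn.softmax`, which would remain on the CPU.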
The packed function that is also generated by the BYOC infrastructure calls into a runtime inference function, passing in the command stream. This function sets up the necessary buffers, if required, and the command stream is then executed by a driver library included in the Ethos-N driver stack.
A conversion needs to take place between the Relay operators and the SL operators, e.g. for tensor descriptors and attributes; in addition, some sequences of Relay operations are combined into a single SL operation. This conversion takes place when the composite functions are processed and handed over to the SL.
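For the tensor-descriptor part of this conversion, the sketch below illustrates the kind of mapping involved: the Relay shape, dtype, layout and quantization parameters are folded into a single SL-style descriptor. Both classes, the dtype enum names, and the assumption that the SL side expects NHWC are illustrative only; the real SL types are C++ classes provided by the driver stack.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RelayTensor:
    """Hypothetical stand-in for a Relay tensor type."""
    shape: Tuple[int, ...]
    dtype: str
    layout: str  # e.g. "NCHW" or "NHWC"


@dataclass(frozen=True)
class SLTensorInfo:
    """Hypothetical stand-in for a Support Library tensor descriptor."""
    dims: Tuple[int, ...]
    data_type: str
    data_format: str
    zero_point: int
    scale: float


# Illustrative dtype mapping; the enum-style names are placeholders.
_DTYPE_MAP: Dict[str, str] = {"uint8": "UINT8_QUANTIZED", "int8": "INT8_QUANTIZED"}


def to_sl_tensor(t: RelayTensor, zero_point: int, scale: float) -> SLTensorInfo:
    """Convert a 4-D quantized Relay tensor description into an SL-style one."""
    if len(t.shape) != 4 or sorted(t.layout) != sorted("NHWC"):
        raise ValueError("sketch only handles 4-D tensors with N, H, W, C axes")
    if t.dtype not in _DTYPE_MAP:
        raise ValueError(f"unsupported dtype: {t.dtype}")
    # Reorder the dimensions into NHWC (the assumed target layout in this sketch).
    dims = tuple(t.shape[t.layout.index(axis)] for axis in "NHWC")
    return SLTensorInfo(dims, _DTYPE_MAP[t.dtype], "NHWC", zero_point, scale)
```

Rejecting unsupported dtypes here, rather than later in the SL, is a design choice of the sketch; in the proposed flow such cases would normally be filtered out earlier by the IsSupported() checks.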
TVM supports a larger range of operators than the Ethos-N processor. To determine what is supported on Ethos-N, the SL provides an IsSupported() query mechanism. This will be used in the existing “check functions” as implemented in PR 5261.
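The check-function mechanism can be pictured as a table from composite name to a predicate that forwards the operator's attributes to the SL query. Anything that fails its check, or has no check registered, stays on the CPU. In this plain-Python sketch, `sl_is_supported_conv2d` is a hypothetical stand-in for IsSupported() with placeholder rules; it is neither the SL's real constraint set nor the PR 5261 API.

```python
from typing import Callable, Dict


def sl_is_supported_conv2d(attrs: dict) -> bool:
    """Hypothetical stand-in for the SL IsSupported() query for convolutions.

    The rules below are placeholders, not the Support Library's real constraints.
    """
    return attrs.get("dtype") == "uint8" and attrs.get("strides") in ((1, 1), (2, 2))


# Composite name -> check function (composite names are hypothetical).
CHECK_FUNCTIONS: Dict[str, Callable[[dict], bool]] = {
    "ethos-n.qnn_conv2d": sl_is_supported_conv2d,
}


def offload_target(composite_name: str, attrs: dict) -> str:
    """Offload only composites that have a check function and pass it."""
    check = CHECK_FUNCTIONS.get(composite_name)
    if check is not None and check(attrs):
        return "ethos-n"
    return "cpu"
```

Defaulting to the CPU when no check is registered keeps partitioning conservative: an operator is only offloaded once the SL has confirmed it can handle it.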
The integration requires changes in several areas.
The driver stack code can be cloned from the GitHub repository. A build script, similar to the one used for the existing Vulkan support, builds the driver stack libraries for use in TVM. Ethos-N support in TVM is enabled by setting the USE_ETHOSN configuration variable to the path of the driver stack. This causes the build process to pick up the required header files and libraries and compile in the support for Ethos-N, enables the relevant tests, and enables the graph partitioning code to detect Ethos-N compatible operations.
Partitioning pattern definitions are created for the operators that Ethos-N supports so that they are picked up by the graph partitioning code. A layer between the graph partitioning code and the SL translates the graph partitions (the composite functions) from Relay into Ethos-N compatible formats and adds the converted operators to the Ethos-N Support Library. The partition is then compiled, resulting in a command stream. The command stream and the constant data (weights), if any, are added to the generated module for this partition.
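One way to picture adding the command stream and constants to the generated module is a per-partition record serialized into a single blob that the runtime can split again at load time. The length-prefixed layout below is a hypothetical format chosen for illustration, not the one used by the integration.

```python
import struct
from typing import Dict, Tuple

# Hypothetical blob layout (all integers are little-endian uint32):
#   count, then per partition: len(name) name len(cmd) cmd len(consts) consts
Partitions = Dict[str, Tuple[bytes, bytes]]


def pack_partitions(parts: Partitions) -> bytes:
    """Serialize {name: (command_stream, constants)} into one blob."""
    out = bytearray(struct.pack("<I", len(parts)))
    for name, (cmd, consts) in parts.items():
        for chunk in (name.encode("utf-8"), cmd, consts):
            out += struct.pack("<I", len(chunk))
            out += chunk
    return bytes(out)


def unpack_partitions(blob: bytes) -> Partitions:
    """Inverse of pack_partitions: recover each partition's stream and constants."""
    (count,) = struct.unpack_from("<I", blob, 0)
    pos = 4
    parts: Partitions = {}
    for _ in range(count):
        chunks = []
        for _ in range(3):  # name, command stream, constants
            (size,) = struct.unpack_from("<I", blob, pos)
            pos += 4
            chunks.append(bytes(blob[pos:pos + size]))
            pos += size
        parts[chunks[0].decode("utf-8")] = (chunks[1], chunks[2])
    return parts
```

Keying records by partition name lets the per-partition function look up exactly the command stream it was compiled with; partitions without weights simply carry an empty constants chunk.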
The packed function that is compiled for each graph partition calls into a packed function in the TVM runtime to do the heavy lifting. It passes in the command stream for the section of the graph it is concerned with, and the input and output tensors. The runtime function sets up buffers using information stored in the command stream and calls into the Ethos-N driver library to execute the inference. The result of the inference is passed back as usual.
There are two sets of tests. The network tests exercise a network end to end and assume the hardware is available. These tests push a network through and compare the output against known good results. The Ethos-N driver stack is required; the tests are disabled if it is not available.
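The gating and comparison logic for the network tests can be sketched as follows. The `ETHOSN_DRIVER_STACK` environment variable used to detect the driver stack and the tolerance of one quantization step are assumptions made for this illustration, not details of the proposed tests.

```python
import os
from typing import Callable, Sequence


def ethosn_available() -> bool:
    """Hypothetical availability check: an env var pointing at the driver stack."""
    path = os.environ.get("ETHOSN_DRIVER_STACK", "")
    return bool(path) and os.path.isdir(path)


def matches_known_good(actual: Sequence[int], expected: Sequence[int], tol: int = 1) -> bool:
    """Element-wise comparison of quantized outputs, allowing +/- tol."""
    return len(actual) == len(expected) and all(
        abs(a - e) <= tol for a, e in zip(actual, expected)
    )


def run_network_test(run: Callable[[], Sequence[int]], expected: Sequence[int]) -> str:
    """Skip (rather than fail) when the driver stack is missing."""
    if not ethosn_available():
        return "skipped"
    return "pass" if matches_known_good(run(), expected) else "fail"
```

In a pytest suite the same check would typically feed `pytest.mark.skipif(not ethosn_available(), reason=...)`, so that a missing driver stack shows up as skipped tests rather than failures.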
The unit tests cover the individual operator sequences that can be offloaded to the Ethos-N processor. These tests do not need hardware to run and are enabled when the driver stack is available. Each test uses a small Relay graph as a model, partitions it and runs an inference with random data, once for the CPU and once for the Ethos-N processor. The CPU results are passed into the runtime inference code via a backdoor mechanism; when the inference is then run through the Ethos-N flow, these results are passed back, simulating a hardware inference. This allows end-to-end testing of the TVM integration for each of the supported operators.
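The backdoor round trip can be sketched in plain Python: the test computes reference results on the CPU, injects them, and the stand-in for the Ethos-N runtime returns them instead of touching hardware. The names `set_injected_outputs`, `ethosn_run` and `cpu_run`, and the doubling reference model, are hypothetical and only illustrate the control flow.

```python
import random
from typing import List, Optional, Sequence

# Hypothetical backdoor slot: results injected by the test, returned by the "hardware".
_injected_outputs: Optional[List[List[int]]] = None


def set_injected_outputs(outputs: Sequence[Sequence[int]]) -> None:
    global _injected_outputs
    _injected_outputs = [list(o) for o in outputs]


def ethosn_run(command_stream: bytes, inputs: Sequence[Sequence[int]]) -> List[List[int]]:
    """Stand-in for the runtime inference: returns injected results if present."""
    if _injected_outputs is None:
        raise RuntimeError("no hardware and no injected results")
    return [list(o) for o in _injected_outputs]


def cpu_run(inputs: Sequence[Sequence[int]]) -> List[List[int]]:
    """Hypothetical reference model: 2 * x, clamped to the uint8 range."""
    return [[min(255, 2 * x) for x in inputs[0]]]


def round_trip_test(seed: int = 0) -> bool:
    rng = random.Random(seed)
    inputs = [[rng.randint(0, 255) for _ in range(16)]]  # random test data
    expected = cpu_run(inputs)                            # 1. CPU inference
    set_injected_outputs(expected)                        # 2. hand results to the backdoor
    actual = ethosn_run(b"<compiled command stream>", inputs)  # 3. "Ethos-N" inference
    return actual == expected
```

Because the simulated hardware result equals the CPU result by construction, such a test verifies the plumbing (partitioning, compilation and the runtime call path) rather than the numerical behaviour of the processor.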
- Build system: cmake/modules/contrib/ethosn.cmake, cmake/util/FindEthosN.cmake
- Compiler code: src/relay/backend/contrib/ethosn directory. Parses graph partitions, converts them into SL data structures and compiles them into the module.
- Runtime code: src/runtime/contrib/ethosn directory. Runs an inference given a command stream and input/output tensors.
- Unit test code: tests/python/contrib/test_ethosn directory, in pytest format.
- Network test code: tests/python/contrib/test_ethos_compiler.py, also in pytest format.
To facilitate code review, the code changes are split into a number of PRs.
- Unit test support for the conv2d operator. This is the minimal amount of code that can work end to end.
- Build support. This includes CMake support and updates to scripts in tests/scripts/task_config_build_cpus.sh, driver stack build, minor changes in docker scripts.
- Runtime support. This is the inference code in the runtime.
- Unit test support. This is the directory that contains the common test code and a test for conv2d.
- Full operator support for Mobilenet. Complete the unit tests with all necessary operators, with a separate PR issued for most operators.
- End-to-end test for Mobilenet. This cannot be fully tested without hardware support, but we will add a round-trip test that re-uses results from a CPU execution so that the flow can be tested end to end, as described above.
- IsSupported() support, based on BYOC PR #5261.
The following steps add support for more operators and networks. The required changes follow the same pattern: add compiler code, unit tests for operators, add a network test once a network is supported. Most if not all of the changes are in the area of front-end compiler support and appropriate tests.
We intend to track the BYOC infrastructure development in TVM as it happens, since this work relies heavily on it.
As always, comments and suggestions are more than welcome.