Motivation
Android NNAPI is a graph-level neural network inference API provided by the Android runtime. It is intended to give machine learning inference the best execution setup available on a mobile device, based on available resources, including custom accelerators from SoC vendors. This RFC aims to enable TVM to generate code for Android NNAPI through the Relay BYOC framework.
Proposal
We have been working on enabling Android NNAPI for TVM for a while, and the project, including the partitioning and converter (codegen) components, was presented at the TVM Conference 2020. We now intend to contribute the project to TVM upstream, in the hope that it benefits the community and receives better support and maintenance.
The project is divided into two parts: the partitioner and the converter. The partitioner determines which parts of the graph are executed with Android NNAPI and which parts are not. The converter is registered through the BYOC framework so that it is invoked during compilation.
Compilation to Runtime Process
- Start with a Relay IR graph and params from TVM frontend
- Invoke the partitioner with both the graph and params to annotate/transform the graph and/or params for Android NNAPI
- Compile the module; this invokes the converter to generate code for the Android NNAPI sub-modules
- Export the compiled module with `tvm.contrib.ndk.create_shared` as the compiling function (linking to the Android NNAPI library and the TVM runtime is required)
- Write a C++ runtime script that performs inference through the TVM API. This script can be compiled into a shared library and invoked through JNI on Android
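As an illustration, the flow above might look like the following from the Python side. This is a minimal sketch: `partition_for_nnapi`, `onnx_model`, and `shape_dict` are hypothetical placeholders (the partitioner's actual entry point is discussed below), and the ONNX frontend and target string are just example choices.

```python
import tvm
from tvm import relay
from tvm.contrib import ndk

# 1. Obtain a Relay module and params from a TVM frontend
#    (ONNX is just an example; onnx_model/shape_dict are placeholders).
mod, params = relay.frontend.from_onnx(onnx_model, shape=shape_dict)

# 2. Partition for Android NNAPI. `partition_for_nnapi` is a
#    hypothetical name for the proposed partitioner's entry point.
mod, params = partition_for_nnapi(mod, params)

# 3. Compile; this invokes the converter (relay.ext.nnapi_compiler)
#    on each partitioned sub-module.
target = "llvm -mtriple=aarch64-linux-android"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# 4. Export with the NDK toolchain; linking against the Android NNAPI
#    library and the TVM runtime happens at this step.
lib.export_library("model.so", fcompile=ndk.create_shared)
```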
Building with Android NNAPI Support
The partitioner is a pure out-of-source Python implementation, so no build options are involved.
The converter includes an in-source C++ stub (registered as `relay.ext.nnapi_compiler`) that handles the infrastructure in a single C++ translation unit (C/C++ headers, TVM PackedFunc wrappers, ..., much like `codegen_c`), and an out-of-source Python code generator (registered as `relay.ext.nnapi_compiler.relayir_to_nnapi_converter`). The C++ stub passes each Relay function it receives to the Python code generator, which generates the Android NNAPI code. Currently there is no build option controlling whether to build with Android NNAPI support, since the C++ that requires building is a simple stub and should not take much time to compile.
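For reference, the Python-side registration could be sketched as below. Only the registered name comes from this RFC; the signature (a Relay function in, generated C++ source out) and the helper functions are assumptions for illustration.

```python
import tvm

def relay_to_nnapi_json(func):
    """Hypothetical helper: lower a Relay function into the custom JSON
    description of an Android NNAPI model."""
    raise NotImplementedError

def nnapi_json_to_cpp(json_model):
    """Hypothetical helper: emit the C++ class that builds and executes
    the NNAPI model described by `json_model`."""
    raise NotImplementedError

# The C++ stub looks this function up by its registered name and calls
# it for each Relay function handed to the NNAPI external compiler.
@tvm.register_func("relay.ext.nnapi_compiler.relayir_to_nnapi_converter")
def relayir_to_nnapi_converter(func):
    return nnapi_json_to_cpp(relay_to_nnapi_json(func))
```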
Since most of the project is currently out-of-source with respect to the TVM repository, we will move the code into the `contrib` folders of TVM over the upcoming months and send it out as PRs. There will inevitably be some restructuring during the process, so feel free to share your thoughts on how we should place the code in TVM.
The Partitioner (Annotate)
While the partitioner we presented at the conference is based on RPC profiling of operator costs, we have concerns about contributing it to TVM upstream because it requires setting up a phone. However, we believe some sort of annotator is still required, so we need the community's suggestions on this part.
We’ve listed a few options that come to mind:
- Instead of RPC profiling, assign a heuristic static/calculated cost to each operator and use the proposed DP partitioning algorithm, or
- Register a few operators that we believe should be hardware-accelerated on most devices and use the official annotation method (`transform.AnnotateTarget`; see the sketch below), or
- Still contribute the RPC-profiling-based partitioner, but with a detailed document on how to set up RPC on Android phones
The first two options allow users without a phone at hand to cross-compile with Android NNAPI integration, which is easier to start with, while the third option should perform better than the first two in most cases but requires a complex RPC setup.
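For the second option, the annotation could be sketched as follows. This assumes the external compiler name `nnapi_compiler` (chosen here to mirror the converter's registered name) and uses `nn.conv2d` as a stand-in for the whitelisted operators; the BYOC passes shown are TVM's official mechanism.

```python
import tvm
from tvm import relay

# Whitelist an operator we believe is hardware-accelerated on most
# devices. The attribute key must be "target.<compiler-name>".
@tvm.ir.register_op_attr("nn.conv2d", "target.nnapi_compiler")
def _conv2d_supported(expr):
    return True

def annotate_for_nnapi(mod):
    # Official BYOC pipeline: annotate supported ops, merge adjacent
    # supported regions, then split them out as external functions.
    seq = tvm.transform.Sequential([
        relay.transform.AnnotateTarget(["nnapi_compiler"]),
        relay.transform.MergeCompilerRegions(),
        relay.transform.PartitionGraph(),
    ])
    return seq(mod)
```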
Which way should we go? Please feel free to leave your comments below.
The Converter (Codegen)
The converter generates C++ code that implements sub-graphs on top of Android NNAPI. In the converter, we do the following:
- Convert the Relay IR sub-graph into a custom JSON format, which is designed to describe Android NNAPI models
- The JSON description is then converted into a single C++ class that sets up the Android NNAPI model and provides an execution handle
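The schema of the custom JSON format is not spelled out in this RFC; purely as a hypothetical illustration of the first step, the description of a one-operation model could look something like the following (all field names are invented for the example).

```python
# Invented field names; the real schema is defined by the converter.
nnapi_model_desc = {
    "operands": [
        {"name": "x",   "type": "TENSOR_FLOAT32", "shape": [1, 224, 224, 3]},
        {"name": "y",   "type": "TENSOR_FLOAT32", "shape": [1, 224, 224, 3]},
        {"name": "out", "type": "TENSOR_FLOAT32", "shape": [1, 224, 224, 3]},
    ],
    "operations": [
        {"op": "ADD", "inputs": ["x", "y"], "outputs": ["out"]},
    ],
    "inputs":  ["x", "y"],
    "outputs": ["out"],
}
```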
The reason for using a C++ class instead of a straightforward compute function is that Android NNAPI's programming model involves its own graph construction and compilation phases; if these are placed in the class constructor and the instance is created with the C++ `static` keyword, they are performed only once across multiple invocations.
The converter shall only perform the conversion on a best-effort basis, at most applying some semantics-preserving rewrites, e.g. the expansion of batch normalization, which means users should make sure their partitioner produces suitable sub-graphs. The current implementation only supports the conversion of float32/float16 sub-graphs with a limited set of operators, and all inputs to the sub-graph (including model weights) are fed by the TVM runtime instead of being loaded from the filesystem.
Testing
Currently only the converter is tested. It is tested op-by-op by invoking the converter directly instead of going through the BYOC framework. The generated C++ class is compared against a predefined reference C++ class in a text-based manner as an equivalence check.
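A single op-by-op test case could be sketched like this; `convert_to_nnapi` stands in for the converter's direct entry point, and the reference file path is made up for the example.

```python
from tvm import relay

def test_conv2d_conversion():
    # Build a one-operator Relay function to feed the converter
    # directly, bypassing the BYOC framework.
    x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
    w = relay.var("w", shape=(8, 3, 3, 3), dtype="float32")
    func = relay.Function([x, w], relay.nn.conv2d(x, w))

    # Hypothetical direct invocation of the converter.
    generated_cpp = convert_to_nnapi(func)

    # Text-based equivalence check against a predefined C++ class.
    with open("tests/expected/conv2d.cc") as f:
        assert generated_cpp == f.read()
```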
PR Plan
- 1st PR: Add basic converter that supports only `nn.conv2d`
- 2nd PR: Add partitioner/annotator
- 3rd PR: Documentation
- More PRs: More operator support
Thanks for reading. Any discussion is welcome.