[RFC][BYOC] Android NNAPI Integration

Motivation

Android NNAPI is a graph-level neural network inference API provided by the Android runtime. It is intended to provide the best execution setup for machine learning inference based on the available resources of a mobile device, including custom accelerators from SoC vendors. This RFC aims to enable TVM to codegen for Android NNAPI through the Relay BYOC framework.

Proposal

We have been working on enabling Android NNAPI for TVM for a while, and the project, including the partitioning and converter (codegen) components, was presented at the TVM Conference 2020. We now intend to contribute the project to the TVM mainline, in the hope that it benefits the community and receives better support and maintenance.

The project is divided into two parts: the partitioner and the converter. The partitioner determines which parts of the graph are executed with Android NNAPI and which parts are not. The converter is registered through the BYOC framework, so it gets invoked during the compilation process.

Compilation to Runtime Process

  1. Start with a Relay IR graph and params from TVM frontend
  2. Invoke the partitioner with both the graph and params to annotate/transform the graph and/or params for Android NNAPI
  3. Compile the module, which would invoke the converter to codegen for Android NNAPI as sub-modules
  4. Export the compiled module with tvm.contrib.ndk.create_shared as the compiling function (linking against the Android NNAPI library and the TVM runtime is required).
  5. Write a C++ runtime script to do the inference using TVM API. This script can be compiled into a shared library and be invoked through JNI on Android

Building with Android NNAPI Support

The partitioner is a pure out-of-source Python implementation, so no build options are involved here.

The converter includes an in-source C++ stub (registered as relay.ext.nnapi_compiler) that handles the infrastructure in a single C++ translation unit (C/C++ headers, TVM PackedFunc wrappers, …, just like codegen_c), and an out-of-source Python code generator. The C++ stub passes each Relay function it receives to the Python code generator (registered as relay.ext.nnapi_compiler.relayir_to_nnapi_converter), which codegens for Android NNAPI. Currently there is no build option to toggle Android NNAPI support, since the C++ side is a simple stub and should not add much compile time.

Since most of the project is currently out-of-source with regard to the TVM repository, we’ll move this code into TVM’s contrib folders and send it out as PRs in the upcoming months. There will inevitably be some restructuring along the way, so feel free to share your thoughts on where the code should live in TVM :slight_smile:

The Partitioner (Annotate)

While the partitioner we presented at the conference is based on RPC profiling of operator costs, we have concerns about contributing it to the TVM mainline because it requires setting up a phone. However, we believe some sort of annotator is still required, so we need the community’s suggestions on this part.

We’ve listed a few options that come to mind:

  • Instead of RPC profiling, we can assign a heuristic static/calculated cost and use the proposed DP partition algorithm for partitioning, or
  • Register a few operators which we believe should be hardware-accelerated on most devices, and use the official annotation method (transform.AnnotateTarget), or
  • Still contribute the RPC profiling-based partitioner, but with a detailed document on how to set up RPC on Android phones

The first two options enable users without a phone nearby to cross-compile with Android NNAPI integration, which is easier to start with, while the third option should perform better than the first two in most cases, but requires a complex RPC setup.
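To make option 1 concrete, here is a toy sketch of a heuristic static-cost partition over a linear chain of operators. The cost numbers, the transition overhead, and the restriction to chains are all made up for illustration; the actual DP partition algorithm works on full Relay graphs.

```python
# Toy sketch of option 1: static per-op costs plus a DP over a linear chain.
# All numbers and names are illustrative; real Relay graphs are DAGs and the
# proposed DP partition algorithm is more involved.

# Heuristic (made-up) costs per operator: (cpu_cost, nnapi_cost).
COSTS = {
    "nn.conv2d": (10.0, 2.0),
    "nn.relu": (1.0, 0.8),
    "transpose": (1.0, 3.0),  # cheap on CPU, assumed slow through NNAPI
}
TRANSITION = 2.5  # modeled overhead of crossing the CPU/NNAPI boundary

def partition_chain(ops):
    """Return a 'cpu'/'nnapi' label per op minimizing the modeled total cost."""
    # dp[device] = (best cost of a prefix ending on `device`, its assignment)
    dp = {"cpu": (0.0, []), "nnapi": (0.0, [])}
    for op in ops:
        per_device = dict(zip(("cpu", "nnapi"), COSTS[op]))
        new_dp = {}
        for dev, op_cost in per_device.items():
            candidates = [
                (cost + op_cost + (TRANSITION if prev != dev else 0.0),
                 assign + [dev])
                for prev, (cost, assign) in dp.items()
            ]
            new_dp[dev] = min(candidates)
        dp = new_dp
    return min(dp.values())[1]
```

With these toy numbers, a chain like `["nn.conv2d", "nn.relu", "transpose", "nn.conv2d"]` stays entirely on NNAPI because the conv2d savings outweigh the transpose penalty, while an isolated transpose stays on the CPU.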

Which way should we go? Please feel free to leave your comments below.

The Converter (Codegen)

The converter generates C++ code for sub-graphs in the Android NNAPI format. In the converter, we do the following:

  1. Convert the Relay IR sub-graph into a custom JSON format, which is designed to describe Android NNAPI models
  2. The JSON description gets converted into a single C++ class that sets up the Android NNAPI model and provides an execution handle
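For illustration, the intermediate description of a single-convolution model could look roughly like the following. The field names below are hypothetical and do not reflect the actual schema of our JSON format, which we’ll document in the PRs.

```python
import json

# Hypothetical sketch of the intermediate JSON model description; the schema
# and field names here are illustrative only.
model_desc = {
    "operands": [
        {"name": "data",   "type": "TENSOR_FLOAT32", "shape": [1, 3, 224, 224]},
        {"name": "weight", "type": "TENSOR_FLOAT32", "shape": [64, 3, 7, 7]},
        {"name": "out",    "type": "TENSOR_FLOAT32", "shape": [1, 64, 112, 112]},
    ],
    "operations": [
        {"op": "ANEURALNETWORKS_CONV_2D",
         "inputs": ["data", "weight"], "outputs": ["out"]},
    ],
    # All model inputs, weights included, are fed by the TVM runtime.
    "inputs": ["data", "weight"],
    "outputs": ["out"],
}

serialized = json.dumps(model_desc, indent=2)
```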

The reason for using a C++ class instead of a straightforward compute function is that Android NNAPI’s programming model involves its own graph construction and compilation phases. If these are placed in the class constructor and the instance is created with the C++ static keyword, they run only once across multiple invocations.

The converter only performs the conversion on a best-effort basis, at most with some semantic-preserving rewrites, e.g. the expansion of batch normalization, which means the user must make sure their partitioner produces suitable sub-graphs. The current implementation only supports converting float32/float16 sub-graphs with a limited set of operators, and all inputs to the sub-graph (including model weights) are fed by the TVM runtime instead of being loaded from the filesystem.
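To illustrate what such a semantic-preserving rewrite does: inference-time batch normalization has fixed parameters, so it can be folded into a per-channel multiply and add before conversion. This is a generic numeric sketch, not the converter’s actual code.

```python
import math

def expand_batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time batch norm into multiply/add form.

    y = gamma * (x - mean) / sqrt(var + eps) + beta  ==  scale * x + shift,
    with scale and shift folded from the (fixed) batch norm parameters.
    """
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    shift = [b - s * m for b, s, m in zip(beta, scale, mean)]
    # x is laid out [channel][element] for simplicity of the sketch.
    return [[s * e + t for e in chan] for chan, s, t in zip(x, scale, shift)]
```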

Testing

Currently only the converter is tested. It is tested op-by-op by invoking the converter directly instead of going through the BYOC framework. The resulting C++ class is compared against a predefined reference C++ class in a text-based manner as an equivalence check.
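A minimal sketch of this kind of equivalence check could canonicalize whitespace before diffing. Our real canonicalization does more than this; the helper names here are made up.

```python
import difflib
import re

def canonicalize(src):
    """Collapse whitespace so cosmetic formatting differences don't fail the diff."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in src.splitlines())
    return [line for line in lines if line]

def assert_same_code(generated, expected):
    """Raise AssertionError with a unified diff if the two sources differ."""
    gen, exp = canonicalize(generated), canonicalize(expected)
    if gen != exp:
        diff = "\n".join(difflib.unified_diff(exp, gen, "expected", "generated"))
        raise AssertionError("generated C++ differs from the reference:\n" + diff)
```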

PR Plan

  1. 1st PR: Add basic converter that supports only nn.conv2d
  2. 2nd PR: Add partitioner/annotator
  3. 3rd PR: The documents
  4. More PRs: More operator support

Thanks for reading. Any discussion is welcome.


Thanks for the RFC; it looks promising :slight_smile:

Also cc @zhiics, @trevor-m, @FrozenGene

The first two options enable users without a phone nearby to cross-compile with Android NNAPI integration, which is easier to start with, while the third option should perform better than the first two in most cases, but requires a complex RPC setup. Which way should we go? Please feel free to leave your comments below.

We definitely need at least one solution that doesn’t require on-device profiling. I would suggest having 1 or 2 plus 3, and letting users configure the approach based on their requirements. For choosing between 1 and 2, there are some points you can refer to:

The second approach is indeed more general and easier to implement, because heuristic costs may be inaccurate across different devices. Based on our experience, the most important point to consider during annotation/partition is the number and size of offloaded subgraphs. Intuitively, we want to merge as many subgraphs as possible to reduce overhead, but if we only annotate high-cost operators such as conv2d, it’s likely that you’ll get many small subgraphs cut apart by other operators such as transpose.

Accordingly, you may consider a hybrid approach as the TensorRT integration does: first annotate all supported operators regardless of their costs, and merge them using the provided pass. After the partition, you could have another pass that determines whether a subgraph should be converted to NNAPI based on its cost. The cost here can be either static or profiled. If a subgraph should remain on the TVM backend, you simply remove the corresponding Relay function attribute.

Ref: tvm/tensorrt.py at 91e07e1f3a7fe6ca047bf2acf57880f0b5393395 · apache/tvm · GitHub
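The prune step of this hybrid approach could be sketched as follows: plain Python over dict-based "subgraphs" with a made-up static cost model. In TVM it would instead be a Relay pass that strips the Compiler function attribute from subgraphs not worth offloading.

```python
OFFLOAD_OVERHEAD = 5.0  # modeled fixed cost of offloading one subgraph (made up)

def prune(subgraphs, op_benefit):
    """Keep a subgraph on NNAPI only if its modeled benefit beats the overhead.

    Each subgraph is a dict like {"ops": [...]}; op_benefit maps op names to a
    static (or profiled) speedup estimate. A real implementation would instead
    remove the "Compiler" attribute from the partitioned Relay function.
    """
    for sg in subgraphs:
        benefit = sum(op_benefit.get(op, 0.0) for op in sg["ops"])
        sg["target"] = "nnapi" if benefit > OFFLOAD_OVERHEAD else "tvm"
    return subgraphs
```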

  1. The JSON description gets converted into a single C++ class that sets up the Android NNAPI model and provides an execution handle

Would you elaborate a bit more on how the flow works in the BYOC codegen? In particular, when does the compilation happen? IIUC, the flow could be either of the following:

lib = relay.build(...)   # Option 1: generate JSON and store it in lib.
lib.export_library(...)  #           compile the JSON to the C++ class.

lib = relay.build(...)   # Option 2: generate JSON and compile it to the C++ class directly.

The main difference is when and how g++ is invoked, and each approach has pros and cons. The first approach lets TVM invoke g++, but users have to provide additional arguments such as the NNAPI library path. The second approach, on the other hand, hides this detail in your NNAPI codegen, but it means you have to invoke g++ yourself and handle all the configuration.

Testing

We need to test both annotation/partition and the converter. For annotation/partition, we can check the partitioned Relay graph to see if it has the expected structure; please refer to the unit tests of other BYOC integrations. For the converter, it’s not safe to compare the source code of the generated C++ classes, since the formatting matters and even one extra space could fail the test. Of course, the most reliable way is to test the runtime directly, but that is unrealistic in the CI. For now the only solution I can think of is testing whether the compilation succeeds or not.

PR Plan

If we are going to test the compilation with NNAPI, we will need to add the corresponding library to the CI, so it may require another PR to update the Docker image.


Thanks for the RFC. It looks great in general.

For the annotator, I agree with @comaniac’s suggestion that having 1 or 2 along with 3 would be better. I also think option 2 is both easier and better.

For PRs, I would suggest merging PR1 and PR2 together so that we have an end-to-end test from the beginning, since the first one should be small anyway. But this means we need the CI set up with the required packages/libraries.

@comaniac , @zhiics Thanks for your great replies. Let me just summarize and reply to these:

Partitioning/Annotation

Thank you for the suggestion @comaniac . I’ll implement the select, merge, and prune approach. It looks promising for producing a partition that is good enough in most cases.

The C++ class that sets up Android NNAPI models

The programming model of Android NNAPI is as follows:

  1. Construct model graphs using Android NNAPI (ANeuralNetworksModel_create(), ANeuralNetworksModel_addOperand(), ANeuralNetworksModel_addOperation(), …)
  2. Compile the constructed model (ANeuralNetworksCompilation_create(), …)
  3. Setup execution instances of the compiled model (ANeuralNetworksExecution_create(), …)

All of these happen at runtime. For the runtime model construction and compilation, I can put them in the C++ class constructor so that a model is constructed and compiled only once.

This means the flow would look more like this:

  1. TVM BYOC generates the C++ class that includes ANeuralNetworksModel_create(), ANeuralNetworksCompilation_create(), ANeuralNetworksExecution_create(), …
  2. The result of 1., which is C++ source, gets compiled by the Android NDK (i.e. clang++) into a TVM-format shared library via lib.export_library()
  3. The result of 2. constructs, compiles, and executes the Android NNAPI model during runtime when invoked

I hope this clarifies the somewhat mysterious compilation process :slight_smile: .

Testing

Thanks for the advice on checking Relay structures for the annotation part; I’ll reference other unit tests when writing these :+1: . However, IIUC, this approach only applies to the non-profiling partition, so the profiling-based partition remains untested. Testing the profiling-based partition is troublesome due to (1) the need to set up RPC in the testing environment and (2) the inconsistency of profiling results. IMHO, these are not solvable given the constraints. What do you think?

For the converter part, due to Android NNAPI’s runtime compilation phase, testing only for Android NDK compilation success is not really reassuring. Passing Android NDK compilation guarantees that the shared library can be loaded by the TVM runtime, but it does not guarantee a faithful replica of the Relay sub-graph, which is why we resorted to a text-based diff. I agree that the text-based approach is a bit too strict for C++ source code, even though we’ve already made some effort to canonicalize the C++ text, but it does ensure the correct behavior of the converter.

In summary:

  • Testing for Android NDK compilation
    • Pros: Tolerant of changes in the generated code, which may be preferred when integrated into the TVM testing pipeline
    • Pros: Easier to implement (no predefined C++ source code)
    • Cons: Does not prevent corrupted Relay-to-NNAPI conversion
  • Text-based diff
    • Pros: Ensure correct Relay-to-NNAPI conversion
    • Cons: Harder to implement (require predefined C++ source code)
    • Cons: Easy to break

This turns out to be a trade-off between ensuring correct behavior and ease of maintenance. Apart from the unit testing of the converter, this also hinders end-to-end testing, so @comaniac , could you give this a second thought and let me know what you think?

PR Plan

If the Android NDK compilation approach is to be taken, I’ll have another PR to add Android NDK to TVM CI images.

@zhiics PR 1 and 2 can surely be merged and sent out together to enable end-to-end tests :slight_smile: .

Thanks for the clarification. So the codegen has to generate and compile C++ code into a shared library, while the runtime constructs a model graph (or engine). That seems clear to me, and we can discuss implementation details, such as when and where to invoke clang++, in the PR.

For testing the partition, we definitely cannot test the profiling-based partition directly in the CI. Instead, it is reasonable to mock the profiling APIs with hard-coded latencies, so no RPC is involved in testing.
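For example, the mocking could look like this. The `nnapi_partitioner.profile_op` entry point below is hypothetical; it stands in for whatever function actually issues the RPC profiling request.

```python
from unittest import mock

# Hard-coded latencies standing in for on-device RPC profiling results.
FAKE_LATENCIES = {"nn.conv2d": 0.8, "nn.dense": 1.2, "transpose": 0.05}

def fake_profile(op_name, *args, **kwargs):
    """Deterministic stand-in for the RPC-based operator profiler."""
    return FAKE_LATENCIES[op_name]

# In a unit test we would patch the partitioner's real profiling entry point
# (the module/function names below are hypothetical):
#
# with mock.patch("nnapi_partitioner.profile_op", side_effect=fake_profile):
#     mod = nnapi_partitioner.partition(mod, params)
```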

For testing the converter, I guess it’s fine as long as the reference C++ code is not too long, so that you can hard-code it into the unit test. Accordingly, I suggest the following unit tests:

  1. When testing a small graph with 1-2 ops, we have a hard-coded C++ code string to compare with the generated C++ code. After the code matches, we try to compile it to see if there is any issue.
  2. When testing a larger graph (e.g., the whole network), we only try to compile the generated C++ code, without text-based checking.

My point is, if the expected C++ code is hard-coded in the unit test, anyone can dive in and fix the test once it fails. In this way, we keep it maintainable.

@comaniac Great idea to only test for compilation success on large sub-graphs. I agree with this approach. :+1:

For the clang++ part, it currently gets invoked in tvm.contrib.ndk.create_shared, which is passed as fcompile to lib.export_library(). I guess it’s fine as it is? The concern, if any, is that the arguments need to be overridden to link against libneuralnetworks.so and the TVM sources/dependencies, which may be a bit unfamiliar for beginners. Should we create some kind of custom compilation function that presets these options for end users?
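Such a custom compilation function could simply preset the extra options before delegating to tvm.contrib.ndk.create_shared. This is a sketch: the exact option list is illustrative, and it assumes the usual Android NDK environment setup.

```python
# Hypothetical fcompile wrapper that presets NNAPI-related link options before
# delegating to tvm.contrib.ndk.create_shared. The option list is illustrative.

def nnapi_link_options(extra=None):
    """Build the linker options assumed necessary for an NNAPI-enabled module."""
    options = ["-lneuralnetworks", "-shared", "-fPIC"]
    if extra:
        options.extend(extra)
    return options

def create_nnapi_shared(output, objects, options=None):
    # Deferred import so this sketch stays importable without TVM installed.
    from tvm.contrib import ndk
    ndk.create_shared(output, objects, options=nnapi_link_options(options))

# Usage: lib.export_library("deploy.so", fcompile=create_nnapi_shared)
```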

Sounds good to me for now. We can check whether it is too user-unfriendly in the PR and improve it if needed.