We have implemented basic support for generating and executing Hexagon code from TVM. We would like to upstream our work, and we are asking for the community’s feedback on how to proceed.
Below is some background information about Hexagon itself, how it works, and what we’ve done. The “Summary” section at the end lists the main points along with questions. We look forward to your comments.
Hexagon DSP is a 32-bit processor with its own ISA, and despite being called a “DSP” it is really no different from a general-purpose processor. In other words, it can execute any kind of code, not just signal-processing computations, and with the addition of HVX it is well suited for computational workloads. In practice, in Snapdragon SoCs it always appears as a “subsystem”, where the main CPU is some version of ARM/AArch64; it never serves as the main CPU itself. As a consequence, all communication with the Hexagon processor goes through the ARM CPU.
The communication mechanism between ARM and Hexagon is called FastRPC. Any function that is callable across processor boundaries needs to have a definition in an IDL, and from the set of IDL definitions the build process ultimately produces two shared libraries: one for the ARM side (called the “stub”) and one for the Hexagon side (called the “skel”). The IDL compiler is included in the Hexagon SDK.
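To illustrate the shape of such a definition, here is a hypothetical IDL sketch. The interface name, method names, and parameter types below are illustrative only (our actual definitions live in a separate repository, and real IDL files pull in the SDK’s standard headers):

```
// Hypothetical FastRPC IDL sketch -- names and signatures are
// placeholders, not the actual definitions we use.
interface tvm_hexagon_remote {
  long load_library(in sequence<char> soname, rout long module_handle);
  long run_function(in long module_handle, in sequence<char> name);
  long close_library(in long module_handle);
};
```

From a file like this, the IDL compiler generates the C sources for both the ARM-side stub and the Hexagon-side skel.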
There is also a Hexagon simulator (for Linux on x86), which allows users to run Hexagon code without Hexagon hardware. The simulator exists as a standalone program, and as a library that can be linked into an x86 Linux binary. The simulator library provides a mechanism for the x86 application to access the memory of the simulated process, which can serve as another cross-processor communication mechanism (although much different from FastRPC). The simulator is a part of Hexagon toolchain, which is also a part of the Hexagon SDK.
While the ARM CPU (typically) runs Android, the OS functionality on Hexagon is rather limited. On hardware, the Hexagon processor runs a special RTOS, and the simulator implements a set of built-in system calls, but neither is as fully featured as, for example, Linux.
The current implementation utilizes Hexagon in a somewhat limited way: it follows the GPU model, i.e. only computational kernels (loop nests marked as “pipeline”) are offloaded. The primary reason for this was to avoid having to support the TVM runtime on Hexagon. In the longer term the goal is to offload entire subgraphs to Hexagon, and for the sake of testing it should be possible to offload the entire graph.
We use LLVM to generate Hexagon code from TVM IR.
We support both execution environments: simulator and hardware. The hardware must run Android, and the SoC must have a Hexagon CDSP (we don’t support ADSP at the moment).
The simulator always executes a complete ELF binary, just like binaries executed from a Linux shell. To facilitate loading modules (i.e. shared libraries) and executing functions from them, we implemented a simple driver (a “simulator runtime”). That runtime is a process that waits for commands from the program that instantiated the simulator (that is, the TVM runtime); it implements the simulator’s end of the DeviceAPI as well as the ability to load shared libraries and run functions from them. The executable for this process must be compiled with the Hexagon toolchain, while TVM itself can be compiled with any C++ compiler.
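Conceptually, the driver is a small command loop. The following is a minimal sketch of that idea, not the actual driver code: the command names, payload formats, and the `channel` object standing in for the simulator’s memory-based communication mechanism are all hypothetical.

```python
# Illustrative sketch of a "simulator runtime" command loop. The channel
# abstraction and command set are hypothetical; the real driver talks to
# the TVM runtime through the simulator's memory-access mechanism.
import ctypes  # on a real POSIX system, ctypes.CDLL would dlopen modules


def run_driver(channel, loader=ctypes.CDLL):
    """Serve commands from the host until 'shutdown' is received."""
    modules = {}
    while True:
        cmd, payload = channel.recv()
        if cmd == "load_module":
            # Load a shared library and remember it by its path.
            modules[payload] = loader(payload)
            channel.send(("ok", payload))
        elif cmd == "run_function":
            # Look up a function in a previously loaded module and call it.
            mod_path, func_name, args = payload
            func = getattr(modules[mod_path], func_name)
            channel.send(("ok", func(*args)))
        elif cmd == "shutdown":
            channel.send(("ok", None))
            return
        else:
            channel.send(("error", "unknown command: " + cmd))
```

The real driver additionally has to handle the DeviceAPI calls (allocation, copies) and marshal arguments across the simulated-memory boundary, which this sketch omits.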
For running on hardware, the IDL libraries must be built first: the IDL definitions are processed by the IDL compiler (qaic, included in the Hexagon SDK), which produces a set of C sources. These sources must then be compiled with an Android toolchain (to build the “stub” library) and with the Hexagon toolchain (to build the “skel” library). The TVM runtime (runtime only) must then be compiled with the NDK compiler.
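As a rough outline, the hardware build looks like the following. The file names, options, and tool paths in angle brackets are placeholders, not verbatim commands; the exact invocations come from our build scripts and the SDK documentation:

```
# 1. Generate stub/skel C sources from the IDL definitions
#    (qaic ships with the Hexagon SDK; options are placeholders):
qaic <sdk-include-options> tvm_remote.idl

# 2. Compile the generated "stub" sources with the Android NDK
#    toolchain into the ARM-side shared library:
<ndk-clang> -shared <stub sources> -o libtvm_remote_stub.so

# 3. Compile the generated "skel" sources with the Hexagon
#    toolchain into the DSP-side shared library:
<hexagon-clang> -shared <skel sources> -o libtvm_remote_skel.so
```

The stub library is then linked into (or loaded by) the ARM-side TVM runtime, while the skel library is deployed to the device for the DSP to load.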
The Hexagon toolchain is based on Clang. In all builds we use libc++ (not libstdc++) as the C++ library implementation.
For either the simulator or hardware, we need to create a persistent object representing the state of the “executor”. Currently we do this via a global variable, but maybe there is a better way.
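For concreteness, one common alternative to a bare global is a lazily created, process-wide singleton behind an accessor function. This is a sketch of that pattern only; the class and function names are illustrative and not the actual TVM code (which is C++):

```python
# Sketch: lazily initialized process-wide "executor" state behind an
# accessor, instead of a bare global that is constructed eagerly.
# Names are illustrative, not the actual TVM implementation.
import threading


class HexagonExecutor:
    """Placeholder for the object holding simulator/FastRPC session state."""
    def __init__(self):
        self.loaded_modules = {}


_executor = None
_lock = threading.Lock()


def get_executor():
    """Create the executor on first use; return the same instance afterwards."""
    global _executor
    with _lock:
        if _executor is None:
            _executor = HexagonExecutor()
        return _executor
```

The accessor makes initialization order explicit and keeps the creation point in one place, which is one way to address the “better way than a global variable” question.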
Structure of the code
We introduced a new device type: kDLHexagon, and a new target “hexagon”.
Most of the code is in src/runtime/hexagon, except for the IDL definitions and build scripts, which we currently maintain in a separate repository. Ideally we’d like to have it all in the TVM repository, and we welcome feedback on how best to do this.
The minimum requirements for building TVM with Hexagon support depend on whether the execution environment is the simulator or hardware. For running on the simulator, Hexagon toolchain v8.3 is needed; for running on hardware, Hexagon SDK 3.4.3+ and Android NDK r19 are required. Note: Hexagon toolchain 8.3 is included in the Hexagon SDK.
Hexagon HVX (the vector engine) requires that all vectors be aligned in memory to a 128-byte boundary. Because of that, we have changed the following variables to 128:
We are looking for a better way to implement this than hardcoding the value.
Summary
We implemented Hexagon as a device (kDLHexagon), added the target “hexagon”, added basic codegen via LLVM, implemented a Hexagon runtime, and implemented a set of schedules.
Should we divide it up into smaller patches for review? Are there any suggestions as to what each patch should contain?
We support the Hexagon simulator and Android hardware as execution targets. Both require special steps when building TVM, and both can be assumed to require at least the Hexagon SDK to be installed ahead of time.
How do we incorporate it into the build system? Should we ask users to run something like “make hexagon-prepare” (to build the IDL and the simulator driver), or should all the steps be done with just “make”?