Motivation
The µTVM project can be thought of as two logical components that work together to execute models on device:
- A compiler that transforms Relay functions into a set of fused Relay operators, and then generates portable C functions to implement each group of fused operators. This is largely just the TVM compiler with a few modifications to target a minimal runtime.
- A minimal runtime compatible with bare-metal/RTOS environments.
To achieve its end goals, µTVM needs to be able to execute compiled Relay operators under two different workflows:
- Production workflow. The driver needs to be compiled into the device firmware and needs to allocate Tensor memory and invoke operator implementations in graph order. This workflow is not yet supported at HEAD, and there are a variety of implementation strategies that will be explored in the coming weeks.
- AutoTVM/evaluation workflow. An attached host machine can drive overall model execution for evaluation without writing complete firmware, or choose to invoke one operator at a time for AutoTVM. The host must be able to time operator execution for AutoTVM.
This RFC is concerned primarily with the AutoTVM/evaluation workflow, which is supported at HEAD today with substantial limitations. Currently, µTVM loads a small runtime into RAM, writes TVMArgs using GDB, populates a task list, and sets the device PC to the runtime entry point. This process can be invoked remotely on a TVM RPC Server by using the TVM Device API with a micro_dev context.
This strategy uses a very minimal on-device runtime; however, it has some drawbacks:
- Interrupts raised by the SoC aren’t handled and appear as timeouts. If the SoC enters an exception handler, it must be reset (sometimes a software reset is sufficient; other times a hard reset or board power cycle is necessary).
- The SoC needs to be configured by a program loaded in flash. Several features typically affect CPU performance: oscillator configuration, caches, and power modes, among others. Currently, the µTVM blogpost eval repo expects this mBED-based program to live in flash and execute on device startup to configure the SoC. However, this isn’t enforced or checked by TVM.
- For higher-bandwidth communication, device peripherals need to be configured. Drivers for these peripherals are typically written in C (rather than something usable from GDB) and expect to be able to use ISRs.
This RFC proposes to move the TVM RPC server onto the bare metal target, taking advantage of the RPC modularization PR and the tendency for embedded devices to contain stream-oriented peripherals. As an embedded device is generally smaller, some limitations will exist in the µTVM RPC Server:
- Only the C++ RPC Endpoint API will be exposed. Features that live behind PackedFuncs, such as RPC proxying, won’t necessarily be included.
- Dynamic code loading won’t be supported initially (but may be possible in a limited fashion in a future RFC)
- Some message length and tensor rank limits will be stricter than those on the full Python hosted runtime
The goals of the µTVM On-Device RPC server are to allow users to evaluate models and to run AutoTVM. A non-goal of the µTVM On-Device RPC server is to handle model deployment.
Approach
Breaking from the previous µTVM strategy, this RFC proposes that µTVM builds binary images meant to be placed in device flash like any other long-lived firmware. This means that the µTVM RPC server binary is responsible for the following (in a typical AutoTVM session):
- SoC initialization (e.g. oscillator configuration, cache setup, etc)
- Handling interrupts
- Transmitting and receiving RPC protocol data over some peripheral
- Running the RPC server and resulting remote-triggered code
- Timing execution of TVM functions
Code Organization
A µTVM RPC Server binary can be thought of in 3 parts:
- SoC Initialization, ISR handlers, and Device Drivers. In order to achieve reproducible results, the SoC needs to be configured from a known good state, e.g. from device reset. In some cases, a known good state is power-on, so this code needs to live in the SoC flash and be invoked directly from reset. This code is expected to live in repos outside the TVM repo, and should be configured per-device or per-project. The main() function exists here.
- TVM MinRPC Server and C Runtime. Supplied from the TVM repo and invoked by the code in part #1. Implements the TVM RPC server using the C Runtime.
- Compiled TVM model functions. Built per target and integrated as the System library.
Each piece is discussed in detail below.
SoC Initialization, ISR Handlers, and Device Drivers
This code is intended to be specific to the targeted development board. It can be based on anything from a printf("Hello, world!\n") demo to a fully-fledged RTOS; the requirements are:
- It needs to deterministically configure the SoC in terms of CPU performance
- It needs to facilitate UART-like communication over any peripheral the host can access (e.g. USB, Ethernet, semihosting).
- It needs to handle device ISRs and understand when the device has entered a bad state.
- It needs to provide memory for the µTVM RPC server to allocate function arguments and intermediate tensors.
This code does not live in the TVM repo, and is intended to just be referenced from autotuning scripts. Examples exist using mBED and the Zephyr RTOS.
As a secondary design goal, this code should be able to make third-party libraries available to the µTVM RPC Server as PackedFuncs. These may be used to validate preprocessing steps or capture data from an onboard sensor.
TVM MinRPC Server and C Runtime
The basic approach is to instantiate the MinRPC server, drive it using a message buffer, and use the MISRA-C runtime to handle the lower-level details of RPC calls. To facilitate this, some changes were necessary in the MISRA-C runtime (See “Changes to the MISRA-C Runtime”).
Compiled TVM Functions
This portion contains the SystemLib TVMModule instance, plus functions to register it as such with the runtime.
MinRPC Server Design
The MinRPC server uses a blocking strategy, which isn’t particularly friendly to microcontrollers without an RTOS, especially those with watchdog timers or other peripherals that need regular servicing. However, the TVM RPC protocol is a message-oriented protocol and each message begins with a length:
+---------------------------+
| Message Length (uint64_t) |
+---------------------------+
| Message Body |
+---------------------------+
This means that each message boundary is well-defined—so for the µTVM RPC server, an event-driven approach can be safely used as follows:
- A message buffer accumulates data until a full message has been received. This part is non-blocking as it doesn’t involve the MinRPC Server.
- MinRPCServer::ProcessOnePacket is invoked. Read() calls consume data from the message buffer. If Read() calls overrun the message buffer, it is a CHECK failure.
- The process repeats until MinRPC Server indicates it has shut down.
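The accumulate-then-dispatch loop above can be sketched on the host side in Python. This is a minimal illustration, not the proposed implementation: the names MessageBuffer, pump, and process_one_packet are hypothetical, and the sketch assumes the length prefix is a little-endian uint64 that counts only the bytes of the message body.

```python
import struct

# Assumption: the length prefix is a little-endian uint64 counting body bytes only.
LENGTH_PREFIX = struct.Struct("<Q")


class MessageBuffer:
    """Accumulates transport bytes until one complete RPC message is present."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = bytearray()

    def write(self, chunk: bytes) -> None:
        # Non-blocking: only store what the transport delivered.
        if len(self._data) + len(chunk) > self.capacity:
            raise OverflowError("message buffer overrun")
        self._data += chunk

    def pop_message(self):
        """Return one complete message body, or None if more data is needed."""
        if len(self._data) < LENGTH_PREFIX.size:
            return None
        (body_len,) = LENGTH_PREFIX.unpack_from(self._data)
        end = LENGTH_PREFIX.size + body_len
        if end > self.capacity:
            raise OverflowError("message larger than buffer")
        if len(self._data) < end:
            return None
        body = bytes(self._data[LENGTH_PREFIX.size:end])
        del self._data[:end]
        return body


def pump(chunks, buffer, process_one_packet):
    """Feed transport chunks in; hand each completed message to the server.

    process_one_packet is a hypothetical stand-in for the MinRPC entry point.
    """
    for chunk in chunks:
        buffer.write(chunk)
        body = buffer.pop_message()
        while body is not None:
            process_one_packet(body)
            body = buffer.pop_message()
```

Because the server is only entered once a whole message is buffered, its blocking Read() calls never have to wait on the transport.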
Framing and Session
MinRPC Server assumes that the underlying transport provides the properties of UNIX pipes or TCP. Some additional components are needed to provide these guarantees over a UART. Specifically, these challenges are faced:
- C1. The microcontroller’s CHECK failure strategy is to reset. This means that some wire protocol is needed for the µC to indicate that it has reset, even if only half of the previous message had been transmitted. This can be roughly thought of as a way to signal Connection Reset or Broken Pipe in a UNIX socket. However, details of CHECK failures can only be read after the microcontroller has rebooted, so there are some additional points to consider here.
- C2. Because the protocol is agnostic to the underlying transport, it needs to provide some level of error detection.
- C3. A design constraint of the transport is that it should use very little memory and code space, but should be able to receive buffers that are large as a percentage of on-device RAM (e.g. >50%). This means that implementations which expect to buffer messages while performing error detection will limit the RPC protocol on device. By contrast, µTVM doesn’t care if the payload is written to a large DLTensor before a CRC error is detected. While the blocking nature of MinRPC server currently limits this, any error detection should pass the payload through even if it may contain invalid data.
A Framing layer addresses parts of C1 and all of C2. The wire format of one message is as follows:
+----------------------------------+
| Packet Start Escape (0xff 0xfd) |
+----------------------------------+
| Packet Length Bytes (uint32_t) |
+----------------------------------+
| Payload |
+----------------------------------+
| CRC-16 (CCITT, little-endian) |
+----------------------------------+
An escape character (0xff) is used to start a framing layer control sequence. All fields (except the packet start field) need to be escaped on the wire. Control sequences are at most 2 bytes long, the second byte indicating the sequence. Possible values are:
- 0xff - Escaped 0xff (so, translate ff ff on the wire to a single ff of payload/length/CRC data)
- 0xfe - Nop. Used to signal device reset.
- 0xfd - Packet Start. Signals the beginning of a new packet. If a framing layer receives Packet Start while already decoding a packet, the packet being decoded is dropped.
While the RPC server is implemented using blocking Read() calls, there is also a maximum packet length value enforced.
The exact values used here might be adjusted, since 0xff is likely a fairly common byte in DLTensors.
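The framing rules above can be illustrated with a small Python encoder and streaming decoder. This is a sketch under several assumptions that the RFC does not pin down: the length field is little-endian, the CRC covers the unescaped length and payload bytes, and the CRC uses polynomial 0x1021 with initial value 0xFFFF (via binascii.crc_hqx). The names encode_frame and FrameDecoder are hypothetical. For clarity the decoder buffers the whole payload, which is simpler than the pass-through behavior that C3 asks for.

```python
import binascii
import struct

ESCAPE = 0xFF
NOP = 0xFE
PACKET_START = 0xFD
CRC_INIT = 0xFFFF  # assumption: the RFC only says "CRC-16 (CCITT)"


def encode_frame(payload: bytes) -> bytes:
    length = struct.pack("<I", len(payload))  # assumption: little-endian
    crc = binascii.crc_hqx(length + payload, CRC_INIT)
    body = length + payload + struct.pack("<H", crc)
    # Everything except the packet start sequence is escaped.
    return bytes([ESCAPE, PACKET_START]) + body.replace(b"\xff", b"\xff\xff")


class FrameDecoder:
    def __init__(self, max_payload: int = 16 * 1024):
        self.max_payload = max_payload
        self._buf = None       # None means "waiting for Packet Start"
        self._escaped = False  # True right after an unpaired 0xff

    def feed(self, data: bytes) -> list:
        """Consume wire bytes; return payloads of frames that passed the CRC."""
        frames = []
        for b in data:
            if self._escaped:
                self._escaped = False
                if b == PACKET_START:
                    self._buf = bytearray()  # drops any partial packet
                    continue
                if b == NOP:
                    continue
                if b != ESCAPE:
                    self._buf = None  # unknown control sequence: resync
                    continue
                # b is an escaped literal 0xff; fall through and store it.
            elif b == ESCAPE:
                self._escaped = True
                continue
            if self._buf is None:
                continue
            self._buf.append(b)
            payload = self._try_finish()
            if payload is not None:
                frames.append(payload)
        return frames

    def _try_finish(self):
        if len(self._buf) < 4:
            return None
        (n,) = struct.unpack_from("<I", self._buf)
        if n > self.max_payload:
            self._buf = None  # enforce the maximum packet length
            return None
        if len(self._buf) < 4 + n + 2:
            return None
        body = bytes(self._buf[:4 + n])
        (crc,) = struct.unpack_from("<H", self._buf, 4 + n)
        self._buf = None
        # Frames that fail the CRC are silently dropped in this sketch.
        return body[4:] if binascii.crc_hqx(body, CRC_INIT) == crc else None
```

For example, FrameDecoder().feed(encode_frame(b"\xff\x00")) yields [b"\xff\x00"], and a Packet Start arriving mid-frame discards the partial frame before decoding the new one.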
A Session layer handles out-of-band signaling and addresses the remainder of C1 and C3. Session Messages have the following structure:
+----------------------------+
| Message Type Code (1 byte) |
+----------------------------+
| Session ID (2 bytes) |
+----------------------------+
| Message Payload |
+----------------------------+
The following message types are supported:
- Session Start Init. Starts a new session. Either party to the link can send this message; the sending side is termed the initiator. This message contains the initiator’s nonce, which forms half of the session id. Should two Session Start Init messages be sent simultaneously, the message containing the numerically-lower nonce wins (the other message is ignored).
- Session Start Reply. Confirms the new session as started. The party sending this message is termed the responder. Contains the full session id to be used in subsequent traffic.
- Terminate Session. Contains no session id; invalidates any previously-established session. Devices should send this message after resetting, in case the other party is awaiting a reply.
- Log Message. Allows the device, which typically has no connected display, to asynchronously print diagnostic log messages on the host. Mostly helpful for debugging. Log messages are always sent with session id 0 and are valid regardless of whether a session is established.
- Normal Traffic. Standard µTVM RPC traffic. Each Session message contains exactly one TVM RPC message. The session id must match the session id established during the Session Start handshake.
Session Handshake
Before normal traffic can be exchanged, a session ID is established using a two-way handshake. Session IDs are 2 bytes: 1 byte populated by the initiator and 1 by the responder. The handshake is as follows:
Initiator:                               Responder:
+--------------------------+
| Type: Session Start Init |
+--------------------------+    --->
| I_Nonce 0x00             |
+--------------------------+
                                         +---------------------------+
                                         | Type: Session Start Reply |
                                <---     +---------------------------+
                                         | I_Nonce R_Nonce           |
                                         +---------------------------+
(session established, ID is {I_Nonce, R_Nonce})
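The handshake and the simultaneous-start rule can be sketched as a small state machine. This is only an illustration: the type codes, the byte order of the session id on the wire, and the SessionEndpoint class are all hypothetical, and the sketch assumes nonces are nonzero so that 0x00 can stand for "responder half not yet assigned".

```python
import struct

# Hypothetical type codes; the RFC does not assign numeric values.
START_INIT = 0
START_REPLY = 1

# Assumption: 1-byte type, then initiator nonce, then responder nonce.
HEADER = struct.Struct("<BBB")


class SessionEndpoint:
    def __init__(self, nonce: int):
        if not 1 <= nonce <= 255:
            raise ValueError("nonce must fit in one byte and be nonzero")
        self.nonce = nonce
        self.session_id = None  # (I_Nonce, R_Nonce) once established
        self._pending = False   # True while our own Start Init is outstanding

    def start(self) -> bytes:
        """Become the initiator: send our nonce with a zero responder half."""
        self._pending = True
        return HEADER.pack(START_INIT, self.nonce, 0)

    def receive(self, message: bytes):
        """Handle one session message; return a reply to send, or None."""
        mtype, i_nonce, r_nonce = HEADER.unpack_from(message)
        if mtype == START_INIT:
            if self._pending and self.nonce < i_nonce:
                return None  # simultaneous start: our lower nonce wins
            # Otherwise act as the responder and complete the session id.
            self._pending = False
            self.session_id = (i_nonce, self.nonce)
            return HEADER.pack(START_REPLY, i_nonce, self.nonce)
        if mtype == START_REPLY and self._pending and i_nonce == self.nonce:
            self._pending = False
            self.session_id = (i_nonce, r_nonce)
        return None
```

With a host at nonce 0x12 and a device at nonce 0x34, passing host.start() to device.receive() and the reply back to host.receive() leaves both sides with session id (0x12, 0x34).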
Session Termination
When a Terminate Session message is received, the receiving party should assume that the sender has lost all state. The proposed PR raises an exception back to Python in this case.
Long Messages
µTVM RPC server faces a somewhat unique challenge in that some messages (e.g. CopyToRemote) may have very large payloads relative to the amount of available memory. At present, the proposed implementation can’t receive messages like this; however, a future PR could rewrite MinRPCServer to handle the message header and payload separately. Then, CopyToRemote could progressively write the payload directly to the allocated tensor space in a zero-copy fashion.
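The split between header and payload handling described above might look roughly like the following. This is a speculative sketch of the future direction, not part of the proposed PR: the 8-byte header layout (destination offset and payload size) and the StreamingCopyReceiver name are invented for illustration and do not reflect the real CopyToRemote message format.

```python
import struct

# Hypothetical header: destination byte offset, then payload size in bytes.
HEADER = struct.Struct("<II")


class StreamingCopyReceiver:
    """Parses a small header, then streams the payload into tensor memory."""

    def __init__(self, tensor_memory: bytearray):
        self._mem = memoryview(tensor_memory)
        self._header = bytearray()
        self._cursor = None  # write position; None until the header is parsed
        self._remaining = 0

    def feed(self, chunk: bytes) -> bool:
        """Consume one transport chunk; return True once the copy is complete."""
        view = memoryview(chunk)
        if self._cursor is None:
            need = HEADER.size - len(self._header)
            self._header += view[:need]
            view = view[need:]
            if len(self._header) < HEADER.size:
                return False
            offset, nbytes = HEADER.unpack(self._header)
            if offset + nbytes > len(self._mem):
                raise ValueError("copy would overrun tensor memory")
            self._cursor, self._remaining = offset, nbytes
        # Only the header is buffered; payload bytes go straight to the tensor.
        n = min(len(view), self._remaining)
        self._mem[self._cursor:self._cursor + n] = view[:n]
        self._cursor += n
        self._remaining -= n
        return self._remaining == 0
```

The only intermediate storage is the fixed-size header, so the largest receivable payload is bounded by the tensor allocation and not by the message buffer.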
Testing
Initially, testing will be done by compiling a µTVM RPC server targeted to the host machine, invoking it as a subprocess, and using stdin/stdout as the transport pipes. Most blackbox testing should be achievable in this way. To catch cross-compilation errors, a QEMU-based M3 target could be used.
Some additional unit testing is done using googletest; this could also be ported to a target to validate it. However, this is somewhat more involved, so it isn’t done in the PR yet.
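The subprocess-over-pipes arrangement can be sketched as below. Since the host-compiled µTVM RPC server binary is not available here, the sketch substitutes a tiny Python echo program for it; STAND_IN_SERVER and round_trip are hypothetical names, and the little-endian uint64 length prefix is an assumption.

```python
import struct
import subprocess
import sys

# Stand-in for the host-compiled server: echoes each length-prefixed message.
STAND_IN_SERVER = r"""
import struct, sys
rx, tx = sys.stdin.buffer, sys.stdout.buffer
while True:
    header = rx.read(8)
    if len(header) < 8:
        break
    (n,) = struct.unpack("<Q", header)
    tx.write(header + rx.read(n))
    tx.flush()
"""


def round_trip(messages):
    """Send each message over the child's stdin and read replies from stdout."""
    replies = []
    with subprocess.Popen(
        [sys.executable, "-c", STAND_IN_SERVER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    ) as proc:
        for body in messages:
            proc.stdin.write(struct.pack("<Q", len(body)) + body)
            proc.stdin.flush()
            (n,) = struct.unpack("<Q", proc.stdout.read(8))
            replies.append(proc.stdout.read(n))
    # Leaving the with-block closes the pipes, so the child sees EOF and exits.
    return replies
```

A real test would replace the stand-in command line with the path to the compiled server binary and layer the framing and session handling on top of the same two pipes.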
Points for Discussion
- Is the CRC layer adequate given packet sizes?
  - Use a 16-bit CRC as done here and add an explicit packet length limit of around 16K. Tensors longer than 16K, and modules (if loadable modules are implemented in the future to alleviate flash stress), will need to be split into multiple messages.
  - Use a 32-bit CRC, which will take more flash space and/or longer to execute, but allow longer packets.