[µTVM] microTVM M2 Roadmap

areusch · January 8, 2021, 10:36pm

Background

This document proposes a roadmap towards microTVM M2 (milestone 2). It aims to collect all of the µTVM thinking done across various forum threads, conversations, and from TVMConf. There are no specific deadlines included here—this document mainly serves to prioritize and group a related set of tasks towards some medium-term goals. That said, this proposal is written expecting the work to span roughly H1 2021.

Looking Back

Goals from the previous Standalone µTVM Roadmap are shown here along with some summary points:

Test simple models on supported hardware without writing any microcontroller code

It’s possible to run some models on supported hardware so long as they use a small subset of operators and those models fit into working memory
It’s difficult to get a list of supported operators
It’s difficult to decide whether a model fits into memory without directly examining the firmware binary
TVM drives the compilation flow, which can make debugging difficult

Easily package a tested model (from #1) into a C library

Compiling a Relay function under µTVM produces a C library
The TVM compilation flow doesn’t diverge from the standard TVM c flow
Public APIs to export the C library are not obvious/well-documented, but you can save the generated library
Compilation flags for the generated code need to be inferred
Libraries with external codegen are not supported yet

Easily autotune supported operators without having to write too much TVM code beyond the model definition

This is possible, but definitely not easy.
AutoTVM bridge for µTVM is still not checked in
The number of tunable schedules for supported targets is small

M2 Goals

This roadmap aims to enable these end-user goals:

Support more runtime environments

µTVM currently explicitly supports a fairly small set of runtime environments, even though its design encourages portability. µTVM is constrained by:
1. excessive memory and flash overhead
2. lack of schedule support, both in terms of operators and in terms of ISAs
3. lack of support for multi-core and accelerators

µTVM command-line tools (tvmc)

Developers should be able to use µTVM as a tool rather than configuring µTVM to drive their development flow. µTVM examples, which are how µTVM can be used today, are effectively end-to-end tests: all of the logic needed to compile, flash, and run a µTVM model is in Python. By contrast, a typical DNN deployment project will have its own build and flash flow. µTVM should be accessible as a tool that can be integrated into a build process, as in apps/bundle_deploy, which calls TVM from a Makefile. However, we should aim higher than this and provide support for typical µTVM actions such as model compilation, scheduling, and remote execution from tvmc.

Make it easy to characterize model footprint

It’s still difficult for developers to ask basic questions such as “how much Flash and RAM will my model require?” without manually examining the firmware or running inference on-device. At least for the RAM, and for Flash with the llvm backend, it should be possible for TVM to answer these questions.

Projects

We think these projects are the right ones to pursue in order to enable these goals. More detail for each of these projects will be given in RFCs to follow this one. Projects are listed roughly in the order we propose they be done (but most of these don’t conflict with one another, so given extra bandwidth they could be tackled at any point in time).

Library Generator

Description: Build a standard on-disk format for µTVM artifacts, including c- or llvm-generated operators, graph-level information (simplified parameters, graph JSON or, e.g., an Ahead-of-Time runtime module), and generated artifacts from BYOC.

Rationale: This work is toward Goal #2 (µTVM command-line tools), making µTVM more useful as a tool. As discussed tangentially in this RFC, the output of µTVM varies considerably depending on whether the c or llvm backend is used and whether BYOC flows are used. Additionally, there is no standard describing how to store the simplified parameters and graph JSON for µTVM. As we further enhance the µTVM compiler and produce more outputs, some standard structure is needed to make them more consumable.
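
To make the idea of a standard on-disk format a bit more concrete, here is a minimal sketch of one possible layout: a tar archive with a JSON manifest at the top level. Everything in it is an assumption made for illustration (the directory names, the `metadata.json` file, its keys, and the helper functions); the actual format is left to the RFC for this project.

```python
import io
import json
import tarfile

# Hypothetical layout, for illustration only:
#   metadata.json      -> describes the model and the target it was built for
#   codegen/host/src/  -> C sources from the c backend (or objects from llvm)
#   parameters/        -> simplified parameters
#   runtime-config/    -> graph JSON (or an AoT module description)


def write_artifact(fileobj, metadata, files):
    """Pack metadata and a {path: bytes} mapping into a tar archive."""
    entries = dict(files)
    entries["metadata.json"] = json.dumps(metadata, indent=2, sort_keys=True).encode()
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        # Sorted order keeps the archive contents deterministic.
        for path in sorted(entries):
            data = entries[path]
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def read_metadata(fileobj):
    """Read back only the manifest, which is what a build system needs first."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        member = tar.extractfile("metadata.json")
        return json.loads(member.read().decode())


def build_example():
    """Build an in-memory example archive with placeholder contents."""
    buf = io.BytesIO()
    write_artifact(
        buf,
        {"model_name": "example", "target": "c", "format_version": 1},
        {
            "codegen/host/src/lib0.c": b"/* generated operators */\n",
            "parameters/example.params": b"\x00\x01",
            "runtime-config/graph/graph.json": b"{}",
        },
    )
    buf.seek(0)
    return buf
```

With a manifest like this, a Makefile or other build system could inspect `read_metadata(...)` to decide which sources to compile, without importing TVM at all.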

Autotuning support

Description: Left over from the previous milestone. Merge the final PR to support autotuning on main.

Ahead-of-Time Runtime

Description: Implement an Ahead-of-Time compiler per the AoT RFC.

Rationale: The overhead due to the GraphRuntime and the way inference memory is handled is a significant barrier to deployment. These overheads exclude µTVM from a lot of deployment scenarios where it should be an option. As an example, although the current TVM C Runtime does not require the system malloc, in practice a system malloc is needed on our system solely due to the overhead of the supplied memory allocator.
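
As a rough illustration of what "ahead-of-time" means here, the sketch below turns a tiny graph description into C source that calls each operator in order and uses static arrays for intermediate buffers, so nothing has to parse graph JSON or allocate memory on the device. This is not the design from the AoT RFC: the node format, the operator names, and the simplified operator signature are all hypothetical, and the sketch assumes the first node reads a buffer named `input` and the last node writes to the caller's output buffer.

```python
def emit_aot_c(graph, entry_name="run_model"):
    """Emit C source that runs `graph` as straight-line operator calls.

    `graph` is a list of nodes in topological order. Each node is a dict with
    hypothetical keys: "op" (name of a generated operator function), "inputs"
    (names of buffers read), "output" (name of the buffer written), and
    "size" (output size in bytes).
    """
    lines = ["#include <stdint.h>", ""]

    # Declare each operator once as an external function produced by codegen.
    declared = set()
    for node in graph:
        if node["op"] in declared:
            continue
        declared.add(node["op"])
        args = ", ".join(["const uint8_t*"] * len(node["inputs"]) + ["uint8_t*"])
        lines.append(f"extern int32_t {node['op']}({args});")
    lines.append("")

    # Intermediate buffers become static arrays: no allocator at runtime.
    # The last node writes into the caller-provided output buffer instead.
    for node in graph[:-1]:
        lines.append(f"static uint8_t {node['output']}[{node['size']}];")
    lines.append("")

    final = graph[-1]["output"]
    lines.append(f"int32_t {entry_name}(const uint8_t* input, uint8_t* {final}) {{")
    for node in graph:
        call_args = ", ".join(node["inputs"] + [node["output"]])
        lines.append(f"  if ({node['op']}({call_args}) != 0) return -1;")
    lines.append("  return 0;")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical three-operator model used only to exercise the generator.
EXAMPLE_GRAPH = [
    {"op": "fused_conv2d", "inputs": ["input"], "output": "buf0", "size": 1024},
    {"op": "fused_relu", "inputs": ["buf0"], "output": "buf1", "size": 1024},
    {"op": "fused_dense", "inputs": ["buf1"], "output": "output", "size": 10},
]
```

Calling `emit_aot_c(EXAMPLE_GRAPH)` returns a string of C source whose `run_model` function contains one call per operator. The point of the sketch is only that the graph traversal happens at compile time rather than on the device.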

RISC-V ISA support

Description: Demonstrate support on another target_host ISA, RISC-V. Add at least one operator schedule. Finally, define a standard way to identify when hardware intrinsics are available for a given ISA, on both ARM and RISC-V targets.

Rationale: In order to remain a flexible deployment framework, TVM should ensure that it supports multiple ISAs. Currently, ARM v8 and v7-M are the only two ISAs with any specific schedules written for them.

Comprehensive memory planner

Description: Right now the TVM memory planner can only consider operator input and output tensors. It largely ignores scratchpad space, and has no concept of memory regions. We propose to pull all of these concepts into the TVM memory planner and make the memory planner extensible/replaceable should a project need to write custom memory planning rules.

Rationale: In a confined microcontroller environment, these limitations are particularly punishing. Further, these limitations are the main reason it’s difficult to produce a peak memory estimate today. Finally, all memory is dynamically allocated today, even if not from the system malloc. The overhead imposed by this dynamic memory allocator is large and needs to be eliminated to make µTVM an attractive deployment solution on embedded µC.
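
For readers unfamiliar with this kind of planning, the following is a minimal sketch of a greedy, liveness-based planner that treats scratchpad buffers like any other buffer and keeps separate offsets per memory region. It is not TVM's memory planner; the `Buffer` fields, the region labels, and the greedy-by-size heuristic are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    name: str
    size: int             # bytes
    first_use: int        # index of the first operator that touches the buffer
    last_use: int         # index of the last operator that touches the buffer
    region: str = "sram"  # hypothetical memory-region label


def lifetimes_overlap(a, b):
    return a.first_use <= b.last_use and b.first_use <= a.last_use


def plan_memory(buffers, align=4):
    """Greedy-by-size planner.

    Returns ({buffer name: offset}, {region: bytes needed}).
    """
    offsets = {}
    placed = []
    peak = {}
    # Placing large buffers first tends to leave fewer unusable gaps.
    for buf in sorted(buffers, key=lambda b: (-b.size, b.name)):
        # Address ranges to avoid: same region and overlapping lifetime.
        busy = sorted(
            (offsets[p.name], offsets[p.name] + p.size)
            for p in placed
            if p.region == buf.region and lifetimes_overlap(p, buf)
        )
        offset = 0
        for start, end in busy:
            if offset + buf.size <= start:
                break  # the buffer fits in the gap before this busy range
            # Otherwise skip past the busy range, rounding up to `align`.
            offset = max(offset, -(-end // align) * align)
        offsets[buf.name] = offset
        placed.append(buf)
        peak[buf.region] = max(peak.get(buf.region, 0), offset + buf.size)
    return offsets, peak


# Hypothetical model: three operator outputs plus one scratchpad buffer that
# lives in a separate, faster region.
EXAMPLE_BUFFERS = [
    Buffer("conv_out", 1024, first_use=0, last_use=1),
    Buffer("relu_out", 1024, first_use=1, last_use=2),
    Buffer("relu_scratch", 512, first_use=1, last_use=1, region="dtcm"),
    Buffer("dense_out", 256, first_use=2, last_use=3),
]
```

In this example `dense_out` reuses offset 0 because `conv_out` is dead by the time it is written, so the `sram` region needs 2048 bytes rather than the 2304 bytes a one-buffer-per-tensor layout would take. The per-region totals are also exactly the peak memory estimate that is hard to produce today.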

Project-level API (RFC)

Description: The TVM compilation flow (tvm.micro.Compiler and tvm.micro.Flasher, implemented for AutoTVM) is fairly tightly bound to the concepts of producing firmware binaries and libraries. We propose a new, project-centric API that allows for a smaller set of functions with fairly minimal inputs that roughly match those of a normal firmware development flow (codegen, build, flash, debug, execute).

Rationale: Implementing a tvm.micro.Compiler is too burdensome and doesn’t provide the benefit it was intended to. Further, the Flasher interface means that TVM needs to describe all of the artifacts from a project needed to flash a device. Rewriting the interface around typical firmware development actions will make it easier to implement this interface and de-couple TVM from the projects it generates. It will also allow the AutoTVM build flow to match the deployment build flow more closely, preventing misconfigurations.
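
One way to picture a project-centric interface is as a small abstract class with one method per firmware development action. The sketch below is purely illustrative: the class name, method names, and argument lists are hypothetical and only mirror the five actions listed in the Description; the real interface is to be defined in the RFC.

```python
import abc


class ProjectAPI(abc.ABC):
    """Hypothetical interface: one method per firmware development action."""

    @abc.abstractmethod
    def codegen(self, model_library_path, project_dir):
        """Generate a firmware project around a compiled model library."""

    @abc.abstractmethod
    def build(self, project_dir):
        """Build the project with its own native build system."""

    @abc.abstractmethod
    def flash(self, project_dir):
        """Program the attached device with the built firmware."""

    @abc.abstractmethod
    def debug(self, project_dir):
        """Launch a debugger attached to the device."""

    @abc.abstractmethod
    def execute(self, project_dir, input_bytes):
        """Run one inference on the device and return the raw output."""


class RecordingProject(ProjectAPI):
    """Stand-in implementation that only records which steps were called."""

    def __init__(self):
        self.calls = []

    def codegen(self, model_library_path, project_dir):
        self.calls.append(("codegen", model_library_path, project_dir))

    def build(self, project_dir):
        self.calls.append(("build", project_dir))

    def flash(self, project_dir):
        self.calls.append(("flash", project_dir))

    def debug(self, project_dir):
        self.calls.append(("debug", project_dir))

    def execute(self, project_dir, input_bytes):
        self.calls.append(("execute", project_dir))
        return input_bytes  # placeholder: echo the input back


def deploy_and_run(api, model_library_path, project_dir, input_bytes):
    """The same sequence could serve both AutoTVM and a deployment build."""
    api.codegen(model_library_path, project_dir)
    api.build(project_dir)
    api.flash(project_dir)
    return api.execute(project_dir, input_bytes)
```

Because the driver only sees these five actions, TVM does not need to know which artifacts a given platform produces or how they reach the device.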

tvmc integration

Description: In µTVM, the entire compile/debug process is driven end-to-end by TVM. We propose to split this process into pieces that can be invoked on their own: TVM compilation, firmware compilation, model execution, debugging. We also propose that the firmware compilation piece does not need to be driven by TVM (the user can compile the firmware on their own), and that it should be easy to drive model execution and debugging without configuring TVM to flash the target.

Rationale: µTVM examples right now are essentially end-to-end tests: all of the logic needed to run a µTVM model is in Python. While they do serve as a helpful reference to ensure µTVM is functional, they don’t correspond to a typical development flow. Further, debugging each piece individually is difficult because users need to write scripts to invoke them. Integrating with a user-facing endpoint such as tvmc would give those entry points a home in the TVM codebase.

Pinned Memory

Description: For tensors, produce a memory plan pinning the memory allocations for a given model inference as offsets or absolute addresses. Service all TVMDeviceAllocWorkspace and graph-level tensor allocations using the plan.

Rationale: Although the current TVM C Runtime does not require the system malloc, in practice malloc is needed because the supplied dynamic memory allocator has significant overhead. This overhead is unnecessary in many cases given that the memory footprint for most models is known in advance.
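
A minimal sketch of what servicing allocations from a plan could look like, modeled in Python rather than in the C runtime: the arena is one fixed block, and every allocation request is answered by looking up a precomputed offset instead of searching a free list. The class, its method names, and the plan format (`name -> (offset, size)`) are hypothetical.

```python
class PinnedArena:
    """Serves allocations from a fixed arena using a precomputed plan."""

    def __init__(self, plan, arena_size):
        # `plan` maps a buffer name to (offset, size) and would be produced at
        # compile time. Validate it once, up front, rather than on each call.
        for name, (offset, size) in plan.items():
            if offset < 0 or offset + size > arena_size:
                raise ValueError(f"{name} does not fit in the arena")
        self._plan = dict(plan)
        self._arena = bytearray(arena_size)

    def alloc(self, name, size):
        """Return a writable view at the pinned offset for `name`."""
        offset, planned_size = self._plan[name]
        if size > planned_size:
            raise MemoryError(f"{name}: requested {size}, planned {planned_size}")
        return memoryview(self._arena)[offset:offset + size]

    def free(self, name):
        """Nothing to release: lifetimes were resolved when the plan was made."""
        return None


# Hypothetical plan: `dense_out` reuses the bytes of `conv_out`, which is no
# longer live by the time `dense_out` is written.
EXAMPLE_PLAN = {
    "conv_out": (0, 1024),
    "relu_out": (1024, 1024),
    "dense_out": (0, 256),
}
```

Usage is `arena = PinnedArena(EXAMPLE_PLAN, arena_size=2048)` followed by `arena.alloc("conv_out", 1024)`. Allocation becomes a table lookup with no bookkeeping overhead, which is the property the Rationale is after.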

Footprint Estimation in tvmc

Description: Add additional analytics passes and output an estimate of the flash footprint (when using the llvm backend) and RAM footprint in tvmc.

Rationale: It is still difficult to answer basic questions such as: a) how much memory will my model take to execute? and b) which operators are supported? While examining the firmware binary will always be the ultimate source of truth, tvmc should be able to produce a fairly accurate estimate.
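
The arithmetic behind such an estimate can be sketched in a few lines. The function below is an illustration, not a proposed tvmc feature: it assumes flash is roughly code size plus parameter bytes, and that RAM is at least the largest sum of activation tensors live at any one operator. The input format and names are hypothetical, and the RAM figure is a lower bound because it ignores scratchpad space, stack, and fragmentation.

```python
import math

DTYPE_BYTES = {"int8": 1, "int16": 2, "int32": 4, "float32": 4}


def tensor_bytes(shape, dtype):
    return math.prod(shape) * DTYPE_BYTES[dtype]


def estimate_footprint(params, activations, code_bytes=0):
    """Estimate flash and RAM use.

    params:      list of (shape, dtype) for constant tensors stored in flash
    activations: list of (shape, dtype, first_use, last_use), where the use
                 indices are operator positions in execution order
    code_bytes:  size of the generated operator code, if known
    """
    flash = code_bytes + sum(tensor_bytes(s, d) for s, d in params)

    steps = max((last for *_, last in activations), default=-1) + 1
    ram = 0
    for step in range(steps):
        # Sum every activation whose lifetime covers this operator.
        live = sum(
            tensor_bytes(s, d)
            for s, d, first, last in activations
            if first <= step <= last
        )
        ram = max(ram, live)
    return {"flash_bytes": flash, "ram_bytes": ram}


# Hypothetical small int8 model: conv2d -> global pool -> dense.
EXAMPLE_PARAMS = [
    ((16, 3, 3, 3), "int8"),   # conv weights
    ((16,), "int32"),          # conv bias
    ((10, 16), "int8"),        # dense weights
]
EXAMPLE_ACTIVATIONS = [
    ((1, 3, 32, 32), "int8", 0, 0),    # input image
    ((1, 16, 30, 30), "int8", 0, 1),   # conv output
    ((1, 16), "int8", 1, 2),           # pooled features
    ((1, 10), "int8", 2, 2),           # logits
]
```

For this example, `estimate_footprint(EXAMPLE_PARAMS, EXAMPLE_ACTIVATIONS, code_bytes=2048)` gives 2704 bytes of flash and 17472 bytes of RAM, with the RAM peak occurring while the convolution's input and output are both live.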

AutoScheduling

Description: Implement at least one operator schedule using AutoScheduler on any supported ISA.

Rationale: New enhancements to TVM such as AutoScheduling should make it more tractable to write cross-device schedules. This project is in service of Goal #1 (support more runtime environments). It aims to make it easier to achieve good performance on new SoCs/ISAs.

Support multi-core and accelerator-based inference

Description: Explicitly support heterogeneous and multi-core runtime environments. Add support for the TVM APIs that enable heterogeneous and multi-core execution, and make any other changes needed to support accelerators (i.e. support external codegen). Add additional schedules to improve performance on other operators.

Rationale: Accelerators and multi-CPU environments are becoming increasingly common on embedded µC.

Next Steps

This document is a proposal to the community, and feedback is welcome. This thread is a great place to discuss the overall direction, as well as any specific features or goals that are missing or that you’d like to see prioritized.

We intend to drive some of this work from OctoML, but there are a lot of tasks and there are plenty of ways to get involved.

Each of the projects listed here is typically implemented in phases:

Proof-of-Concept/RFC
PR reviews
Documentation updates

If you want to work on one of them, it’s best to post an RFC/PoC to begin a discussion around it. There are also plenty of ways to make smaller contributions, including reviewing RFCs and PRs, adding polish and smaller features after each project lands, fixing bugs, and updating documentation.