[RFC] UMA: Universal Modular Accelerator Interface

Feature Name: Universal Modular Accelerator Interface (UMA)
Start Date: 2022 February
Authors: 
  Paul Palomero Bernardo @paulpb, Christoph Gerum @cgerum - University of Tübingen
  Michael J. Klaiber @mjklaiber, Ingo Feldner - Bosch Research
  Philipp van Kempen @philippvk, Rafael Stahl @r.stahl, Daniel Müller-Gritschneder - Technical University of Munich
  Johannes Partzsch - TU Dresden
  Andrew Stevens - Infineon Technologies
RFC PR: https://github.com/apache/tvm-rfcs/pull/60
GitHub Issue: TBD

Summary

The goal of UMA (Universal Modular Accelerator Interface) is to create a unified infrastructure for easily integrating external accelerators into TVM. UMA provides file structures, Python interface classes and an API for accelerator integration. These interfaces and the API are accessible from Python and are part of the components UMA Partitioner, UMA Lower and UMA Codegen. The features and proposals of Target registered compiler flow customization [TVM-RFC0011] and [TVM-RFC0010] are considered, with the difference that UMA tries to provide a more general interface for integrating new accelerators and one specific implementation of the hooks described in [TVM-RFC0011].


Image Source: Uma Thurman is The Bride | Pictured: Uma Thurman in a scene … | Flickr under CC BY-NC-ND 2.0

Update (2022-03-07):

The text below is the initial version of this pre-RFC. The changes resulting from this discussion thread can be found here:

Motivation

A number of accelerators have already been integrated into TVM, e.g. VTA and ARM EthosU. These are similar in both the structure of their build flow and the operations that they can offload. Nonetheless, due to incremental independent development, the TVM interfaces and processing steps they use are quite different, with little commonality. A consistent, unified infrastructure would simplify accelerator integration, making it accessible to smaller, hardware-focused development teams.

Focus

UMA’s primary objective is to enable straightforward TVM integration of loosely-coupled accelerators that are controlled by a processor or microcontroller. These are accelerators capable of executing complete tensor operations or operation graphs without host processor intervention. Secondary objectives are:

  • Support for closely-coupled accelerators (those that offload parts of the CPU computation for significant elements of tensor operations)
  • Compatibility with both run-time and ahead-of-time compilation
  • Support for heterogeneous execution utilizing accelerators optimized for specific operations or data types

Accelerator support or optimization functions outside the scope of UMA are:

  • Parallel execution on multi-accelerator architectures (to be handled by executor/run-time and customized layer splitting)
  • Real-time execution (to be handled by executor/run-time)
  • High-level support for parameter conversion like quantization or sparsity exploitation (to be realized via model pre-processing or in accelerator backends)

Reference-level explanation

Flow description

The figure below describes the UMA interface from a top level. An Accelerator Partitioner, which is a specialization of the UMA Partitioner, takes the Relay graph and matches supported and unsupported operators. Unsupported operators are processed with the default TVM flow. Supported operators are processed with the UMA Pipeline. The tasks and functionality of each block in the figure are described in the following:

UMA Partitioning:

  • Register relay passes
  • Register patterns - supported sub-graph operations
  • Order: pre-partitioning passes, Graph partitioning, post-partitioning passes
  • UMAPartitioner baseclass (Python only) has to be inherited by accelerator-specific Partitioners (e.g. Accelerator A Partitioner, etc)

The figure below describes the UMA Pipeline. Its blocks are described below:

UMA Pipelining:

  • Consists of UMALower and UMACodegen, which implement the target hooks Relay-to-TIR and TIR-to-Runtime (proposed in [TVM-RFC0010])
  • UMALower
    • Input: Partitioned composite functions
    • Custom primitives can be registered
    • Lowering from Relay to S-TIR, using TOPI or custom primitives
    • Interface for registering accelerator-specific schedules and passes
    • Execution of UMA schedules and passes on S-TIR
    • Output: NS-TIR (including tir.extern calls)
    • UMALower baseclass (Python only) has to be inherited by accelerator-specific Lower classes (e.g. Accelerator A Lower, etc)
  • UMACodegen
    • Input: NS-TIR (including tir.extern calls)
    • Defaults to standard TVM codegen
    • Intent is to provide a Python interface to insert/emit target code
    • UMACodegen baseclass has to be inherited by accelerator-specific Codegen classes (e.g. Accelerator A Codegen, etc)
    • Output: Target .c files

The intention is to use TensorIR and Relax with MetaScheduler for optimization.

Abbreviations: S-TIR: Schedulable TIR; NS-TIR: Non-Schedulable TIR

File and class structure and Snippets as example for integration

UMA provides a mostly Python-based API. On the C++ side, new targets are registered using target hooks (RFC #0010). A generic codegen.cc handles the calls to the Python side.

.
├── codegen.cc
└── targets.cc
TVM_REGISTER_TARGET_KIND("accelerator_A", kDLCPU)
    .set_attr<FTVMRelayToTIR>("RelayToTIR", relay::contrib::generic::RelayToTIR("accelerator_A"))
    .set_attr<FTVMTIRToRuntime>("TIRToRuntime", relay::contrib::generic::accelerator_A::TIRToRuntime);

TVM_REGISTER_TARGET_KIND("accelerator_B", kDLCPU)
    .set_attr<FTVMRelayToTIR>("RelayToTIR", relay::contrib::generic::RelayToTIR("accelerator_B"))
    .set_attr<FTVMTIRToRuntime>("TIRToRuntime", relay::contrib::generic::accelerator_B::TIRToRuntime);

The Python API is structured as shown below. Two base classes form the core API: UMAPartitioner, for Relay graph partitioning and modification, and UMALower, for lowering from Relay to TIR. New custom accelerators are added in subdirectories by inheriting from these two base classes.

.
├── partitioner.py
├── lower.py
├── utils.py
├── accelerator_A
│   ├── partitioner.py
│   ├── lower.py
│   ├── passes.py
│   ├── patterns.py
│   └── schedules.py
└── accelerator_B
    └── ...

The UMAPartitioner base class performs a target-specific Relay graph partitioning. New custom accelerators can control this process by registering supported patterns and Relay passes using the provided API.

class MyCustomAcceleratorPartitioner(UMAPartitioner):
    @property
    def target_name(self):
        return "my_custom_accelerator"

    def _register_patterns(self):
        self._register_pattern("conv1d_relu", conv1d_relu_pattern())
    
    def _register_relay_passes(self):
        self._register_relay_pass(1, ConfigGenerator())
        self._register_relay_pass(2, BufferScopeAnnotator())
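To make the registration hooks above easier to picture, here is a minimal, self-contained sketch of how such a base class could collect patterns and passes. It is not the TVM implementation: the `UMAPartitioner` below is a toy stand-in, patterns are plain tuples, passes are plain callables on a list, and treating the integer argument as an ordering key is an assumption, since the post does not define its meaning. `ToyPartitioner`, `ordered_passes` and `run_relay_passes` are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, List, Tuple


class UMAPartitioner(ABC):
    """Toy stand-in for the proposed base class (not the real TVM code)."""

    def __init__(self) -> None:
        self._patterns: List[Tuple[str, Any]] = []
        self._relay_passes: List[Tuple[int, Callable[[Any], Any]]] = []
        # Subclasses fill both registries through the hooks below.
        self._register_patterns()
        self._register_relay_passes()

    @property
    @abstractmethod
    def target_name(self) -> str:
        ...

    @abstractmethod
    def _register_patterns(self) -> None:
        ...

    @abstractmethod
    def _register_relay_passes(self) -> None:
        ...

    def _register_pattern(self, name: str, pattern: Any) -> None:
        self._patterns.append((name, pattern))

    def _register_relay_pass(self, phase: int, relay_pass: Callable[[Any], Any]) -> None:
        # Assumption: the integer is a phase index that decides execution order.
        self._relay_passes.append((phase, relay_pass))

    def ordered_passes(self) -> List[Callable[[Any], Any]]:
        # sorted() is stable, so passes sharing a phase keep registration order.
        return [p for _, p in sorted(self._relay_passes, key=lambda item: item[0])]


class ToyPartitioner(UMAPartitioner):
    @property
    def target_name(self) -> str:
        return "toy_accelerator"

    def _register_patterns(self) -> None:
        self._register_pattern("conv1d_relu", ("conv1d", "relu"))

    def _register_relay_passes(self) -> None:
        # Registered out of order on purpose; the phase decides the order.
        self._register_relay_pass(2, lambda mod: mod + ["annotate_buffer_scopes"])
        self._register_relay_pass(1, lambda mod: mod + ["generate_config"])


def run_relay_passes(partitioner: UMAPartitioner, mod: list) -> list:
    for relay_pass in partitioner.ordered_passes():
        mod = relay_pass(mod)
    return mod


toy = ToyPartitioner()
trace = run_relay_passes(toy, [])
print(toy._patterns)
print(trace)
```

The point of the design is that an accelerator author only overrides the two `_register_*` hooks, while ordering and execution stay in the shared base class.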

The UMALower base class performs the lowering from Relay to TIR. New custom accelerators can control this process by registering custom schedules and TIR passes using the provided API.

class MyCustomAcceleratorLower(UMALower):
    def __init__(self):
        super().__init__()

    def _register_tir_schedules(self):
        self._register_tir_schedule(insert_extern_calls)

    def _register_tir_passes(self):
        self._register_tir_pass(0, GenerateConstants())
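The stage order described for UMALower (schedules on S-TIR first, then passes, with NS-TIR as output) can be sketched as a toy model. Nothing here touches TVM: `ToyPrimFunc`, `lower` and the `schedulable` flag are hypothetical names for illustration, and the two registered callables only record that they ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyPrimFunc:
    """Toy stand-in for a TIR function; not a TVM type."""
    name: str
    schedulable: bool = True  # True models S-TIR, False models NS-TIR
    log: List[str] = field(default_factory=list)


Transform = Callable[[ToyPrimFunc], ToyPrimFunc]


def lower(func: ToyPrimFunc,
          schedules: List[Transform],
          passes: List[Tuple[int, Transform]]) -> ToyPrimFunc:
    # 1) Schedules only make sense while the function is still schedulable.
    for schedule in schedules:
        if not func.schedulable:
            raise ValueError("schedules must run on S-TIR")
        func = schedule(func)
    # 2) Passes run afterwards, ordered by their (assumed) phase index.
    for _, tir_pass in sorted(passes, key=lambda item: item[0]):
        func = tir_pass(func)
    # 3) The result of the lowering is no longer schedulable (NS-TIR).
    func.schedulable = False
    return func


def insert_extern_calls(func: ToyPrimFunc) -> ToyPrimFunc:
    func.log.append("schedule:insert_extern_calls")
    return func


def generate_constants(func: ToyPrimFunc) -> ToyPrimFunc:
    func.log.append("pass:generate_constants")
    return func


result = lower(ToyPrimFunc("conv1d_relu"),
               schedules=[insert_extern_calls],
               passes=[(0, generate_constants)])
print(result.schedulable, result.log)
```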

CC: @areusch @aca88 @SebastianBoblestETAS @jroesch @manupa-arm @cgerum @philippvk @r.stahl @mjs @ramana-arm

Thanks for posting this!

Overall I think the goals make sense and I agree having a standard API for accelerators to use is a good goal. Just a few questions to kick off the conversation.

Mostly looking to better understand the needs of the three components?

There has been talk of unifying the partitioners to use target-specific annotations in the default fusion/partitioning flow. Do you still need the UMAPartitioner in this case? Or is the goal here to build a stable API which can map onto internal APIs as they change?

In that case would it make more sense to build a data structure representing the patterns vs. using imperative APIs? i.e. self._register_pattern("conv1d_relu", conv1d_relu_pattern())

In terms of the lowering, can some of the same functionality be accomplished by splitting the registration into per-target schedule registrations like normal, and registering the passes using the hook? Just trying to understand the tradeoffs here.

Finally, do you have an example of how the UMACodegen step would work?

@MJKlaiber Thanks for the pre-RFC! Overall I’m very supportive of making it easier to integrate custom accelerators with TVM. I agree it makes sense to consider the set of interfaces an accelerator vendor may need to implement in order to bring their accelerator to TVM and try to harmonize them as much as possible so that there is a straightforward way to do this. This looks like a good way to organize a flow around accelerator development.

One of the challenges with adding several different lowering flows to TVM is understanding the advantages and drawbacks of each (hopefully there are really not so many drawbacks to any flow, but as with any system I’m sure they exist). At a high level, it’d be great if you guys could add additional motivation where you depart from the standard flow to explain what is difficult to accomplish with the existing standard flow. I definitely want to ensure we have flows in TVM to support a wide variety of hardware, but at the same time I want to make sure we minimize complexity of the compiler itself. Having this additional context will make it easier to understand why we need to override the standard flow.

The Prior Art/Alternatives part of the template RFC might also provide some framework that would help us to compare this flow with other options.

Couple other questions about the pre-RFC:

Just curious where you guys have gotten to with this part of the effort. Will this be in the initial PR(s)?

Is it possible to output other things? e.g. if TIR-to-Runtime assembles binary programming for an accelerator, is it possible to also output .bin or similar?

Thanks @areusch and @jroesch for the input and great questions on this pre-RFC. We really appreciate it. As this is a pre-RFC, we felt it was really important to get input from the TVM community as early as possible.

The intent of UMA is mostly to have a stable API, so mapping it to another partitioner activity could really make sense. Could you provide a pointer to a description of the activity you have in mind?

That is great input! In our team discussion we concluded that your proposal to build a data structure representation makes more sense. We are currently also in favor of moving away from multiple base classes to a common UMABackendBase class.

Let me give you one pain point that shows why we think changes in the codegen should be possible for the standard developer: adding an include statement like #include "accelerator_a_lib.h" to the target code requires changing codegen_c.cc and recompiling (at least that is the only solution we are aware of). There are more cases like this, and we think that a Python interface is required.

How would this work? There could be multiple ways, e.g. packed calls into codegen_c. We are trying to think from the user/developer perspective first here.
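From the user's perspective, the include example could look roughly like the sketch below. This is purely illustrative: `ToyCodegen`, `register_include`, `emit` and the C function `accelerator_a_start` are hypothetical names, and the class only prepends text to a string instead of calling into TVM's codegen_c.

```python
from typing import List


class ToyCodegen:
    """Hypothetical Python-side codegen hook; it does not call into TVM."""

    def __init__(self) -> None:
        self._includes: List[str] = []

    def register_include(self, header: str) -> None:
        # Ignore duplicates so repeated registration stays harmless.
        if header not in self._includes:
            self._includes.append(header)

    def emit(self, body: str) -> str:
        # Prepend one #include line per registered header to the C source.
        include_lines = [f'#include "{header}"' for header in self._includes]
        return "\n".join(include_lines + [body])


codegen = ToyCodegen()
codegen.register_include("accelerator_a_lib.h")
codegen.register_include("accelerator_a_lib.h")  # second call has no effect
source = codegen.emit("void run(void) { accelerator_a_start(); }")
print(source)
```

The accelerator developer changes only Python code here, so no rebuild of the compiler would be needed for this kind of customization.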

We are under the impression that there is no “standard flow” for accelerators. There are many paths that lead to the same outcome through the TVM flow. The difficulties for a developer who has to integrate an accelerator are:

  • Defining the steps from Relay graph to TIR and from TIR to target code
  • Finding the hooks to register custom transformations for a new accelerator
  • For some changes a developer has to change the TVM code base and recompile. It’s more convenient for a developer to call a Python interface than to change C++ code and recompile

TensorIR: yes

Relax: Probably not in the first PR. I attended the last Relax meeting and was impressed by the progress and the elegance of the interface. The advantage of UMA would be that it is a stable API, i.e. the move from Relay to Relax should be easier.

Metascheduler: generally yes, depends on the timeline of the first PR.

We currently assume that the primary target will be generated C code. Similar to EthosU, binary command streams will be embedded in the generated C code. Standalone binary command streams are planned, but we do not have a clear opinion on how to implement them.

We are also considering outputting other files, e.g. memory initialization dumps and simulation graphs. @areusch and @jroesch, maybe you can help us to understand what the best options would be in this case. We do not want to make a major change to codegen_c.

CC: @cgerum @paulpb @philippvk @aca88 @SebastianBoblestETAS @r.stahl @jroesch @areusch @tqchen

Hi Michael, thanks for the proposal! Like others I’m very supportive of tightening up the BYOC interfaces.

My group here at OctoML have been looking at bringing a backend placement search capability to TVM, a la the ‘Collage’ paper (https://arxiv.org/pdf/2111.00655.pdf). Under that approach there’s no longer a notion of a BYOC uniquely partitioning the graph according to its rules and heuristics in ‘one shot’. Instead the BYOC must convey the rules (patterns, predicates) for which operators could potentially be offloaded, and leave the actual partitioning to the main Collage searcher.

Currently we have two mechanisms for conveying those rules:

  • pattern tables (triple of label, Relay pattern and predicate over the matched sub-expression)
  • per BYOC backend predicates associated with ops
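The first mechanism, a table of (label, pattern, predicate) triples, can be illustrated with a small stand-alone sketch. It does not use Relay: the "graph" is a flat list of op dictionaries, a pattern is a tuple of op names, and `PatternEntry` and `find_candidates` are hypothetical names. The idea it shows is that the backend only reports which spans could be offloaded and leaves the choice to a separate searcher.

```python
from typing import Callable, Dict, List, NamedTuple, Sequence, Tuple

ToyOp = Dict[str, str]


class PatternEntry(NamedTuple):
    label: str
    pattern: Tuple[str, ...]                     # toy pattern: a chain of op names
    predicate: Callable[[Sequence[ToyOp]], bool]  # check on the matched ops


def find_candidates(ops: List[ToyOp],
                    table: List[PatternEntry]) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) for every span that matches and passes its predicate."""
    candidates = []
    for entry in table:
        width = len(entry.pattern)
        for start in range(len(ops) - width + 1):
            window = ops[start:start + width]
            names = tuple(node["op"] for node in window)
            if names == entry.pattern and entry.predicate(window):
                candidates.append((entry.label, start, start + width))
    return candidates


ops = [
    {"op": "conv1d", "dtype": "int8"},
    {"op": "relu", "dtype": "int8"},
    {"op": "conv1d", "dtype": "float32"},
    {"op": "relu", "dtype": "float32"},
]

table = [
    PatternEntry(
        label="conv1d_relu",
        pattern=("conv1d", "relu"),
        # Assumed constraint for the example: only int8 spans can be offloaded.
        predicate=lambda window: all(node["dtype"] == "int8" for node in window),
    ),
]

candidates = find_candidates(ops, table)
print(candidates)
```

Both conv1d/relu pairs match structurally, but only the first passes the predicate, so only that span is reported as a candidate.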

My feeling is Collage would benefit if there was a well-known way of getting to the former, and we just port over the latter to the former to avoid a proliferation of equivalent mechanisms. Though there is a global pattern registry, it seems folks have realized it is not necessary to use it, so BYOC integrations are inconsistent in their use of it.

Collage would also benefit if BYOC backends could be represented by Targets (as @Mousius at ARM has been working towards). For example, both CUTLASS and TensorRT could be represented by Targets which refine that of the CUDA device. In this way the search space of placements can be controlled by including the relevant Targets in the list of heterogeneous targets, and the result of partitioning (irrespective of which implementation(s) actually do it) can be conveyed by a “target” annotation on a “Primitive” Relay Function.

I don’t think Collage has any implications for how lowering/codegen is dispatched, provided it is keyed by Target. However personally I think it may be better if we decompose that into:

  • well known places in the standard pipeline to insert new passes (esp just before built-in lowering)
  • a pass combinator that can filter based on “target” annotations
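The second bullet, a combinator that restricts a pass to functions carrying a given “target” annotation, might look like this in miniature. The sketch is independent of TVM: a module is a list of dictionaries, the annotation is a plain `"target"` key, and `only_for_target` and `mark_lowered` are hypothetical names.

```python
from typing import Callable, Dict, List

ToyFunction = Dict[str, str]
FunctionPass = Callable[[ToyFunction], ToyFunction]
ModulePass = Callable[[List[ToyFunction]], List[ToyFunction]]


def only_for_target(target: str, func_pass: FunctionPass) -> ModulePass:
    """Wrap a per-function pass so it only touches functions annotated with `target`."""

    def wrapped(module: List[ToyFunction]) -> List[ToyFunction]:
        return [
            func_pass(func) if func.get("target") == target else func
            for func in module
        ]

    return wrapped


def mark_lowered(func: ToyFunction) -> ToyFunction:
    # Return a copy so that the input module is left unchanged.
    return dict(func, lowered_by="accelerator_A")


module = [
    {"name": "conv1d_relu_0", "target": "accelerator_A"},
    {"name": "softmax_0", "target": "c"},
]

accelerator_pass = only_for_target("accelerator_A", mark_lowered)
new_module = accelerator_pass(module)
print(new_module)
```

Registering a backend would then amount to registering its patterns plus a list of such wrapped passes at the agreed places in the pipeline.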

So part of registering a BYOC backend could be to both register the patterns and register the new passes wrapped by the above filtering combinator.

Very happy to work on this with you all – if we can get this right it will make our work much easier!

Best, -Mark

We discussed this a bit offline; posting some brief outcomes of that call here and some additional response I had from before.

Agreed the Python interface is better. Does attaching pragma "import_c" work for your use case? There is also this RFC about tracking lib dependencies properly.

Yeah I agree there isn’t a standard set of steps to take when integrating an accelerator. Here I’m referring more to the set of steps taken by tvm.relay.build when an accelerator or library is offloaded.

We discussed this a bit offline and overall we agree that there is a desire to unify the “plumbing” part of the pipeline (e.g. ensure that UMA’s interface interacts with the standard tvm.relay.build flow using widely-used APIs). Here are the pieces we discussed:

  • Partitioning: UMA is using the standard TVM partitioner, registering patterns using the pattern-table infrastructure, and invoking the same 3 passes used elsewhere to partition. This is a reuse of existing infrastructure so no further discussion is needed here. Should UMA choose to use any different partitioning scheme, we would need to ensure we are agreed on how it marks the end result of partitioning on the IRModule (e.g. which attribute and how does that correspond to target, etc).
  • Codegen: UMA would like to provide a wrapper class which affords users the ability to implement a TIR-to-runtime hook for their target (NOTE: I originally wrote that this was RFC’d but not yet landed in the codebase; cc @Mousius, it has actually landed, my apologies). @MJKlaiber let me know if this is not correct, but I think the overlap is pretty close to my understanding here.
  • Scheduling and post-scheduling passes: UMA would like to allow users to register custom passes and enable them based on some conditions. The exact conditions are yet to be discussed. We’ve discussed adding flexibility to do this based on the presence of a particular Target, but it would be great to spell this out here. Additionally, we need to discuss the points in the compilation flow where these passes should be run.

These last two bits are a bit complex and may be better discussed in a high-bandwidth setting. I’ll organize a community meeting so we can discuss them in an open forum sometime in the next few weeks.

Andrew, thanks for the summary.

Thanks everyone for the great discussion @cgerum @paulpb @philippvk @r.stahl @areusch @jroesch @mbs-octoml!

Correct! Sounds good!

This seems to be already in main:

We agree. Let’s discuss points 2 and 3 in the next community meeting.

CC @cgerum @paulpb @Mousius @jroesch @mbs-octoml @aca88 @SebastianBoblestETAS

You are right, my apologies. I’ll edit the original post.

@areusch @paulpb is there going to be a discussion about this feature, perhaps on a community meeting? I would like to be there, I think this feature will greatly help the future integration of accelerators, something I am extremely interested in.

@fPecc, Andrew @areusch has agreed to put it on the agenda of the next community meeting. It would be great to have as many interested community members there as possible to collect requirements and find a sweet spot for the API.

Hi @MJKlaiber ,

Apologies for not getting back to this in time. Thanks for the proposal! It broadly looks like wrapping the Target Hooks RFC (by @Mousius): https://github.com/apache/tvm-rfcs/blob/main/rfcs/0010-target-registered-compiler-flow-customisation.md, and exposing a nice, structured interface to Python. It is nice to see progress on this.

I would like to suggest potential text changes for the formal RFC for those of us who are familiar with the existing flow (especially around naming).

Maybe it is worth mentioning that these are currently implemented as partition_for_<(backend\target)>?

I am a bit curious why this interface is specifically positioned as an “accelerator” (as in UMA) partitioner though, i.e. would it not also be used for optimized library support as we currently have today with BYOC?

Since the proposal suggests using properly registered targets, is there any reason we should stick to target_name (str) as opposed to the actual TargetKind?

Following up on the above question, what are your thoughts on moving the UMAPartitioner inside relay.build(…) ?

Also, this seems to be proposed on top of S-TIR (as opposed to the “legacy” TE->TIR pipeline). Would you be able to share the motivation for the split into tir_schedules and tir_passes? (I’m asking mainly because they will all be S-TIR → S-TIR IRModule passes.)

Following from the above question, is there an ambition to handover S-TIR back to the core compiler ?

Following up on Mark’s comments,

Mark, we are looking forward to the RFC for this, especially the reference-level explanation, to see where this work is headed. I believe that would be good to know given our mutual interest in structuring BYOC targets.

However, I think we all share the ambition to replace kCompiler strings with targets if we can get more support from the community.

Our current PoC implementation uses kCompiler attributes and the standard MergeComposite, AnnotateTarget and MergeCompilerRegions passes.

The current plan is to move to the Collage implementation by @mbs-octoml as soon as possible, which would move partitioning into relay.build.

We discussed this at the TVM Community Meeting this morning. There was a presentation about the approach followed by some discussion. Thanks @MJKlaiber @cgerum @SebastianBoblestETAS @paulpb @PhilippvK @r.stahl @aca88 for bringing this to the meeting!

Here are some notes (please feel free to correct them if I got anything wrong!):

  • The current graph partitioning approach is the same one that’s used in the compiler today. It’s compatible with the Collage partitioning, which is in the works and not yet RFC’d.

  • Would the v1 support Tensor Expression (TE), or are we skipping that?

    • Mikael understands CreatePrimFunc can support TE so should be natively supported
    • Paolo: using standard lowering as is done by Ethos-U
  • Proposal has an explicit differentiation between S-TIR and NS-TIR. Would there be different hooks? e.g. here we can register TIR scheduling passes vs TIR passes.

    • Will it be possible to contribute S-TIR back to the compiler or just NS-TIR?
      • Scheduling passes work on S-TIR; passes in the boxes behind the schedules are injected into the lowering by pass context. Passes do not return S-TIR. They are part of the lowering from S-TIR to NS-TIR. At the moment, this is done by calling tvm.lower() and injecting those passes into it.
  • In the Relay-to-TIR hook, already trying to figure out the lowering order, which might not match the partitioning order. Want to see memory available after compiling C functions but before lowering Ethos-U functions. Any thoughts on whether it’s possible to configure the order of partitioning in this flow?

    • Why? Need to see the amount of live memory available after running the default TVM flow.
    • Relay passes can see the whole IRModule, past that only functions for a particular target are seen by a TIR pass.
    • The order needs to be decided and it varies by registration point.
  • Q: Are there common accelerator passes that are in use in TVM, or does everyone do something different?

    • There are common touch points, those are the “plumbing” mentioned in this slide presentation. e.g. Graph partitioning, scheduling, code-generation.
    • UMA isn’t trying to box anyone into a particular flow; instead it’s just trying to suggest one way of doing this from a broader set of options, to serve as a guide for folks who may be new to TVM.
  • Question from Federico, who is integrating an accelerator of his own.

    • VTA uses memory scopes to define buffers in block-ram. Are we planning to accommodate that in UMA?
      • You could write your own schedules and passes to do this. storage_scope is kind of the way to do this at the runtime level. You can also leverage USMP to define memory pools and use it as a pass to schedule.

Thanks everyone for the detailed input and feedback!

To keep track of the latest version of the UMA pre-RFC and to add the great suggestions that we got from this discussion thread, I created a document in our tvm-rfc fork :

CC: @areusch @mbs-octoml @jroesch @cgerum @paulpb @PhilippvK @r.stahl @aca88 @SebastianBoblestETAS @manupa-arm

thanks! feel free to open an RFC PR and we can iterate there if you like.

PR in TVM-RFC:

Hi community,

we are going to present the progress on the UMA RFC in today’s TVM community meeting.

Most important discussion points during the RFC review phase:

  • Target attributes
  • Phase naming: int vs enum
  • Interaction/Overlap with Collage
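For readers who did not follow the review, the “int vs enum” point can be illustrated with a small sketch. The phase names below are assumptions chosen to mirror the pre-/post-partitioning order from the original post, and `PassPhase`, `register_relay_pass` and the phase assigned to each pass are hypothetical, not the API that was agreed on.

```python
from enum import IntEnum
from typing import List, Tuple


class PassPhase(IntEnum):
    # Hypothetical phase names; the values only define the ordering.
    PRE_PARTITIONING = 0
    POST_PARTITIONING = 1


registered: List[Tuple[PassPhase, str]] = []


def register_relay_pass(phase: int, name: str) -> None:
    # PassPhase(phase) raises ValueError for integers that are not a defined phase,
    # which a bare int parameter would silently accept.
    registered.append((PassPhase(phase), name))


register_relay_pass(PassPhase.POST_PARTITIONING, "BufferScopeAnnotator")
register_relay_pass(PassPhase.PRE_PARTITIONING, "ConfigGenerator")

try:
    register_relay_pass(7, "TypoPass")
    rejected = False
except ValueError:
    rejected = True

# Tuples sort by phase first, so execution order follows the enum values.
ordered = [name for _, name in sorted(registered)]
print(ordered, rejected)
```

With plain integers the call sites stay short, while an enum documents the available phases and catches invalid values at registration time.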

Thanks for the great discussion and input @areusch @manupa-arm @mbs-octoml @lhutton1 @sunggg !

Concrete next steps are tracked in this issue:

CC: @tqchen @SebastianBoblestETAS @aca88 @UlrikHjort @Khoi @lhutton1 @sunggg

Tracking issue:

https://github.com/apache/tvm/issues/11260

Michael, I’ve tested the UMA CLI test script for the vanilla mockup.
Now I would like to compile my TFLite model with the UMA backend.
Could you share a sample script?

I first loaded a model using: mod = tvmc.load("model.tflite")
Then I created the UMA backend and registered it.
Then I passed the model to uma_backend.partition() but got multiple errors.

Could you post the code and the error messages you are getting?

CC: @cgerum @paulpb