[RFC] UMA: Universal Modular Accelerator Interface

@MJKlaiber Thanks for the pre-RFC! Overall I’m very supportive of making it easier to integrate custom accelerators with TVM. I agree it makes sense to consider the set of interfaces an accelerator vendor may need to implement in order to bring their accelerator to TVM and try to harmonize them as much as possible so that there is a straightforward way to do this. This looks like a good way to organize a flow around accelerator development.

One of the challenges with adding several different lowering flows to TVM is understanding the advantages and drawbacks of each (hopefully there are really not so many drawbacks to any flow, but as with any system I’m sure they exist). At a high level, it’d be great if you guys could add additional motivation where you depart from the standard flow to explain what is difficult to accomplish with the existing standard flow. I definitely want to ensure we have flows in TVM to support a wide variety of hardware, but at the same time I want to make sure we minimize complexity of the compiler itself. Having this additional context will make it easier to understand why we need to override the standard flow.

The Prior Art/Alternatives part of the template RFC might also provide some framework that would help us to compare this flow with other options.

Couple other questions about the pre-RFC:

Just curious where you guys have gotten to with this part of the effort. Will this be in the initial PR(s)?

Is it possible to output other things? e.g. if TIR-to-Runtime assembles a binary program for an accelerator, is it possible to also output a .bin or similar?


Thanks @areusch and @jroesch for the input and great questions on this PRE-RFC :+1:. We really appreciate it. As this is a pre-RFC, we felt it is really important to get input from the TVM community as early as possible :slight_smile: .

The intent of UMA is mostly to have a stable API, so mapping it to another partitioner activity could really make sense. Could you provide a pointer to a description of the activity you have in mind?

That is great input! In our team discussion we concluded that your proposal to build a data structure representation makes more sense. We are currently also in favor of moving away from multiple base classes to a common UMABackendBase class.

Let me give you one of the pain points that shows why we think changes in the codegen should be possible for a typical developer: adding an include statement like #include "accelerator_a_lib.h" to the target code requires changing codegen_c.cc and recompiling (at least that is the solution we are aware of). There are more cases like this, and we think that a Python interface is required.

How would this work? There could be multiple ways, e.g. packed calls into codegen_c; we are trying to think from the user/developer perspective first here.
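To make this concrete, here is a rough sketch of what such a Python-level hook could look like; the names (UMABackendBase, _register_codegen, gen_includes) are placeholders for illustration, not a settled API:

# Hypothetical sketch only: a backend contributes extra include lines through a
# Python hook instead of editing codegen_c.cc and recompiling TVM.

def gen_includes() -> str:
    # Extra lines to prepend to the generated target C source.
    return '#include "accelerator_a_lib.h"\n'

# Inside a backend class derived from the proposed UMABackendBase, roughly:
#   self._register_codegen(fmt="c", includes=gen_includes)
# where packed calls into codegen_c would carry the returned string to C++.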

We are under the impression that there is no "standard flow" for accelerators; there are many paths through the TVM flow that lead to the same outcome. What is difficult for a developer who has to integrate an accelerator is:

  • Defining the steps from Relay graph to TIR and from TIR to target code
  • Finding the hooks to register custom transformations for a new accelerator
  • For some changes, a developer has to modify the TVM code base and recompile. It's more convenient for a developer to call a Python interface than to change C++ code and recompile (see the sketch below)
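As a sketch of the kind of Python interface we have in mind (the pass body and the registration method names below are placeholders, not a finalized API):

# Illustration only: a vendor-defined Relay pass written and registered purely
# in Python, so no change to the TVM C++ code base or recompilation is needed.
import tvm


@tvm.transform.module_pass(opt_level=0)
class AcceleratorARewrite:
    """Placeholder Relay-level rewrite for a hypothetical accelerator_a."""

    def transform_module(self, mod, ctx):
        # Accelerator-specific annotation/legalization would go here.
        return mod


# Registration against the backend object would then look roughly like:
#   backend.register_relay_pass(phase=0, relay_pass=AcceleratorARewrite())
#   backend.register_tir_pass(phase=0, tir_pass=my_tir_pass)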

TensorIR: yes

Relax: Probably not in the first PR. I attended the Relax meeting last time and was impressed by the progress and the elegance of the interface. An advantage of UMA would be that it is a stable API, i.e. the move from Relay to Relax should be easier.

Metascheduler: generally yes, depends on the timeline of the first PR.

We currently assume that the primary output will be generated C code. Similar to Ethos-U, binary command streams will be embedded in the generated C code. Standalone binary command streams are planned, but we do not yet have a clear opinion on how to implement them.

We are also considering outputting other files, e.g. memory initialization dumps and simulation graphs. @areusch and @jroesch, maybe you can help us understand what the best options would be in this case. We do not want to create a major change in codegen_c.

CC: @cgerum @paulpb @philippvk @aca88 @SebastianBoblestETAS @r.stahl @jroesch @areusch @tqchen

Hi Michael, thanks for the proposal! Like others I’m very supportive of tightening up the BYOC interfaces.

My group here at OctoML have been looking at bringing a backend placement search capability to TVM, a la the ‘Collage’ paper (https://arxiv.org/pdf/2111.00655.pdf). Under that approach there’s no longer a notion of a BYOC uniquely partitioning the graph according to its rules and heuristics in ‘one shot’. Instead the BYOC must convey the rules (patterns, predicates) for which operators could potentially be offloaded, and leave the actual partitioning to the main Collage searcher.

Currently we have two mechanisms for conveying those rules:

  • pattern tables (triple of label, Relay pattern and predicate over the matched sub-expression)
  • per BYOC backend predicates associated with ops

My feeling is Collage would benefit if there were a well-known way of getting to the former, and we should just port the latter over to the former to avoid a proliferation of equivalent mechanisms. Though there is a global pattern registry, it seems folks have realized it is not necessary to use it, so BYOC integrations are inconsistent in their use of it.
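For readers less familiar with the pattern-table mechanism, a minimal entry looks roughly like this ("accelerator_a" and its predicate are invented for illustration):

# A (label, pattern, predicate) triple registered through the standard
# pattern-table infrastructure; "accelerator_a" is a made-up backend name.
from tvm.relay.dataflow_pattern import is_op, wildcard
from tvm.relay.op.contrib.register import register_pattern_table


def _conv2d_pattern():
    return is_op("nn.conv2d")(wildcard(), wildcard())


def _conv2d_supported(extract):
    # Predicate over the matched sub-expression, e.g. restrict layouts/dtypes.
    return extract.attrs.data_layout == "NCHW"


@register_pattern_table("accelerator_a")
def accelerator_a_patterns():
    return [("accelerator_a.conv2d", _conv2d_pattern(), _conv2d_supported)]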

Collage would also benefit if BYOC backends could be represented by Targets (as @Mousius at ARM has been working towards). For example, both CUTLASS and TensorRT could be represented by Targets which refine that of the CUDA device. In this way the search space of placements can be controlled by including the relevant Targets in the list of heterogeneous targets, and the result of partitioning (irrespective of which implementation(s) actually do it) can be conveyed by a "target" annotation on a "Primitive" Relay Function.

I don’t think Collage has any implications for how lowering/codegen is dispatched, provided it is keyed by Target. However personally I think it may be better if we decompose that into:

  • well known places in the standard pipeline to insert new passes (esp just before built-in lowering)
  • a pass combinator that can filter based on “target” annotations

So part of registering a BYOC backend could be to both register the patterns and register the new passes wrapped by the above filtering combinator.
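To illustrate the filtering combinator idea, here is a rough sketch under the assumption that partitioned functions carry today's "Compiler" string attribute (tomorrow possibly a Target); the helper name and structure are mine, not an existing API:

# Sketch of a combinator that applies a rewrite only to partitioned functions
# annotated for a given backend; "Compiler" is what today's BYOC flow attaches.
from tvm import relay


def only_for_target(target_name, rewrite_fn):
    @relay.transform.function_pass(opt_level=0)
    class _Filtered:
        def transform_function(self, func, mod, ctx):
            attrs = func.attrs
            if attrs is not None and "Compiler" in attrs and attrs["Compiler"] == target_name:
                return rewrite_fn(func)
            return func

    return _Filtered()


# Usage, e.g.:
#   seq = tvm.transform.Sequential([only_for_target("accelerator_a", my_rewrite)])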

Very happy to work on this with you all – if we can get this right it will make our work much easier!

Best, -Mark


We discussed this a bit offline; posting some brief outcomes of that call here and some additional response I had from before.

Agreed the Python interface is better. Does attaching pragma "import_c" work for your use case? There is also this RFC about tracking lib dependencies properly.
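For context, the pragma route would look roughly like this at the TE level; I am modelling the "import_c" key on the existing "import_llvm" pragma, so treat the exact key name as an assumption:

# Sketch (pragma key assumed by analogy with "import_llvm"): attach extra C
# source via a schedule pragma so it ends up in the generated C code.
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1, name="B")
s = te.create_schedule(B.op)

extra_c = '#include "accelerator_a_lib.h"\n'
s[B].pragma(B.op.axis[0], "import_c", extra_c)

# mod = tvm.build(s, [A, B], target="c")  # the include lands in the C output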

Yeah I agree there isn’t a standard set of steps to take when integrating an accelerator. Here I’m referring more to the set of steps taken by tvm.relay.build when an accelerator or library is offloaded.

We discussed this a bit offline and overall we agree that there is a desire to unify the "plumbing" part of the pipeline (e.g. ensure that UMA's interface interacts with the standard tvm.relay.build flow using widely-used APIs). Here are the pieces we discussed:

  • Partitioning: UMA is using the standard TVM partitioner, registering patterns using the pattern-table infrastructure, and invoking the same 3 passes used elsewhere to partition. This is a reuse of existing infrastructure so no further discussion is needed here. Should UMA choose to use any different partitioning scheme, we would need to ensure we are agreed on how it marks the end result of partitioning on the IRModule (e.g. which attribute and how does that correspond to target, etc).
  • Codegen: UMA would like to provide a wrapper class which affords users the ability to implement a TIR-to-runtime hook (NOTE: RFC'd but not yet landed in the codebase, cc @Mousius; actually this has landed, my apologies) for their target. @MJKlaiber let me know if this is not correct, but I think the overlap is pretty close to my understanding here.
  • Scheduling and post-scheduling passes: UMA would like to allow users to register custom passes and enable them based on some conditions. The exact conditions are yet to be discussed. We’ve discussed adding flexibility to do this based on the presence of a particular Target, but it would be great to spell this out here. Additionally, we need to discuss the points in the compilation flow where these passes should be run.

These last two bits are a bit complex and may be better discussed in a high-bandwidth setting. I’ll organize a community meeting so we can discuss them in an open forum sometime in the next few weeks.

Andrew, thanks for the summary :slight_smile: .

Thanks everyone for the great discussion @cgerum @paulpb @philippvk @r.stahl @areusch @jroesch @mbs-octoml!

Correct! Sounds good!

This seems to be already in main:

We agree. Let's discuss points 2 and 3 in the next community meeting.

CC @cgerum @paulpb @Mousius @jroesch @mbs-octoml @aca88 @SebastianBoblestETAS

You are right, my apologies. I’ll edit the original post.

@areusch @paulpb is there going to be a discussion about this feature, perhaps at a community meeting? I would like to be there; I think this feature will greatly help the future integration of accelerators, something I am extremely interested in.


@fPecc, Andrew @areusch has agreed to put it on the agenda of the next community meeting. Would be great to have as many interested community members there as possible to collect requirements and find a sweet spot for the API :+1:.


Hi @MJKlaiber ,

Apologies for not getting back to this in time. Thanks for the proposal! It broadly looks like wrapping the Target Hooks RFC (by @Mousius): https://github.com/apache/tvm-rfcs/blob/main/rfcs/0010-target-registered-compiler-flow-customisation.md, and exposing a nice/structured interface to Python. It is nice to see progress on this :slight_smile: .

I would like to suggest potential text changes for the formal RFC to make it clearer to those of us who are familiar with the existing flow (especially around naming).

Maybe it is worth mentioning that these are currently implemented as partition_for_<backend|target>?
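For concreteness, those entry points typically have this shape today (simplified; "accelerator_a" stands in for a real backend and pattern-table name):

# Typical shape of a partition_for_<backend> helper in today's BYOC
# integrations; "accelerator_a" is a placeholder backend name.
import tvm
from tvm import relay
from tvm.relay.op.contrib.register import get_pattern_table


def partition_for_accelerator_a(mod, params=None):
    if params:
        mod["main"] = relay.build_module.bind_params_by_name(mod["main"], params)
    seq = tvm.transform.Sequential(
        [
            relay.transform.InferType(),
            relay.transform.MergeComposite(get_pattern_table("accelerator_a")),
            relay.transform.AnnotateTarget("accelerator_a"),
            relay.transform.MergeCompilerRegions(),
            relay.transform.PartitionGraph(),
        ]
    )
    return seq(mod)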

I am a bit curious why this interface is specifically positioned as an "accelerator" (as in UMA) partitioner, though. i.e. would it not also be used to support optimized libraries, as we currently have today with BYOC?

Since the proposal suggests using properly registered targets, is there any reason we should stick to target_name (str) as opposed to the actual TargetKind?

Following up on the above question, what are your thoughts on moving the UMAPartitioner inside relay.build(…) ?

Also, this seems to propose using S-TIR (as opposed to the "legacy" TE->TIR pipeline). Would you be able to share the motivation for the split between tir_schedules and tir_passes? (I'm asking mainly because they will all be S-TIR → S-TIR IRModule passes.)

Following from the above question, is there an ambition to handover S-TIR back to the core compiler ?

Following up on Mark’s comments,

Mark, we are really looking forward to the RFC for this, especially the reference-level explanation, to see where this work is headed; I believe that would be good to know given our mutual interest in structuring BYOC targets.

However, I think we all share the ambition to replace kCompiler strings with targets, if we can get more support from the community.

Our current PoC implementation uses kCompiler attributes and the standard MergeComposite, AnnotateTarget, and MergeCompilerRegions passes.

The current plan is to move to the Collage implementation by @mbs-octoml as soon as possible, which would move partitioning into relay.build.

We discussed this at the TVM Community Meeting this morning. There was a presentation about the approach followed by some discussion. Thanks @MJKlaiber @cgerum @SebastianBoblestETAS @paulpb @PhilippvK @r.stahl @aca88 for bringing this to the meeting!

Here are some notes (please feel free to correct them if I got anything wrong!):

  • The current graph partitioning approach is the same one that's used in the compiler today. It's compatible with the Collage partitioning, which is in the works and not yet RFC'd.

  • Would the v1 support Tensor Expression (TE), or are we skipping that?

    • Michael understands CreatePrimFunc can support TE, so TE should be natively supported
    • Paolo: using standard lowering as is done by Ethos-U
  • The proposal has an explicit differentiation between S-TIR and NS-TIR. Would there be different hooks? e.g. here we can register TIR scheduling passes vs TIR passes.

    • Will it be possible to contribute S-TIR back to the compiler or just NS-TIR?
      • Scheduling passes work on S-TIR; passes in the boxes behind the schedules are injected into the lowering via the pass context. These passes do not return S-TIR; they are part of the lowering from S-TIR to NS-TIR. At the moment this is done by calling tvm.lower() and injecting those passes into tvm.lower() (see the sketch after these notes).
  • In the Relay-to-TIR hook, we are already trying to figure out the lowering order, which might not match the partitioning order. We want to see the memory available after compiling C functions but before lowering Ethos-U functions. Any thoughts on whether it's possible to configure the order of partitioning in this flow?

    • Why? Need to see the amount of live memory available after running the default TVM flow.
    • Relay passes can see the whole IRModule, past that only functions for a particular target are seen by a TIR pass.
    • The order needs to be decided and it varies by registration point.
  • Q: Are there common accelerator passes that are in use in TVM, or does everyone do something different?

    • There are common touch points; those are the "plumbing" mentioned in this slide presentation, e.g. graph partitioning, scheduling, code generation.
    • UMA isn't trying to box anyone into a particular flow; instead it's just trying to suggest one way of doing this from a broader set of options, to serve as a guide for folks who may be new to TVM.
  • Question from Federico, who is integrating an accelerator of his own.

    • VTA uses memory scopes to define buffers in block-ram. Are we planning to accommodate that in UMA?
      • You could write your own schedules and passes to do this. storage_scope is kind of the way to do this at the runtime level. You can also leverage USMP to define memory pools and use it as a pass to schedule.
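Regarding how those extra TIR passes get injected during lowering (mentioned in the notes above), here is a minimal sketch using the existing "tir.add_lower_pass" option; the pass body itself is a placeholder:

# Sketch: a placeholder TIR pass injected into lowering through the
# "tir.add_lower_pass" PassContext option.
import tvm


@tvm.tir.transform.prim_func_pass(opt_level=0)
def accelerator_tir_pass(func, mod, ctx):
    # Backend-specific TIR rewriting would happen here.
    return func


pass_ctx = tvm.transform.PassContext(
    config={"tir.add_lower_pass": [(1, accelerator_tir_pass)]}
)
# with pass_ctx:
#     lowered = tvm.lower(sched, args)  # injected pass runs during lowering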

Thanks everyone for the detailed input and feedback!

To keep track of the latest version of the UMA pre-RFC and to add the great suggestions that we got from this discussion thread, I created a document in our tvm-rfc fork:

CC: @areusch @mbs-octoml @jroesch @cgerum @paulpb @PhilippvK @r.stahl @aca88 @SebastianBoblestETAS @manupa-arm

thanks! feel free to open an RFC PR and we can iterate there if you like.


PR in TVM-RFC:


Hi community,

we are going to present the progress on the UMA RFC in today’s TVM community meeting.

Most important discussion points during the RFC review phase:

  • Target attributes
  • Phase naming: int vs enum
  • Interaction/Overlap with Collage

Thanks for the great discussion and input @areusch @manupa-arm @mbs-octoml @lhutton1 @sunggg !

Concrete next steps are tracked in this issue:

CC: @tqchen @SebastianBoblestETAS @aca88 @UlrikHjort @Khoi @lhutton1 @sunggg

Tracking issue:

https://github.com/apache/tvm/issues/11260

Michael, I've tested the UMA CLI test script for the vanilla mockup.
Now I would like to compile my TFLite model with the UMA backend.
Could you share a sample script?

I first loaded a model using: mod = tvmc.load("model.tflite")
Then I created the UMA backend and registered it.
Then I passed the model to uma_backend.partition() but got multiple errors.

Could you post the code and the error messages you are getting?

CC: @cgerum @paulpb

Here's the sample code:

import tvm
from tvm.driver import tvmc
# VanillaAcceleratorBackend comes from the UMA vanilla_accelerator example.

mod = tvmc.load(r"/shared/model.tflite")
mod.summary()

uma_backend = VanillaAcceleratorBackend()
uma_backend.register()
mod = uma_backend.partition(mod)
target = tvm.target.Target("vanilla_accelerator", host=tvm.target.Target("c"))

package = tvmc.compile(mod, target=target)
result = tvmc.run(package, device=device)  # device defined elsewhere, e.g. "cpu"
print(result)


Got the following error:

Traceback (most recent call last):
  File "/shared/run_custom.py", line 107, in <module>
    main()
  File "/shared/run_custom.py", line 76, in main
    mod = uma_backend.partition(mod)
  File "/usr/uma/python/tvm/relay/backend/contrib/uma/backend.py", line 299, in partition
    return self._relay_to_relay.partition(mod, params)
  File "/usr/uma/python/tvm/relay/backend/contrib/uma/api/partitioner.py", line 96, in partition
    mod = relay.transform.InferType()(mod)
  File "/usr/uma/python/tvm/ir/transform.py", line 161, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "/usr/uma/python/tvm/_ffi/_ctypes/packed_func.py", line 223, in __call__
    values, tcodes, num_args = _make_tvm_args(args, temp_args)
  File "/usr/uma/python/tvm/_ffi/_ctypes/packed_func.py", line 188, in _make_tvm_args
    raise TypeError("Don't know how to handle type %s" % type(arg))
TypeError: Don't know how to handle type <class 'tvm.driver.tvmc.model.TVMCModel'>
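(For reference, partition() expects a Relay IRModule, while tvmc.load() returns a TVMCModel. Below is a sketch of one way to get an IRModule, assuming the TFLite frontend flow from the from_tflite example and the VanillaAcceleratorBackend used above; the input name, shape and dtype are placeholders.)

# Sketch: obtain a relay.IRModule via the TFLite frontend and partition that,
# instead of passing the TVMCModel returned by tvmc.load().
import tflite
from tvm import relay

tflite_buf = open("/shared/model.tflite", "rb").read()
tflite_model = tflite.Model.GetRootAsModel(tflite_buf, 0)

# Input name/shape/dtype below are placeholders for the real model signature.
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "int8"},
)

uma_backend = VanillaAcceleratorBackend()  # as in the snippet above
uma_backend.register()
mod = uma_backend.partition(mod)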

I modified the code and loaded the TFLite model as done in the TVM from_tflite.py example, then replaced the generation of "mod" in create_conv2d() in the run.py example. Now I am getting another error; it seems that the vanilla accelerator is not recognized by the scheduler:

  1: tvm::relay::OpImplementation::Schedule(tvm::Attrs const&, tvm::runtime::Array<tvm::te::Tensor, void> const&, tvm::Target const&)
  0: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<TVMFuncCreateFromCFunc::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#2}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) [clone .cold]
  File "/usr/uma/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
    rv = local_pyfunc(*pyargs)
  File "/usr/uma/python/tvm/relay/op/strategy/generic.py", line 114, in schedule_reduce
    return topi.generic.schedule_reduce(outs)
  File "/usr/uma/python/tvm/topi/generic/nn.py", line 597, in schedule_reduce
    return _default_schedule(outs, True)
  File "/usr/uma/python/tvm/topi/generic/default.py", line 28, in default_schedule
    raise RuntimeError("schedule not registered for '%s'" % target)
RuntimeError: schedule not registered for 'vanilla_accelerator'