[RFC] UMA: Universal Modular Accelerator Interface

You are right, my apologies. I’ll edit the original post.

@areusch @paulpb is there going to be a discussion about this feature, perhaps at a community meeting? I would like to be there; I think this feature will greatly help the future integration of accelerators, something I am extremely interested in.


@fPecc, Andrew @areusch has agreed to put it on the agenda of the next community meeting. Would be great to have as many interested community members there as possible to collect requirements and find a sweet spot for the API :+1:.


Hi @MJKlaiber ,

Apologies for not getting back to this sooner. Thanks for the proposal! It broadly looks like it wraps the Target Hooks RFC (by @Mousius): https://github.com/apache/tvm-rfcs/blob/main/rfcs/0010-target-registered-compiler-flow-customisation.md and exposes a nice, structured interface to Python. It is nice to see progress on this :slight_smile:.

I would like to suggest some potential text changes for the formal RFC from the perspective of those of us who are familiar with the existing flow (especially around naming).

Maybe it is worth mentioning that these are currently implemented as partition_for_<backend/target> functions?

I am a bit curious why this interface is specifically positioned as an “accelerator” (as in UMA) partitioner, though. i.e., could it not also be used to support optimized libraries, as we currently do today with BYOC?

Since the proposal suggests using properly registered targets, is there any reason we should stick to target_name (str) as opposed to the actual TargetKind?

Following up on the above question, what are your thoughts on moving the UMAPartitioner inside relay.build(…) ?

Also, this seems to propose using S-TIR (as opposed to the “legacy” TE → TIR pipeline). Would you be able to share the motivation for splitting tir_schedules and tir_passes? (I’m asking mainly because they will all be S-TIR → S-TIR IRModule passes.)

Following on from the above question, is there an ambition to hand S-TIR back to the core compiler?

Following up on Mark’s comments,

Mark, we are very much looking forward to the RFC for this, especially the reference-level explanation, to see where this work is headed – which I believe would be good to know given our mutual interest in structuring BYOC targets.

However, I think we all share the ambition to replace kCompiler strings with targets if we can get more support from the community.

Our current PoC implementation uses kCompiler attributes and the standard MergeComposite, AnnotateTarget, and MergeCompilerRegions passes.
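Roughly, that standard sequence is composed like the following sketch (pattern_table and "my_accelerator" are placeholders for the patterns and compiler name a concrete backend would register):

from tvm import relay, transform

def partition_for_my_accelerator(mod):
    seq = transform.Sequential(
        [
            relay.transform.MergeComposite(pattern_table()),   # pattern_table: the registered patterns (placeholder)
            relay.transform.AnnotateTarget("my_accelerator"),  # sets the kCompiler attribute on matched ops
            relay.transform.MergeCompilerRegions(),
            relay.transform.PartitionGraph(),
        ]
    )
    return seq(mod)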

The current plan is to move to the collage implementation by @mbs-octoml as soon as possible, which would move partitioning into relay.build.

We discussed this at the TVM Community Meeting this morning. There was a presentation about the approach followed by some discussion. Thanks @MJKlaiber @cgerum @SebastianBoblestETAS @paulpb @PhilippvK @r.stahl @aca88 for bringing this to the meeting!

Here are some notes (please feel free to correct them if I got anything wrong!):

  • The current graph partitioning approach is the same one that’s used in the compiler today. It’s compatible with the collage partitioning which is in the works and not yet RFC’d.

  • Would the v1 support Tensor Expression (TE), or are we skipping that?

    • Michael understands that CreatePrimFunc can support TE, so it should be natively supported
    • Paolo: using standard lowering as is done by Ethos-U
  • The proposal has an explicit differentiation between S-TIR and NS-TIR. Would there be different hooks? e.g. here we can register TIR scheduling passes vs TIR passes.

    • Will it be possible to contribute S-TIR back to the compiler or just NS-TIR?
      • Scheduling passes work on S-TIR; passes in the boxes behind the schedules are injected into the lowering by pass context. Passes do not return S-TIR. They are part of the lowering from S-TIR to NS-TIR. At the moment, this means calling tvm.lower() and injecting those passes into tvm.lower() (see the sketch after these notes).
  • In the Relay-to-TIR hook, we are already trying to figure out the lowering order, which might not match the partitioning order. We want to see the memory available after compiling C functions but before lowering Ethos-U functions. Any thoughts on whether it’s possible to configure the order of partitioning in this flow?

    • Why? Need to see the amount of live memory available after running the default TVM flow.
    • Relay passes can see the whole IRModule, past that only functions for a particular target are seen by a TIR pass.
    • The order needs to be decided and it varies by registration point.
  • Q: Are there common accelerator passes that are in use in TVM, or does everyone do something different?

    • There are common touch points; those are the “plumbing” mentioned in the slide presentation, e.g. graph partitioning, scheduling, code generation.
    • UMA isn’t trying to box anyone into a particular flow; instead, it’s just trying to suggest one way of doing this, from a broader set of options, to serve as a guide for folks who may be new to TVM.
  • Question from Federico, who is integrating an accelerator of his own.

    • VTA uses memory scopes to define buffers in block-ram. Are we planning to accommodate that in UMA?
      • You could write your own schedules and passes to do this. storage_scope is kind of the way to do this at the runtime level. You can also leverage USMP to define memory pools and use it as a pass to schedule.
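For context, “injecting those passes into tvm.lower()” presumably refers to something like the tir.add_lower_pass option of the pass context; a rough sketch (my_tir_pass and the phase number are illustrative placeholders):

import tvm
from tvm import te, tir

@tir.transform.prim_func_pass(opt_level=0)
def my_tir_pass(func, mod, ctx):
    # a real backend pass would rewrite the PrimFunc here
    return func

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)

with tvm.transform.PassContext(config={"tir.add_lower_pass": [(2, my_tir_pass)]}):
    lowered = tvm.lower(s, [A, B])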

Thanks everyone for the detailed input and feedback!

To keep track of the latest version of the UMA pre-RFC and to add the great suggestions that we got from this discussion thread, I created a document in our tvm-rfc fork :

CC: @areusch @mbs-octoml @jroesch @cgerum @paulpb @PhilippvK @r.stahl @aca88 @SebastianBoblestETAS @manupa-arm

thanks! feel free to open an RFC PR and we can iterate there if you like.


PR in TVM-RFC:


Hi community,

we are going to present the progress on the UMA RFC in today’s TVM community meeting.

Most important discussion points during the RFC review phase:

  • Target attributes
  • Phase naming: int vs enum
  • Interaction/Overlap with Collage

Thanks for the great discussion and input @areusch @manupa-arm @mbs-octoml @lhutton1 @sunggg !

Concrete next steps are tracked in this issue:

CC: @tqchen @SebastianBoblestETAS @aca88 @UlrikHjort @Khoi @lhutton1 @sunggg

Tracking issue:

https://github.com/apache/tvm/issues/11260

Michael, I’ve tested the UMA CLI test script for the vanilla mockup.
Now I would like to compile my TFLite model with the UMA backend.
Could you share a sample script?

I first loaded a model using: mod = tvmc.load("model.tflite")
Then I created the UMA backend and registered it.
Then I passed the model to uma_backend.partition() but got multiple errors.

Could you post the code and the error messages you are getting?

CC: @cgerum @paulpb

Here’s the sample code:

mod = tvmc.load(r"/shared/model.tflite")
mod.summary()

uma_backend = VanillaAcceleratorBackend()
uma_backend.register()
mod = uma_backend.partition(mod)
target = tvm.target.Target("vanilla_accelerator", host=tvm.target.Target("c"))

package = tvmc.compile(mod, target=target)
result = tvmc.run(package, device=device)  # device defined elsewhere, e.g. "cpu"
print(result)


Got the following error:

Traceback (most recent call last):
  File "/shared/run_custom.py", line 107, in <module>
    main()
  File "/shared/run_custom.py", line 76, in main
    mod = uma_backend.partition(mod)
  File "/usr/uma/python/tvm/relay/backend/contrib/uma/backend.py", line 299, in partition
    return self._relay_to_relay.partition(mod, params)
  File "/usr/uma/python/tvm/relay/backend/contrib/uma/api/partitioner.py", line 96, in partition
    mod = relay.transform.InferType()(mod)
  File "/usr/uma/python/tvm/ir/transform.py", line 161, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "/usr/uma/python/tvm/_ffi/_ctypes/packed_func.py", line 223, in __call__
    values, tcodes, num_args = _make_tvm_args(args, temp_args)
  File "/usr/uma/python/tvm/_ffi/_ctypes/packed_func.py", line 188, in _make_tvm_args
    raise TypeError("Don't know how to handle type %s" % type(arg))
TypeError: Don't know how to handle type <class 'tvm.driver.tvmc.model.TVMCModel'>
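So partition() apparently expects a Relay IRModule rather than the TVMCModel wrapper returned by tvmc.load(). An untested sketch of the workaround, assuming the Relay module and params are available as the TVMCModel’s mod/params attributes:

tvmc_model = tvmc.load(r"/shared/model.tflite")
mod, params = tvmc_model.mod, tvmc_model.params   # underlying Relay IRModule + params

uma_backend = VanillaAcceleratorBackend()
uma_backend.register()
mod = uma_backend.partition(mod, params)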

I modified the code and loaded the TFLite model as done in the TVM from_tflite.py example, then replaced the generation of “mod” in create_conv2d() in the run.py example. Now I am getting another error: it seems that vanilla_accelerator is not recognized by the scheduler.

1: tvm::relay::OpImplementation::Schedule(tvm::Attrs const&, tvm::runtime::Array<tvm::te::Tensor, void> const&, tvm::Target const&)
0: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<TVMFuncCreateFromCFunc::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#2}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) [clone .cold]
  File "/usr/uma/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
    rv = local_pyfunc(*pyargs)
  File "/usr/uma/python/tvm/relay/op/strategy/generic.py", line 114, in schedule_reduce
    return topi.generic.schedule_reduce(outs)
  File "/usr/uma/python/tvm/topi/generic/nn.py", line 597, in schedule_reduce
    return _default_schedule(outs, True)
  File "/usr/uma/python/tvm/topi/generic/default.py", line 28, in default_schedule
    raise RuntimeError("schedule not registered for '%s'" % target)
RuntimeError: schedule not registered for 'vanilla_accelerator'

We’ll shortly provide an example of importing an NN from an ONNX/TFLite file.
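In the meantime, a rough sketch of what that will probably look like (the input name, shape, and dtype are placeholders for the actual model):

import tflite
from tvm import relay

with open("model.tflite", "rb") as f:
    tflite_model = tflite.Model.GetRootAsModel(f.read(), 0)

mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "int8"},
)

uma_backend = VanillaAcceleratorBackend()
uma_backend.register()
mod = uma_backend.partition(mod)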

Just added the things we discussed about UMA into a branch of the UMA RFC:

  • TVMC integration
  • Mock-accelerators for tutorial

Additional things are in an early phase and are intended to enable an early discussion among the people interested in contributing to or helping shape UMA.

Feel free to comment here or in the PR:

CC: @areusch @SebastianBoblestETAS @aca88 @manupa-arm @cgerum @paulpb @PhilippvK @r.stahl @UlrikHjort @kslavka


@MJKlaiber @areusch I’ve run the latest UMA test pipeline on a custom TFLite model and would like to raise one issue. I checked out the latest TVM on the main branch (SHA1 038f15b5e204120709186a8791e5b49986060bb0), then ran tvm/tests/python/contrib/test_uma/test_uma_pipeline.py. UMA successfully generated C code, and here is the issue: the C code for the convolution implementation is repeated for each convolution function.

e.g. tvmgen_default_vanilla_accelerator_main_0, tvmgen_default_vanilla_accelerator_main_1, … tvmgen_default_vanilla_accelerator_main_k

These functions contain the same convolution implementation code. Based on Michael’s RFC, I assumed there would be multiple calls to the Vanilla my_ai_hw_conv2dnchw() function with the relevant kernel and input sizes.

Please let me know what you think: is this the way TVM is built, or did I make a mistake in my setup? How can UMA generate C code that calls my custom convolution implementation (a function call rather than duplicated C code)?

Thanks, Slava.

Hello, UMA is a great interface for custom accelerator vendors; it simplifies the BYOC process a lot.

I’m building a workflow from a pre-trained model to compiled C source for a backend (ARM core + custom accelerator). As our accelerator supports only int8/int16 operands, I took a quantized ONNX model (int8) into the frontend. From the Relay graph, I see the pattern of interest would be “qnn.conv2d”. The

uma_backend.partition(mod)

was successful, but I ran into some errors when creating the PrimFunc. I’m not sure if you could provide an example for a quantized operator; as far as I know, many custom accelerators work in the low-precision integer domain, so such an example would definitely make sense.

To register the operator strategy for “qnn.conv2d”, I used

wrap_compute_conv2d(topi.arm_cpu.conv2d_nchw_int8), wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_nchw_int8),

But I’m not sure if this is the correct way.
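For context, my attempt looks roughly like this (simplified sketch; the strategy function and the _register_operator_strategy hook are my assumptions based on my reading of the RFC, and the topi kernels are the arm_cpu ones named above):

from tvm import topi
from tvm.relay.op import op as _op
from tvm.relay.op.strategy.generic import wrap_compute_conv2d, wrap_topi_schedule

def custom_conv2d_strategy(attrs, inputs, out_type, target):
    # single implementation reusing the arm_cpu int8 compute/schedule
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_conv2d(topi.arm_cpu.conv2d_nchw_int8),
        wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_nchw_int8),
        name="conv2d_nchw_int8.accelerator",
    )
    return strategy

# registered in the backend's __init__ (assumption based on the RFC):
# self._register_operator_strategy("qnn.conv2d", custom_conv2d_strategy)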

I appreciate any hints from you. Chen

Hello Chen, yes, quantized operators are not directly lowerable to TIR; there are a few possibilities to handle this.

  1. Your approach is somewhat feasible, but it has the problem that you are most likely ignoring the zero point / scale of your computation. If your hardware accelerator only supports a single value for scale and zero_point, it might still be usable as is.
  2. Your approach can be augmented by adding the quantization parameters as attributes to the TE; for examples I need to refer you to the ethos-u backend for now. I hope I can provide a full-fledged example shortly.
  3. You can run relay.qnn.transform.CanonicalizeOps() as a pre- or post-partitioning pass (see the sketch below). In this case you do not need to register a custom operator strategy, but the generated TIR is much more complicated.
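A rough sketch of option 3 as a pre-partitioning step (assuming mod is the Relay module from the ONNX frontend and uma_backend is your registered backend; your MergeComposite patterns would then have to match the canonicalized nn.conv2d form):

from tvm import relay

mod = relay.qnn.transform.CanonicalizeOps()(mod)  # lowers qnn.conv2d etc. into plain Relay ops
mod = uma_backend.partition(mod)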

I personally would use option 2 at the moment.
