[RFC] UMA: Universal Modular Accelerator Interface

Here's a code sample:

import tvm
from tvm.driver import tvmc

mod = tvmc.load(r"/shared/model.tflite")
mod.summary()

# VanillaAcceleratorBackend as defined in the UMA tutorial / test setup
uma_backend = VanillaAcceleratorBackend()
uma_backend.register()
mod = uma_backend.partition(mod)
target = tvm.target.Target("vanilla_accelerator", host=tvm.target.Target("c"))

package = tvmc.compile(mod, target=target)
result = tvmc.run(package, device="cpu")  # assuming CPU-hosted execution
print(result)


Got the following error:

Traceback (most recent call last):
  File "/shared/run_custom.py", line 107, in <module>
    main()
  File "/shared/run_custom.py", line 76, in main
    mod = uma_backend.partition(mod)
  File "/usr/uma/python/tvm/relay/backend/contrib/uma/backend.py", line 299, in partition
    return self._relay_to_relay.partition(mod, params)
  File "/usr/uma/python/tvm/relay/backend/contrib/uma/api/partitioner.py", line 96, in partition
    mod = relay.transform.InferType()(mod)
  File "/usr/uma/python/tvm/ir/transform.py", line 161, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "/usr/uma/python/tvm/_ffi/_ctypes/packed_func.py", line 223, in __call__
    values, tcodes, num_args = _make_tvm_args(args, temp_args)
  File "/usr/uma/python/tvm/_ffi/_ctypes/packed_func.py", line 188, in _make_tvm_args
    raise TypeError("Don't know how to handle type %s" % type(arg))
TypeError: Don't know how to handle type <class 'tvm.driver.tvmc.model.TVMCModel'>

I modified the code: I loaded the TFLite model as done in the TVM from_tflite.py example, then replaced the generation of "mod" in create_conv2d() in the run.py example. Roughly, the loading now looks like this (a sketch; the input name, shape, and dtype are model-specific assumptions):
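import tflite
import tvm.relay as relay

tflite_model_buf = open("/shared/model.tflite", "rb").read()
tflite_model = tflite.Model.GetRootAsModel(tflite_model_buf, 0)

# Input name/shape/dtype are placeholders; they depend on the model.
shape_dict = {"input": (1, 224, 224, 3)}
dtype_dict = {"input": "float32"}

mod, params = relay.frontend.from_tflite(
    tflite_model, shape_dict=shape_dict, dtype_dict=dtype_dict
)
mod = uma_backend.partition(mod)  # partition() now receives a relay IRModule

Now I'm getting another error; it seems that the vanilla accelerator is not recognized by the scheduler: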

1: tvm::relay::OpImplementation::Schedule(tvm::Attrs const&, tvm::runtime::Array<tvm::te::Tensor, void> const&, tvm::Target const&)
0: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<TVMFuncCreateFromCFunc::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#2}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) [clone .cold]
  File "/usr/uma/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
    rv = local_pyfunc(*pyargs)
  File "/usr/uma/python/tvm/relay/op/strategy/generic.py", line 114, in schedule_reduce
    return topi.generic.schedule_reduce(outs)
  File "/usr/uma/python/tvm/topi/generic/nn.py", line 597, in schedule_reduce
    return _default_schedule(outs, True)
  File "/usr/uma/python/tvm/topi/generic/default.py", line 28, in default_schedule
    raise RuntimeError("schedule not registered for '%s'" % target)
RuntimeError: schedule not registered for 'vanilla_accelerator'

We'll shortly provide an example of importing an NN from an ONNX/TFLite file.

Just added the things we discussed about UMA into a branch of the UMA RFC:

  • TVMC integration
  • Mock-accelerators for tutorial

Additional things are at an early stage and are intended to enable an early discussion among the people interested in contributing to or helping shape UMA.

Feel free to comment here or in the PR.

CC: @areusch @SebastianBoblestETAS @aca88 @manupa-arm @cgerum @paulpb @PhilippvK @r.stahl @UlrikHjort @kslavka


@MJKlaiber @areusch I've run the latest UMA test pipeline on a custom TFLite model and would like to raise one issue. I checked out the latest TVM on the main branch (SHA1 038f15b5e204120709186a8791e5b49986060bb0), then ran tvm/tests/python/contrib/test_uma/test_uma_pipeline.py. UMA successfully generated .c code, and here is the issue: the C code for the convolution implementation is repeated for each convolution function.

e.g. tvmgen_default_vanilla_accelerator_main_0, tvmgen_default_vanilla_accelerator_main_1, … tvmgen_default_vanilla_accelerator_main_k

These functions contain the same convolution implementation code. Based on Michael's RFC, I assumed there would be multiple calls to the Vanilla my_ai_hw_conv2dnchw() function with the relevant kernel and input sizes.
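To illustrate what I expected: roughly, a TIR pass that swaps each lowered conv2d body for a single extern call, along these lines (a hypothetical sketch; as I understand it, the real UMA template pass also derives the kernel/input sizes from the buffer shapes and only rewrites the offloaded functions):

import tvm
from tvm import tir

@tvm.tir.transform.prim_func_pass(opt_level=2)
def conv2d_to_extern_call(func, mod, ctx):
    # Replace the whole lowered conv2d body with one call to the accelerator
    # kernel, passing the buffer handles through unchanged.
    ifmap, weights, result = func.params
    call = tir.call_extern("int32", "my_ai_hw_conv2dnchw", ifmap, weights, result)
    return func.with_body(tir.Evaluate(call))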

Please let me know what you think: is this just the way TVM is built, or did I make a mistake in my setup? How can UMA generate C code that calls my custom convolution implementation (a function call rather than duplicated C code)?

Thanks, Slava.

Hello, UMA is a great interface for custom accelerator vendors; it alleviates the BYOC process a lot.

I'm building a workflow from a pre-trained model to compiled C source for a backend (an ARM core + custom accelerator). As our accelerator supports only int8/int16 operands, I fed a quantized ONNX model (int8) into the frontend. From the Relay graph, I see the pattern of interest would be "qnn.conv2d". The call to

uma_backend.partition(mod)

was successful, but I ran into some errors when creating the PrimFunc. I wonder if you could provide an example for a quantized operator; as far as I know, many custom accelerators work in the low-precision integer domain, so such an example would definitely make sense.

To register the operator strategy for "qnn.conv2d", I used:

wrap_compute_conv2d(topi.arm_cpu.conv2d_nchw_int8),
wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_nchw_int8),

But I’m not sure if this is the correct way.
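In context, the registration looked roughly like this (a sketch; the strategy function name and the use of UMA's _register_operator_strategy hook are my assumptions, and this most likely ignores the qnn zero points/scales):

from tvm import topi
from tvm.relay.op import op as _op
from tvm.relay.op.strategy.generic import wrap_compute_conv2d, wrap_topi_schedule

def qnn_conv2d_strategy(attrs, inputs, out_type, target):
    # Reuse the int8 NCHW conv2d compute/schedule from the arm_cpu TOPI.
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_conv2d(topi.arm_cpu.conv2d_nchw_int8),
        wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_nchw_int8),
        name="qnn_conv2d.my_accelerator",
    )
    return strategy

# in the UMABackend subclass's __init__:
#     self._register_operator_strategy("qnn.conv2d", qnn_conv2d_strategy)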

I appreciate any hints from you. Chen

Hello Chen, yes, quantized operators are not directly lowerable to TIR. There are a few possibilities to handle this:

  1. Your approach is somewhat feasible, but it has the problem that you are most likely ignoring the zero point/scale of your computation. If your hardware accelerator only supports a single value for scale and zero_point, it might still be usable as is.
  2. Your approach can be augmented by adding the quantization parameters as attributes to the TE; for now I need to refer you to the ethos-u backend for examples. I hope I can provide a full-fledged example shortly.
  3. You can run relay.qnn.transform.CanonicalizeOps() as a pre- or post-partitioning pass (see the sketch below). In this case you do not need to register a custom operator strategy, but the generated TIR is much more complicated.
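For option 3, a minimal sketch of wiring the canonicalization into a backend (assuming a UMABackend subclass; the import paths and PassPhase hook follow my reading of the UMA sources, and the target name is made up):

from tvm import relay
from tvm.relay.backend.contrib.uma.api.utils import PassPhase
from tvm.relay.backend.contrib.uma.backend import UMABackend

class MyQuantizedBackend(UMABackend):
    """Hypothetical backend that canonicalizes qnn ops before partitioning."""

    def __init__(self):
        super().__init__()
        # Lower qnn.* ops to plain relay arithmetic before partitioning, so no
        # custom operator strategy is needed (at the cost of more complex TIR).
        self._register_relay_pass(
            PassPhase.PRE_PARTITIONING, relay.qnn.transform.CanonicalizeOps()
        )

    @property
    def target_name(self):
        return "my_quantized_accelerator"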

I personally would use option 2 at the moment.


Hi guys, we are also interested in using UMA for quantized models (int8), but I don't know where to begin… The floating-point example provided for UMA was very helpful! Is it possible to have a similar example for quantized models? This could be useful for many others: HW accelerators usually use quantized models rather than float, so this would probably be the more relevant use case.

@chen_liu, did you manage to progress with the ideas from cgerum? Any ideas and code snippets would be great!

Thanks, Koby


Let’s discuss this in the next UMA meeting.

@cgerum, what are your thoughts here?

Hi guys, can you please take a look at my question here:

I could copy it here as well, but I don't think it would be a good idea to have duplicate questions… Next time I'll know it's better to post UMA-related questions here :grinning:

Thanks, Koby

Hi guys, I tried running a quantized TFLite model in the UMA test, without any pattern matching for the Vanilla Accelerator (we still couldn't find a way to create a pattern that catches "qnn.conv2d" operations; the sketch below is roughly what we tried), and the test always fails, no matter what model I try… even for a very simple model with a single conv2d. It also runs about 10 times more slowly than a float model. For example, a very simple float model (2-3 conv2d layers) takes about 10 sec to run, but a similar quantized model takes about 1:30 min… Any idea why it runs so much more slowly, and why it could fail when TVM generates the code itself (without UMA replacing functions with the accelerator functions)?
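For reference, this is the kind of pattern we tried to register (a sketch; the qnn.conv2d argument layout is taken from the Relay docs, and the pattern name is arbitrary):

from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard

def qnn_conv2d_pattern():
    # qnn.conv2d takes data, weight, the input/kernel zero points,
    # and the input/kernel scales (the last four are usually constants)
    return is_op("qnn.conv2d")(
        wildcard(), wildcard(),
        is_constant(), is_constant(),
        is_constant(), is_constant(),
    )

# registered in the backend's __init__, analogous to the float conv2d pattern:
#     self._register_pattern("qnn_conv2d", qnn_conv2d_pattern())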

Thanks, Koby

Hi cgerum, thanks for the reply. I agree option 2 would be the most straightforward approach to address quantized ops. However, after some trials I still couldn't figure out how to handle the relay op "qnn.conv2d".

My accelerator has a "shift_bits" parameter to handle the scales. For a quantized model, extracting the zero points/scales is definitely required, but I'm not sure whether this step should be done in a registered relay pass or in a custom operator strategy.
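In a relay pass, I imagine extracting them would look something like this rough sketch (the visitor is hypothetical; how to feed the values into the accelerator configuration is exactly what I'm unsure about):

import tvm
from tvm.relay.expr_functor import ExprVisitor

class ExtractQnnConv2dParams(ExprVisitor):
    """Collect zero points and scales from every qnn.conv2d in an expression."""

    def __init__(self):
        super().__init__()
        self.configs = []

    def visit_call(self, call):
        if isinstance(call.op, tvm.ir.Op) and call.op.name == "qnn.conv2d":
            # args: data, weight, input_zero_point, kernel_zero_point,
            #       input_scale, kernel_scale
            _, _, in_zp, k_zp, in_scale, k_scale = call.args
            self.configs.append((in_zp, k_zp, in_scale, k_scale))
        super().visit_call(call)

# usage: v = ExtractQnnConv2dParams(); v.visit(mod["main"]); print(v.configs)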

I think it would be very helpful if you or the UMA team could release a second example addressing a quantized op (e.g. qnn.conv2d). The generated C API can be just a dummy function; for custom accelerators, the most important thing, in my opinion, is to extract parameters from the relay op for accelerator configuration. If you are working on this, I would very much appreciate a rough time plan! Thanks, Chen


Hi, I am looking for some help with UMA too: I couldn't get a Keras model to compile for the vanilla accelerator. There's a potential bug as well, so it might be worth checking out.

Thanks @slai-nick for bringing up these kinds of questions.

@cgerum @kslavka @SebastianBoblestETAS @paulpb @PhilippvK @areusch @r.stahl, should we bring these kinds of things into a UMA meeting? Any interest?


@UlrikHjort @Khoi Yes, that is interesting. How frequently are UMA meetings scheduled?

Best, Sebastian


I've made some progress with this and I am now able to compile and run a model from Keras for the vanilla accelerator.

I am wondering now, how can I manage memory transfers with UMA in the case where the host and device are separate?

Another question I have for my particular case: my accelerator has a relatively small matrix-multiplier core, and most algorithms using it are going to call it in a loop. Is there a way I can decompose operations so that they call the multiplier core on the TVM side, in order to be able to apply further optimisation passes? Or can I only provide pre-written kernels/routines for each operation in order to call my matmul op?

Maybe we can start with once per month or so.

We could use the TVM Community Meeting for this as well, if you like.


Hi, I have read about your issues with adding a quantized accelerator with UMA, and now I have the same problems. I used the pattern from the ethos-u backend, but I think it fails in the lowering step, since it may not be able to use the same TIR pass as a normal convolution for Vanilla. I was wondering if you could help me with this problem. Another question: was an example of quantized operations released? I could not find anything so far.

Thanks in advance, Samira

Hi, I have read the thread about adding quantized operations, but I still have some problems because I don't know what exactly I should add to support that completely. I added the quantized conv pattern the same way as in the ethos-u example. I was wondering if an example of quantized operations has been released, since I could not find anything so far.