How to effectively retarget/port TVM to new AI accelerators?

Hi,

I believe this point has been touched on in the past, but I would like to bring it up again.

Currently, a growing number of SoC vendors are shipping new AI SoCs together with SDKs that aim to help deploy models on these devices. Unfortunately, in many cases these SDKs are still far from mature (e.g., only a few operators are supported) and only handle a few simple, typical CNN models.

Of course, in these cases TVM becomes an attractive option, as one of its main and initial goals is precisely to close this software gap. However, it looks like retargeting TVM to these new accelerators is not a trivial task. For example, Qualcomm offers an SDK (Snapdragon Neural Processing Engine, SNPE), but they have also been working on retargeting TVM to the Hexagon since May of last year, I believe.

So I was wondering: are there already efforts or guidelines to make the retargeting/porting process more manageable?

Moreover, I have some concrete questions about the process of retargeting TVM:

  1. Is it possible to directly use the OpenCL target in TVM if the AI accelerator supports it?
  2. Must an AI accelerator be programmable (e.g., a DSP) for TVM to target it? Or is it possible to target accelerators with a more limited interface (e.g., ASIC designs)?
  3. Is uTVM one way to target new accelerators, which in general do not have a fully fledged OS or run only a minimal RTOS?

Your input and thoughts are highly appreciated.

@tqchen @thierry


I can try to answer some of your questions, since I have done some work on Qualcomm Hexagon support and can share my experience.

Question 1, I think yes.

Question 2: it depends on how you use TVM. For example, to support the Hexagon DSP natively in TVM, you would need to implement the Hexagon runtime in TVM (e.g., implement the DeviceAPI interface) and also complete the codegen part for Hexagon. Now that we have the external codegen path, you could instead combine TVM with Qualcomm's NNLib. SNPE is not an ideal candidate, because SNPE does not expose a graph-level API.
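To give a feel for what "implement the DeviceAPI interface" involves, here is a conceptual Python sketch loosely mirroring TVM's C++ `DeviceAPI` (the real interface in `include/tvm/runtime/device_api.h` also has methods like `SetDevice` and `StreamSync`). The `ToyDeviceAPI` backend below just simulates device memory with host bytearrays; it is an illustration, not real Hexagon support.

```python
from abc import ABC, abstractmethod

class DeviceAPI(ABC):
    """Conceptual mirror of the device-side runtime hooks TVM expects."""

    @abstractmethod
    def alloc_data_space(self, nbytes):
        """Allocate nbytes of device memory and return a handle."""

    @abstractmethod
    def free_data_space(self, handle):
        """Release device memory."""

    @abstractmethod
    def copy_data_from_to(self, src, dst):
        """Move bytes between host and device buffers."""

class ToyDeviceAPI(DeviceAPI):
    def __init__(self):
        self._buffers = {}
        self._next_handle = 0

    def alloc_data_space(self, nbytes):
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(nbytes)
        return handle

    def free_data_space(self, handle):
        del self._buffers[handle]

    def copy_data_from_to(self, src, dst):
        if isinstance(src, (bytes, bytearray)):   # host -> device
            self._buffers[dst][:len(src)] = src
        else:                                      # device -> host
            return bytes(self._buffers[src])

dev = ToyDeviceAPI()
h = dev.alloc_data_space(4)
dev.copy_data_from_to(b"\x01\x02\x03\x04", h)
print(dev.copy_data_from_to(h, None))  # b'\x01\x02\x03\x04'
```

Besides these memory hooks, a native port also needs the codegen side (lowering TVM IR to something the DSP toolchain can compile), which is the part the external codegen path lets you skip.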


Considering our recent efforts on bringing your own codegen (BYOC) to TVM:

  1. You can, but there may be a performance issue if you directly use the existing TOPI schedules, because those schedules were designed for other targets such as the ARM GPU.

  2. With BYOC, your AI accelerator can expose either C/C++ APIs (like MKLDNN) or just an ISA. You can refer to this tutorial for details on how to integrate your AI accelerator's compiler into TVM.

  3. I am not familiar with uTVM so I’ll pass this one to @thierry :smiley:


Agree with @tico, targeting a new accelerator is not a trivial task. In my opinion the main culprits are (1) a lack of clear documentation on the steps needed to perform this task (e.g., extending the device API, CodeGen that isn't C-style, etc.) and (2) low-level IR schedule transformations not being flexible enough to codegen directly for the accelerator without custom passes.

Thanks @comaniac for the reference to the tutorial, it looks very interesting at first glance.

Are there still open discussions/developments regarding “bring your own codegen to TVM” that you could point me to?

BYOC is a new feature that we have been working on for a while. We actually just finished merging the feature PR and the tutorial in the past two weeks, so it doesn't have many resources yet. All the features available so far are described in the tutorial.

Meanwhile, the last missing piece is a user interface for accelerator providers to indicate which parts of a CNN model can be offloaded to the accelerator. The discussion is in [RFC] naming and API signature for external compilation. Related to this feature, [RFC][External Codegen] Defining 'Composite' Relay operators proposes an approach for providers to specify composite Relay patterns. You are welcome to give the tutorial a try and let us know if you have any feedback.
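As a rough illustration of the "composite" idea (the op names, the pattern table, and the `myaccel.` prefix below are invented for this sketch; the actual RFC defines its own API), a composite pass fuses a sequence of Relay operators that matches a vendor pattern into a single node the external codegen can claim as one kernel:

```python
# Vendor pattern table: a sequence of ops -> one composite kernel name.
# Everything here is hypothetical, for illustration only.
PATTERNS = {("conv2d", "relu"): "myaccel.conv2d_relu"}

def make_composites(ops):
    """Greedily replace matching op sequences with composite nodes."""
    out, i = [], 0
    while i < len(ops):
        for pat, name in PATTERNS.items():
            if tuple(ops[i:i + len(pat)]) == pat:
                out.append(name)          # fused into one composite node
                i += len(pat)
                break
        else:
            out.append(ops[i])            # left for other backends
            i += 1
    return out

print(make_composites(["conv2d", "relu", "dense"]))
# ['myaccel.conv2d_relu', 'dense']
```

The point is that an accelerator with a fused conv2d+relu kernel can claim the whole pattern, instead of TVM offering it the two ops separately.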


Hi @comaniac ,

I have a follow up question regarding BYOC

If I have a compiler for my accelerator, is the codegen tutorial meant to show how to cross-compile the entire model for the accelerator, or is the idea that you have a host that offloads some operators (e.g., Conv2D) to the accelerator? I mean like the case of a Qualcomm SoC, in which the ARM core is the master and some operators are offloaded to the Hexagon DSP.

@weberlo I wanted to ask whether uTVM is expected to play any role in the BYOC feature for targeting accelerators that in general do not have an OS. I am a bit confused about the differences in use cases between BYOC and uTVM.

Hi @tico, I’ve also been working with the BYOC infrastructure recently. The ‘annotation’ mechanism (the part which decides which operators are sent to which device/compiler) is not there yet. You can sort of do it manually, but it’s very painful. Hopefully in the next week or so we’ll have an RFC on an annotation mechanism and can produce an accompanying tutorial.

Once that functionality is in, yes you’ll be able to offload parts of the network to an accelerator and have the rest run on CPU.


Great to hear that the offloading feature is WIP!

I was wondering what the API/interface with the device/compiler looks like. Is a library provided by the accelerator vendor required, or are a cross-compiler for the accelerator and some communication API enough?

Looking forward to the RFC and the tutorial!

The graph partitioner will take the entire Relay graph representing the network and create subgraphs from that which can be compiled via your external compiler. You then need to create a pass which consumes that Relay subgraph and translates it to something your external compiler can understand and then compile (you can write this directly in TVM). src/relay/backend/contrib/dnnl/codegen.cc shows an example of this (although it sounds like your case may be a bit more complex).
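To make the translation pass described above concrete, here is a toy sketch: it walks an already-partitioned subgraph (represented here simply as a list of ops with their input names) and emits C-style calls into a hypothetical vendor library. The op names, `myaccel_*` function names, and buffer naming scheme are all invented for illustration; a real pass would consume Relay IR, as in `src/relay/backend/contrib/dnnl/codegen.cc`.

```python
# Hypothetical mapping from Relay op names to vendor library calls.
VENDOR_CALLS = {"nn.conv2d": "myaccel_conv2d", "nn.relu": "myaccel_relu"}

def translate_subgraph(subgraph):
    """Emit one vendor-library call per op; op i writes its result to buf<i>."""
    lines = []
    for i, (op, inputs) in enumerate(subgraph):
        args = ", ".join(inputs + [f"buf{i}"])
        lines.append(f"{VENDOR_CALLS[op]}({args});")
    return "\n".join(lines)

# A two-op subgraph: conv2d(data, weight) -> buf0, then relu(buf0) -> buf1.
sg = [("nn.conv2d", ["data", "weight"]), ("nn.relu", ["buf0"])]
print(translate_subgraph(sg))
# myaccel_conv2d(data, weight, buf0);
# myaccel_relu(buf0, buf1);
```

The emitted text would then be handed to the external compiler, while ops outside the subgraph keep going through TVM's normal codegen.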

Ok, understood. I will have a look at the file that you mentioned.

I have one more quick question: is AutoTVM going to be supported to optimize offloaded operators?

It’s not something I’ve seen any proposals for yet. However, my assumption would be no because AutoTVM acts on TVM functions (that is, operators that go through TVM’s code generation).

For details about the API/interface for vendor compilers, I suggest taking a look at the tutorial.

For AutoTVM, we do not support auto-tuning for BYOC yet. Like @matt-arm mentioned, AutoTVM targets TVM functions. In one sentence, AutoTVM finds the best config from a tuning space defined by a given TVM schedule function (e.g., a TOPI schedule). In other words, AutoTVM cannot derive a tuning space if the schedule is not implemented in TVM schedule primitives.

In the long term, we may propose a representation for vendors to specify a tuning space so that we can leverage AutoTVM to tune performance for external codegen as well, but we are currently not prioritizing this task due to a lack of bandwidth and driving applications.
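To illustrate what "finding the best config from a tuning space" means in the abstract, here is a toy grid search over tile sizes. The `fake_measure` cost function is a made-up stand-in for real on-device timing, and the tile-size space is arbitrary; real AutoTVM derives the space from knobs declared in the schedule and measures candidates on hardware.

```python
import itertools

def fake_measure(tile_x, tile_y):
    """Pretend cost model: prefers square tiles that evenly divide 64."""
    divisibility_penalty = (64 % tile_x) + (64 % tile_y)
    return abs(tile_x - tile_y) + divisibility_penalty

# The "tuning space": every (tile_x, tile_y) pair from a set of candidates.
space = list(itertools.product([4, 8, 16, 32], repeat=2))

# Exhaustive search; AutoTVM uses smarter strategies (e.g. cost models).
best = min(space, key=lambda cfg: fake_measure(*cfg))
print(best)
```

The point of the discussion above is that without a schedule written in TVM primitives, there are no knobs from which to build `space` in the first place, which is why BYOC kernels currently fall outside AutoTVM.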

Ok, now I see the challenge regarding the AutoTVM support.

Now, after looking at an answer from @tqchen in the following post, I was wondering what the differences are, in terms of use cases and implementation effort, between using BYOC and creating a new backend under the target/source/ directory when retargeting TVM.

One of the major differences is that BYOC allows vendors to generate only code (mainly wrappers) that can be understood by their own backends, without really exposing the backend/library details. For example, you can register your contrib codegen with TVM and generate a wrapper conv2d_ that calls your own library. You then create a simple runtime to interpret the generated artifact: when it sees conv2d, it invokes your own kernel.

This is different from what we have under src/target/source, which are TVM-compatible codegen tools whose generated code can be understood by the TVM runtime for execution, because we have already added the schedules and computes for those kernels.
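A minimal sketch of the BYOC flow described above: the "codegen" emits a flat artifact of opaque call names rather than real compiled code, and a tiny vendor runtime interprets that artifact, dispatching each call to the vendor's own kernel. The artifact format, the `myaccel.` names, and the stand-in kernels are all invented for illustration.

```python
def codegen(subgraph):
    """Emit wrapper call names instead of TVM-compiled code."""
    return [f"myaccel.{op}" for op in subgraph]

# The vendor's own kernels, hidden behind the wrapper names.
KERNELS = {
    "myaccel.conv2d": lambda xs: [v * 2 for v in xs],   # stand-in math
    "myaccel.relu":   lambda xs: [max(v, 0) for v in xs],
}

def vendor_runtime(artifact, data):
    """Interpret the generated artifact: each call invokes a vendor kernel."""
    for call in artifact:
        data = KERNELS[call](data)
    return data

artifact = codegen(["conv2d", "relu"])
print(vendor_runtime(artifact, [-1, 2]))  # [0, 4]
```

The contrast with src/target/source is that here TVM never sees the kernel bodies; it only routes execution to the vendor runtime through the wrapper artifact.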

I have a follow up question regarding BYOC.

If I create a wrapper to call a Conv2D from my own library, do I also have to implement all the other operators required by my model, or can the existing backends generate code for them as usual? In other words, I am wondering how I can execute parts of a Relay graph using my own library (e.g., the Conv2D operators) and the rest using standard backends (e.g., LLVM for ARM/x86). This is not yet clear to me from the BYOC tutorial.

Thanks!

Yep, you can use both your own library and TVM codegen for the rest :slight_smile:

Ok, that's great, as this is of course what makes the most sense, but I was not 100% sure :slight_smile:

Is there any example that shows what this looks like with an actual model, say in TensorFlow? I guess my question is whether there is a concrete example showing how the partitioning is used.

We are discussing the way of annotating supported operators in [RFC] Op based annotation for external codegen. You are welcome to provide your thoughts :slight_smile: