I think there are four ways to port TVM to a new AI accelerator.
1 : BYOC. BYOC can offload the ops that your new device supports to that device. BYOC is simple and elegant, but we can't use AutoTVM with BYOC, and I think AutoTVM is a very important feature of TVM.
2 : Tensorize. By using TVM's schedule primitive tensorize, we can replace a unit of computation with the corresponding hardware intrinsic, such as a GEMM instruction. We can use AutoTVM this way, but we may need to use tensorize to modify every op's schedule.
3 : Like cuDNN. We can make TVM call the new device's library the same way it calls cuDNN for the GPU. This way is not better than BYOC.
4 : Like GPU/CPU. We can add a new target to TVM, just like the existing GPU/CPU targets. We need to develop a compute and a schedule for every op, and we also need to develop graph optimizations for this new device. We can use AutoTVM this way, but this way is the most time-consuming and the most difficult.
I think that if we only have an op-level API for the new device, BYOC is the best way.
If we have an ISA-level interface for the new device, which way is the best?