Hi all, we have designed an NPU for our SoC that is not like VTA. What do we need to do to run TVM on this AI core? The NPU has its own ISA, but we have not yet implemented an LLVM backend for it. Can anyone suggest the main steps (tasks) needed to get TVM running on the chip? Thanks for any reply!
I assume you don’t have your own NPU acceleration library, so you can’t go the TVM BYOC route.
Firstly, you should implement your own quantization algorithm for your NPU (not every operation or data type, such as int64, may be available on your NPU).
Secondly, you should consider providing your own Relay graph passes to better support your chip (for example, if your NPU restricts which operators it supports, or has its own data layout requirements).
Thirdly, you should implement TVM passes for your NPU and (maybe) your own schedule primitives (for example, for your NPU’s own memory hierarchy).
Fourthly, you should complete code generation. You mention you don’t currently have an LLVM backend. In that case, you should consider implementing your own code generation for TIR (for example, emitting assembly instructions directly).
Fifthly, you should talk to the NPU driver team and co-design an API so that we know how to load the compiled binary. This will affect how we design or change the TVM runtime component to suit your NPU.
Thank you so much for your reply! But I’m a little confused by your answer; sorry for my ignorance about DNN compilers.
Firstly, you should implement your own quantization algorithm…
- What does “quantization algorithm” mean, and what is it for?
- Is there any project or example of adding a new backend that I could use for reference? Thanks again!
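For readers with the same question: quantization here generally means mapping float32 tensors to low-bit integers (int8, for example) that the NPU can compute on natively. Below is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The function names are hypothetical, it does not use TVM's API, and it is not necessarily the scheme the reply above had in mind.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: real ~= scale * q, with q in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Avoid a zero scale for an all-zero (or empty) tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate real values."""
    return [scale * x for x in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

The point for an NPU port is that the integer tensor `q` plus one `scale` per tensor is what the hardware would consume; how the scale is chosen (per tensor, per channel, calibrated on data) is a design decision that depends on which data types the chip supports.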
See this blog post about BYOC: How to Bring Your Own Codegen to TVM
For more examples, you can search for “BYOC” in the pull request titles of upstream TVM. Usually the PR titled “[BYOC] XXX integration” is the one that introduces a new backend.
Thanks for your suggestion! I will read the TVM source code for more information.
Thanks for your reply!
Could you please give some pointers on which functions are related to code generation for TIR, or point to an example in the current TVM source code?
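As a rough illustration of what "code generation for TIR" involves: a code generator is typically a visitor that walks the IR tree and emits target code for each node, which is the pattern to look for in files such as codegen_llvm.cc (mentioned later in this thread). The sketch below uses a toy expression IR and an invented instruction set (`LOAD`, `MOVI`, `ADD`, `MUL`); none of these classes or mnemonics come from TVM or from any real NPU.

```python
from dataclasses import dataclass
from typing import List, Union


# Toy IR nodes; these are NOT TVM's TIR classes.
@dataclass
class Var:
    name: str


@dataclass
class Const:
    value: int


@dataclass
class Add:
    a: "Expr"
    b: "Expr"


@dataclass
class Mul:
    a: "Expr"
    b: "Expr"


Expr = Union[Var, Const, Add, Mul]


class ToyAsmCodegen:
    """Post-order visitor that emits one hypothetical instruction per IR node."""

    def __init__(self) -> None:
        self.lines: List[str] = []
        self._next_reg = 0

    def _fresh(self) -> str:
        # Naive register allocation: every value gets a new register.
        reg = f"r{self._next_reg}"
        self._next_reg += 1
        return reg

    def visit(self, node: Expr) -> str:
        # Dispatch on the node type and return the register holding the result.
        return getattr(self, "visit_" + type(node).__name__)(node)

    def visit_Var(self, node: Var) -> str:
        dst = self._fresh()
        self.lines.append(f"LOAD {dst}, [{node.name}]")
        return dst

    def visit_Const(self, node: Const) -> str:
        dst = self._fresh()
        self.lines.append(f"MOVI {dst}, {node.value}")
        return dst

    def _binary(self, op: str, node) -> str:
        lhs = self.visit(node.a)
        rhs = self.visit(node.b)
        dst = self._fresh()
        self.lines.append(f"{op} {dst}, {lhs}, {rhs}")
        return dst

    def visit_Add(self, node: Add) -> str:
        return self._binary("ADD", node)

    def visit_Mul(self, node: Mul) -> str:
        return self._binary("MUL", node)


# Generate code for (x * 2) + y.
gen = ToyAsmCodegen()
result_reg = gen.visit(Add(Mul(Var("x"), Const(2)), Var("y")))
print("\n".join(gen.lines))
```

A real backend would add visitors for statements (loops, stores, allocations) and would need real register allocation and instruction selection; the sketch only shows the overall shape of the traversal.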
@chiuchiu, hello, have you solved this problem? I would like to get some suggestions.
Hello, we are also working on this. Perhaps we can discuss it and learn together?
I apologize for missing this thread earlier. If you have some common questions, it would be great for us to collect them to help us improve our documentation. If you have specific questions, post them here and I’ll try to answer them, so that we can begin building better documentation for folks in your position.
Hi, thank you very much for your willingness to help. I will describe my situation and clarify some problems that others may also have.
My task and our DSP chip: I am trying to bring a self-designed DSP chip into TVM. The chip is based on SIMD and VLIW, and works alongside an ARM CPU that is capable of running Linux. For the backend, we have an IR implementation and a corresponding tool that generates machine code from the IR (not built on LLVM). A runtime library has also been developed for dispatching tasks.
My plan: After some investigation (the discussions in the community were helpful), I found that BYOC is not suitable for my situation, because our DSP does not have a high-performance op compiler and I want to use the auto-scheduler component in TVM. So I think I need to bring our DSP into TVM the same way a GPU is integrated.
The entire project mainly contains three parts:
Bring our compiler into TVM. To do this, we need to translate the compiled TIR into our IR representation, and we are reading the source code (especially codegen_llvm.cc) to understand how to transform TIR into another IR. We have the fewest problems in this part.
Optimize both function-level and op-level passes. We are now analyzing what these passes do and trying to figure out whether each pass is applicable to our DSP. As a first step, we intend to keep only the passes necessary to transform Relay into TIR (to get through the flow from DL models to machine code as soon as possible, combined with our task 1), and then add DSP-specific optimization passes one by one. For this task, a lite pass sequence that does little optimization would be very helpful.
The runtime part. Once we have finished the first two parts, the last thing left is the runtime system. We know we need to implement the device API interface and the runtime module, just as the GPU backend does (the GPU CUDA runtime). We are studying the code (the multi-threading part is a little confusing), and more detailed documentation would be welcome.
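To make the runtime part more concrete, here is a rough sketch of the responsibilities a device API layer usually covers: allocating device memory, copying data in and out, and launching compiled work. It is written in plain Python against a simulated device. The class and method names are hypothetical and only loosely echo the kind of operations TVM's C++ device API exposes; a real port would implement the C++ interface and call the vendor driver.

```python
from typing import Callable, Dict


class FakeDspDeviceAPI:
    """Simulated device: 'device memory' is a dict of handle -> bytearray."""

    def __init__(self) -> None:
        self._buffers: Dict[int, bytearray] = {}
        self._next_handle = 1

    def alloc_data_space(self, nbytes: int) -> int:
        # A real implementation would call the driver's allocator here.
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(nbytes)
        return handle

    def free_data_space(self, handle: int) -> None:
        del self._buffers[handle]

    def copy_to_device(self, handle: int, data: bytes) -> None:
        buf = self._buffers[handle]
        if len(data) > len(buf):
            raise ValueError("source is larger than the device buffer")
        buf[: len(data)] = data

    def copy_from_device(self, handle: int, nbytes: int) -> bytes:
        return bytes(self._buffers[handle][:nbytes])

    def launch(self, kernel: Callable[..., None], *handles: int) -> None:
        # A real runtime module would pass the compiled binary to the driver;
        # here a Python callable stands in for the kernel.
        kernel(*(self._buffers[h] for h in handles))


def add_one(buf: bytearray) -> None:
    """Stand-in 'kernel': increment every byte, wrapping at 256."""
    for i in range(len(buf)):
        buf[i] = (buf[i] + 1) % 256


dev = FakeDspDeviceAPI()
h = dev.alloc_data_space(4)
dev.copy_to_device(h, bytes([1, 2, 3, 255]))
dev.launch(add_one, h)
out = dev.copy_from_device(h, 4)
dev.free_data_space(h)
print(list(out))
```

Keeping allocation, copies, and launch behind one small interface is what lets the rest of the stack stay device-agnostic; it is also the surface that would need to be agreed with the driver team.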
In summary, apart from BYOC, I haven’t found a detailed document on how to bring a new device into TVM the way a GPU is integrated. I think what we (device vendors) need most from such a document is a clear set of steps, including what we need to implement in each part and where these parts live in TVM. I believe such a document would be of great benefit and would expand the TVM ecosystem by attracting more chip vendors.
@shiy10 thanks for the information, and apologies for the lack of documentation. That said, I think you are on the right track here.
With regard to the middle piece: this reminds me a bit of the Ethos-U work, where BYOC was also insufficient because it did not allow the use of compiler passes or automation. For that work, a BYOC-like flow was recently added which allows folks to customize the Relay-to-TIR transformation and the TIR-to-machine-code transformation while still allowing core compiler passes to operate on the TIR. In addition, the standard Relay-to-TIR pass was extracted into TECompiler. You may have already seen this, but I wanted to point it out here in case you haven’t.
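The flow described above, where a target swaps in its own Relay-to-TIR and TIR-to-machine-code stages while shared passes still run in between, can be pictured roughly as follows. This is plain Python with strings standing in for IR; every function name is hypothetical and none of it is TVM's actual API.

```python
from typing import Callable, List

Pass = Callable[[str], str]


def default_relay_to_tir(relay: str) -> str:
    """Stand-in for the standard lowering stage."""
    return f"tir({relay})"


def default_codegen(tir: str) -> str:
    """Stand-in for the standard code generator."""
    return f"llvm_obj({tir})"


def simplify(tir: str) -> str:
    """Stand-in for a shared core pass that operates on the lowered IR."""
    return f"simplified({tir})"


def build(
    relay: str,
    core_passes: List[Pass],
    relay_to_tir: Pass = default_relay_to_tir,
    codegen: Pass = default_codegen,
) -> str:
    """Lower, run the shared passes in order, then generate code."""
    tir = relay_to_tir(relay)
    for p in core_passes:
        tir = p(tir)
    return codegen(tir)


# Hypothetical DSP-specific hooks replacing the first and last stages.
def dsp_relay_to_tir(relay: str) -> str:
    return f"dsp_tir({relay})"


def dsp_codegen(tir: str) -> str:
    return f"dsp_bin({tir})"


artifact = build("conv2d", [simplify], relay_to_tir=dsp_relay_to_tir, codegen=dsp_codegen)
lite_artifact = build("conv2d", [], relay_to_tir=dsp_relay_to_tir, codegen=dsp_codegen)
print(artifact)
print(lite_artifact)
```

Passing an empty pass list corresponds to the "lite pass sequence" idea from earlier in the thread: get one model through end to end first, then add shared or DSP-specific passes one at a time.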