Coordination of RISC-V Integration in TVM

A few words about myself

I am a PhD student and have been working with TVM for over a year. My main field of research is TinyML and RISC-V based microcontrollers. Thus, I work on supporting bare-metal 32-bit RISC-V targets (actual hardware as well as simulators) using the MicroTVM platform.

Background / Motivation

Over the past few years, there has been a lot of interest in supporting RISC-V hardware in the TVM compiler suite. While some attempts have been made in the past, none of them has ended up upstream in the TVM repository. As several teams/individuals seem to be working in the same direction at the moment, it would make sense to align these efforts to reduce the risk of duplicate work and to agree on some relevant design decisions beforehand.

What is currently supported?

While RISC-V is not yet officially supported by TVM, it can already be targeted using the default LLVM-based flow or the MicroTVM APIs (via RISC-V QEMU running on the Zephyr platform). However, the generated code (which falls back to default or ARM schedules) is not yet optimized for the RISC-V ISA, so performance will not be optimal.
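For illustration, here is a minimal sketch of this default LLVM-based flow. It assumes TVM was built against an LLVM with the RISC-V backend enabled; the triple/CPU/attribute values are illustrative placeholders:

```python
# Minimal sketch of the default LLVM-based flow for RISC-V.
# Assumes TVM was built against an LLVM with the RISC-V backend enabled;
# the triple/CPU/attribute values below are illustrative placeholders.
import tvm
from tvm import relay

# Any small Relay module works here; this one is a single dense layer.
data = relay.var("data", shape=(1, 64), dtype="float32")
weight = relay.var("weight", shape=(32, 64), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.dense(data, weight))

target = tvm.target.Target(
    "llvm -mtriple=riscv64-unknown-linux-gnu -mcpu=generic-rv64 -mattr=+m,+a,+f,+d,+c"
)
with tvm.transform.PassContext(opt_level=3):
    # Falls back to generic/ARM schedules today, hence the suboptimal code.
    lib = relay.build(mod, target=target)
```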

What is required?

In my opinion, the following baseline tasks are required:

  • Integrate support for RISC-V instruction set simulation (ISS) in TVM's CI scripts and unit tests.
  • Decide how to detect the available RISC-V ISA extensions. Align this with the work done for ARM Cortex-M MCUs?
  • Add RISC-V specific schedules (default, packed (sub-word SIMD), and vector (super-word SIMD)) to TVM, either starting from generic schedules or basing them on the current arm_cpu schedules. (See tracking issue: https://github.com/apache/tvm/issues/10141)
  • We will also eventually need type legalizations, alter_op_layout, and the operator strategies (a sketch of a strategy registration follows this list).
  • Generate AutoTVM logs for relevant models on RISC-V hardware and add them to the TopHub (https://github.com/tlc-pack/tophub) repository. This should help to get better default performance for some use cases.
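To make the schedule/strategy items above more concrete, here is a hypothetical sketch of how a RISC-V specific operator strategy could be registered, mirroring the existing arm_cpu strategies. The "riscv_cpu" target key and the fallback schedule are assumptions; no such code exists in TVM yet:

```python
# Hypothetical sketch: a RISC-V specific conv2d strategy, modeled on
# python/tvm/relay/op/strategy/arm_cpu.py. The "riscv_cpu" device key
# and any RISC-V specific schedules do not exist in TVM yet.
from tvm import relay, topi
from tvm.relay.op.strategy.generic import (
    conv2d_strategy,
    wrap_compute_conv2d,
    wrap_topi_schedule,
)


@conv2d_strategy.register("riscv_cpu")
def conv2d_strategy_riscv_cpu(attrs, inputs, out_type, target):
    strategy = relay.op.OpStrategy()
    # Fall back to the generic NCHW compute until dedicated packed (RVP)
    # and vector (RVV) schedules are written.
    strategy.add_implementation(
        wrap_compute_conv2d(topi.nn.conv2d_nchw),
        wrap_topi_schedule(topi.generic.schedule_conv2d_nchw),
        name="conv2d_nchw.riscv_cpu",
    )
    return strategy
```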

Past & Ongoing work

As it is quite hard to follow all the attempts to integrate RISC-V related features into TVM, I've tried to compile a list of efforts in that direction. Please let me know if I missed something relevant.


CC @areusch @masahi @comaniac @Julien @Dream-math @Tonylyc @UCASHurui

Thanks @PhilippvK for this summary! I agree it would be quite compelling to expand TVM's capabilities to target the RISC-V ISA.

I think this might be one of the first steps we need to pursue in order to make it easier for folks to upstream their work. A minimal demo of this might look like the following:

  1. A docker/install/ubuntu_install_spike.sh script to install SPIKE (or another RISC-V simulator of choice) into our CI Docker containers.
  2. Add SPIKE to a new ci_riscv container by creating a new docker/Dockerfile.ci_riscv. Ensure the compiler in this container supports the RISC-V extensions we'd like to target.
  3. A demonstration Project API implementation which can be used to run simple tests on SPIKE. It should be possible today to just create a Project API implementation and re-target a simple add test, e.g. test_aot_executor, at the simulator (e.g. by just changing the path to the template project in _make_session); a sketch follows this list.
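A minimal sketch of step 3, assuming a hypothetical SPIKE template project under apps/microtvm/spike; the template path and project options are assumptions, while the tvm.micro calls are the existing Project API surface:

```python
# Sketch: re-targeting a microTVM test at a hypothetical SPIKE template
# project by changing the template path used in _make_session.
import pathlib
import tvm.micro


def _make_session(mod):
    # Point at the (hypothetical) SPIKE Project API template instead of
    # the default crt/host template.
    template_project_dir = pathlib.Path("apps/microtvm/spike")
    project = tvm.micro.generate_project(
        template_project_dir,
        mod,  # module built with a microTVM-compatible target, e.g. "c"
        pathlib.Path("build/spike-project"),
        {"verbose": False},  # project options are template-specific
    )
    project.build()
    project.flash()
    return tvm.micro.Session(project.transport())
```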

Then we can more easily start tackling the question of implementing schedules or BYOC flows for specific CPUs. What are your thoughts here?

The main issue is that Spike does not provide any prebuilt binaries. Would it be a big issue to compile the simulator from source inside the CI? Regarding other simulators: OVPsimPlus is pretty good; however, you have to sign up on their homepage to get a download link.

Let me introduce the main issues here:

  • In the long term we especially want to support the P extension (packed, sub-word SIMD) and the V extension (vector, super-word SIMD)

  • While the spec of the V extension has been frozen since the end of last year, the P extension is still evolving

  • Toolchain support:

    • GCC: nothing has made it into the main branch yet; there are separate WIP branches/PRs working on those extensions. (Both are usable, but not at the same time, and of course they need to be compiled from source.)

    • LLVM: support for the latest vector extension (RVV) is available in LLVM 14, while the integration of the P extension (RVP) is still a work in progress. Thus we cannot use LLVM to build programs with RVP instructions. Furthermore, for linking we always need a GCC toolchain, as libc support seems to be missing in RISC-V LLVM.

  • We have to decide how we want to solve these issues. There are a few approaches:

    • Ignore LLVM for now and just use GCC (two separate builds, for RVP and RVV, would be required here). This would probably be good enough for MicroTVM targets, which use GCC most of the time anyway. LLVM support would mainly be interesting for the vectorization feature, which doesn't work anyway.
    • Stick with LLVM, which at least has stable RVV support, but postpone integrating RVP into TVM until it is supported by LLVM. (We would still need a basic version of the GCC RISC-V toolchain for linking; however, this could just be downloaded in the CI.)
    • Mixed approach: use LLVM for RVV and GCC for RVP. This way we would only need a single RISC-V GCC build (see the sketch after this list).
  • @areusch What are your thoughts on the build process for the toolchain? It might be quite time-consuming when building the Docker images. Would it make more sense to host a prebuilt version of the toolchains somewhere and download it in the CI?
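To illustrate the mixed approach: TVM could emit RVV code via LLVM while a RISC-V GCC does the final link. A sketch, where the toolchain prefix and feature flags are assumptions that depend on the locally installed toolchain:

```python
# Sketch of the mixed approach: LLVM codegen with +v for RVV, while GCC is
# used only for the final link (libc support is missing in RISC-V LLVM).
# The toolchain prefix and feature flags are assumptions.
import tvm
from tvm import relay
from tvm.contrib import cc

x = relay.var("x", shape=(1, 16), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.relu(x))

target = tvm.target.Target(
    "llvm -mtriple=riscv64-unknown-linux-gnu -mcpu=generic-rv64 -mattr=+m,+a,+f,+d,+c,+v"
)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target)

# Link with the RISC-V GCC cross toolchain.
lib.export_library(
    "model.so",
    fcompile=cc.cross_compiler("riscv64-unknown-linux-gnu-g++"),
)
```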

Running Spike through the MicroTVM Project API was easier than I expected. I will test it for a few more days and can open a PR for this in the future.

Update: We should probably also align with the work being done here by @alter-xp, as it will also likely require a RISC-V GCC toolchain variant with vector support (unfortunately, the C906 only supports an older version of the spec).


Interesting work! The good news is that our GCC (if the Chinese page is difficult, you can download it here) supports both RVP and RVV. However, only RVP version 0.9.3 and RVV versions 0.7.1 and 1.0 are supported. If we use the old version of RVP and then switch to the latest version, the workload is acceptable. At the same time, if we want to use GCC, we may need to generate intrinsics instead of LLVM IR. LLVM also supports intrinsic compilation, which brings convenience to our future work.

Hi @PhilippvK,

Thanks for your initiative here and thanks for referencing my previous forum posts.

I think gathering efforts here makes a lot of sense; however, I should mention that now that things are clearer to me, I don't think the RISC-V aspect of those previous posts is that important in my work. We do have a RISC-V core in our own SoC (the one from pulpissimo), which acts as a host for driving the accelerators. However, the tiny core that comes with pulpissimo was never optimized or intended for any heavy lifting with regard to computation (we try to offload as many operations as possible to the accelerators), and under my current plan I don't think we will ever optimize the calculations that are done on this tiny core itself. GAP8 might be interesting in this respect, since it uses 8 RISC-V cores with specialized ISA extensions for NN workloads in addition to the tiny core. However, in the end I've never deployed anything on GAP8, and I'm not planning to do so.

I think my biggest issue was getting TVM to emit standalone C code that could be compiled for microcontrollers (since we use the GCC compiler that ships with pulpissimo), but most of those problems have been resolved by the work of the uTVM folks over the past year, and our current work takes a lot of inspiration from the BYOC Ethos-U/CMSIS-NN work of the people from ARM and Linaro.

I don't see us upstreaming any of our code anytime soon (you're better off looking into the ARM stuff), but if anyone needs pointers or wants to hear about our experiences, I'd be glad to reply!

Best regards!

That sounds like a good approach to get started. I have some technical questions regarding your T-Head-Semi GCC toolchain. Would it be possible to contact you, e.g. via mail, to address those?

Hi @PhilippvK, you can contact us by email at xp56@linux.alibaba.com. Any questions about the toolchain are welcome.

If you guys prefer a higher-bandwidth forum for this, we could add it to a TVM Community Meeting agenda. We meet just about every week, and the main requirement for adding things to the agenda is that there is a thread where notes can be taken. This thread is certainly enough.

Definitely not, as long as you build Spike from within Dockerfile.ci_riscv. We already build things from source when we build Docker containers, and we just use those pre-built containers in the CI, so you wouldn't see this in pre-merge CI runtime. How long do you think this would take?

Glad to hear this; please tag me and I'll review the PR as I have time (I am a bit behind right now :confused: ).

@alter-xp: what if we moved all the RISC-V related toolchains to a separate CI container, ci_riscv?

@areusch If ci_qemu's workload is very heavy, it is a good idea for us to separate out ci_riscv. With the increasing amount of RISC-V content, I can't think of any reason not to do so. And if the resources are sufficient, we can execute ci_riscv in parallel.

@PhilippvK Great work! I looked into this approach last year, but at the time GCC and LLVM did not yet officially support the RISC-V V extension (if I remember correctly). I have postponed the project since then due to my limited resources and ability. I would definitely love to follow and contribute to this project. I think the toolchain provided by T-Head, mentioned by @alter-xp, is a good starting point.


@alter-xp @areusch

Then I will hold off on integrating the Spike simulator target until the Dockerfile for RISC-V is merged (https://github.com/apache/tvm/pull/12230).


As long as we do not have a complete GCC build in the CI, it should be fine. Compiling Spike should only take a few minutes on a quad-core machine.


What we still have to agree on is which types of RISC-V processors we would like to consider for now. In my opinion, it would make sense to support at least one 32-bit MCU-like device (RV32GC) and one 64-bit device capable of running Linux (RV64GC). This decision is relevant because we have to build the proxy kernel (which does the semihosting) separately for each of those; a sketch of the two corresponding TVM targets follows below.
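For illustration, the two flavors expressed as TVM targets (a sketch; triples and CPU names are illustrative placeholders):

```python
# Sketch of the two processor flavors under discussion, expressed as TVM
# targets; triples and CPU names are illustrative placeholders.
import tvm

# Bare-metal MCU-class device (RV32GC): use TVM's C codegen; the actual ISA
# selection (-march=rv32gc -mabi=ilp32d) would live in the Project API's
# compiler flags rather than in the TVM target itself.
target_rv32 = tvm.target.Target("c")

# Linux-capable device (RV64GC): use the LLVM backend directly.
target_rv64 = tvm.target.Target(
    "llvm -mtriple=riscv64-unknown-linux-gnu -mcpu=generic-rv64 -mattr=+m,+a,+f,+d,+c"
)
```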

Great, I'll make an effort to sort out the CI rebuild mess this week. That should let this work proceed.

Regarding the proxy kernel: this looks a bit similar to the microTVM RPC server. While I'm happy to admit that the RPC server is a bit heavyweight and could be better tested, its advantage is that it implements a protocol TVM knows how to control. When creating a Project API server implementation, the ultimate goal is to provide a connection to a TVM RPC server running on a foreign target. Placing this server on-target allows us to fully test the user experience in CI, rather than just the compiled kernels; for example, we can test tvmc run. We were able to get this running with reasonable speed on the ARM Corstone-300 FVP, so it seems possible that we might be able to test this via SPIKE too (it could be too slow, or communicating with the firmware could also be slow, in which case we can stick with QEMU here too). Just a suggestion to consider here; it would be nice to minimize device runtime complexity!

@areusch Sorry, but I did not completely get your idea.

My current approach uses the MicroTVM Project API, similar to tvm/src/runtime/crt/host at main · apache/tvm · GitHub, to compile and run MicroTVM models, using stdin/stdout to "emulate" the serial communication. This works pretty well but limits the usage of the Spike simulator to MicroTVM workloads (which involves compiling a complete target software binary for every run, etc.); see the sketch below.
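For illustration, a sketch of how such a project is exercised end to end, reusing the project/lib names from the _make_session sketch earlier in this thread; the "serial" transport here is the simulator process's stdin/stdout:

```python
# Sketch: exercising the (hypothetical) Spike template project end to end.
# `lib` is a module built for a microTVM target and `project` a generated
# project, as in the _make_session sketch earlier in this thread; each run
# builds and flashes a complete firmware image before talking to it.
import tvm.micro

with tvm.micro.Session(project.transport()) as session:
    graph_mod = tvm.micro.create_local_graph_executor(
        lib.get_graph_json(), session.get_system_lib(), session.device
    )
    graph_mod.set_input(**lib.get_params())
    graph_mod.run()
    out = graph_mod.get_output(0)
```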

Does your proposed solution go in the same direction, or do you want to accomplish something more powerful, say by running an OS inside the Spike simulator to host a real C++ TVM RPC server (not the MicroTVM bare-metal one), which could then directly be used as an RPC target device without the need for the Project API in between?

Could you please give me a hint on where to find this implemented for the Corstone-300 FVP/QEMU?

@PhilippvK No, that's about right. Just since you mentioned the proxy kernel: I'm not sure if that means you'd use the proxy kernel to launch workloads on the sim? The microTVM RPC server could theoretically serve the same purpose (i.e. be a bit of firmware running on the sim which TVM can talk to directly).

Ah yeah, it could be cast as running a real OS in SPIKE, although the microTVM RPC server should run on bare metal. I'm just not sure whether the proxy kernel is an additional layer here or not (really, I think it would be easiest to take a look at your solution and see).

Apologies, it's not landed just yet, but it's here: https://github.com/apache/tvm/pull/12125

My implementation can be found here: tvm/apps/microtvm/spike at feature_microtvm_spike · PhilippvK/tvm · GitHub

It's pretty much a copy of the MicroTVM host CRT template.

As I have a deadline at the end of the week, I was not yet able to open a pull request for it.

I also still have to look into the integration of Spike in the RISC-V Docker images.


@PhilippvK No worries! I was able to land the RISC-V CI pipeline upstream now. I believe the Dockerfile still needs to get SPIKE added, but after that you should be able to merge your implementation!

Also cc @alter-xp: the RISC-V Docker image does include the CSI-NN install script, so feel free to continue by leveraging that image.

Thanks @areusch, I will continue to integrate CSI-NN2.

Here is my follow-up on the RISC-V Docker image:
