[DISCUSS] Introduce RISC-V Vector/Matrix extension

Hi @zhupijuan_lkl

Permit me a few remarks, strictly w.r.t. the issue of RVV 0.7.1 not being supported by LLVM.

@tqchen @cbalint13 Based on discussions from the community and your suggestions, I plan to handle the RISC-V vector/matrix extensions as follows:

As an intro, let me restate the issues around RVV 0.7.1 (all the major ASIC hardware out there implements 0.7.1):

  • Currently the T-Head & Sophon ASICs expose the older RVV 0.7.1 spec.

  • LLVM does not support RVV 0.7.1, but only the 1.0.0 spec.

  • See LLVM's RVV version support (implicitly exposed via clang):

    $ rpm -q clang
    clang-18.1.0~rc4-2.fc41.x86_64
    
    $ clang --target=riscv64-unknown-elf -print-supported-extensions | grep "'V'"
    clang version 18.1.0 (Fedora 18.1.0~rc4-2.fc41)
        v 1.0       'V' (Vector Extension for Application Processors)
    
  • Another issue with the T-Head ASIC implementations (e.g. TH1520) is that vsetvli is expensive.

  1. For the vector extension, we will still perform scheduling for fixed-length vectors and use tensorize for general vector processing. To support the variable-length vector registers and operations specific to RISC-V vectors, we will convert vector expressions into load + op + store in the VectorizeLoop pass. The load/store operations will use a variable-length style, with the specific length passed through vl, i.e. a tir.Var. Finally, based on the existing LLVM codegen, we will implement an LLVM codegen for RISC-V to handle the special cases (codegen_riscv.cc).
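
To make the intended lowering shape concrete, here is a tiny, purely illustrative emulation in plain Python (my own toy, not TVM code and not part of the proposal): vl plays the role of the tir.Var handed to the variable-length load/store, and VLMAX stands in for whatever vsetvli grants on the target.

    # Toy emulation of the strip-mined, variable-length loop shape
    # that the modified VectorizeLoop pass would produce.
    VLMAX = 8  # assumed maximum number of lanes granted by the hardware

    def vadd(a, b):
        n = len(a)
        c = [0.0] * n
        i = 0
        while i < n:
            vl = min(VLMAX, n - i)      # what vsetvli would return
            va = a[i:i + vl]            # variable-length "vle"
            vb = b[i:i + vl]
            c[i:i + vl] = [x + y for x, y in zip(va, vb)]  # "vfadd" + "vse"
            i += vl                     # advance by the granted vector length
        return c

    print(vadd(list(range(10)), list(range(10))))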

We clearly will not be able to invoke vl.xxx LLVM-IR for the RVV 0.7.1 spec. To alleviate this, we can still emit RVV 0.7.1 LLVM-IR using ideas from this hardcoding llvm-ir generator.

Now, you mention that special cases (like RVV 0.7.1) are to be handled in codegen_riscv.cc, but they can also be handled at code emission time from TOPI's tensorize _impl(), where the init/load/store context can be captured even better.

A sketch of the advantage of adding it to the TOPI tensorizer _impl() part:
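
A minimal skeleton (following the shape of TVM's stock tensorize tutorial; the rvv_dot_* extern kernels are placeholder names I made up) is enough to show the point:

    import tvm
    from tvm import te

    def intrin_dot(l):
        a = te.placeholder((l,), dtype="float32", name="a")
        b = te.placeholder((l,), dtype="float32", name="b")
        k = te.reduce_axis((0, l), name="k")
        c = te.compute((1,), lambda _: te.sum(a[k] * b[k], axis=k), name="c")

        Ab = tvm.tir.decl_buffer(a.shape, a.dtype, name="A", offset_factor=1)
        Bb = tvm.tir.decl_buffer(b.shape, b.dtype, name="B", offset_factor=1)
        Cb = tvm.tir.decl_buffer(c.shape, c.dtype, name="C", offset_factor=1)

        def intrin_func(ins, outs):
            aa, bb = ins
            cc = outs[0]

            def emit(fname, *args):
                ib = tvm.tir.ir_builder.create()
                ib.emit(tvm.tir.call_extern("int32", fname, *args))
                return ib.get()

            # reduction init: can set up its own vl / vsetvli context once
            reset = emit("rvv_dot_reset", cc.access_ptr("w"))
            # body / reduce-update: variable-length loads + MAC + store
            update = emit("rvv_dot_update", cc.access_ptr("w"),
                          aa.access_ptr("r"), bb.access_ptr("r"), l)
            # the three steps stay distinct statements here, so each one can
            # carry its own vector-length context instead of re-emitting
            # vsetvli around every single instruction
            return update, reset, update

        return te.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb})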

I am not sure we can capture the distinction between these three steps (each requiring expensive vsetvli context switches) as elegantly at codegen_riscv.cc time as from the TOPI tensorizer.

@zhupijuan_lkl Q: How do you see this alternative compared to your codegen_riscv.cc proposal?

  2. For the matrix extension, considering that LLVM's support for matrix operations is still incomplete, I plan to adopt the following approach:
  • For algorithm scheduling, since the matrix extension mainly accelerates conv/gemm operations, tensor layout transformations and alignment are typically performed during the scheduling of these cases. Therefore, during layout transformation, we will perform padding to ensure that the tensor shapes meet the requirements for subsequent tiling, thereby addressing the issue of tail blocks.
  • For instruction generation, we will still use tensorize to perform computations on tiled blocks, but the tensorize intrinsics will be inserted directly as LLVM IR. Specifically, we will wrap the matrix extension intrinsics in a C implementation of a micro-kernel, then use Clang to compile it into LLVM IR, and finally insert this LLVM IR into the tensorization code.
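
On the padding / tail-block bullet above, a toy illustration of the round-up involved (my own sketch; the tile size 16 is just an assumed matrix-unit width):

    # Round a dimension up to a multiple of the tile size so that the
    # subsequent tiling never produces a tail block.
    def pad_to_tile(dim, tile):
        return ((dim + tile - 1) // tile) * tile

    print(pad_to_tile(50, 16))   # 50 -> 64 with a hypothetical 16-wide tile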

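A hedged sketch of the "C micro-kernel -> clang -> LLVM IR -> tensorize" flow, modeled on TVM's tensorize tutorial: the C body below is a plain stand-in for the real matrix-extension intrinsics, and the --target option is an assumption that depends on the toolchain.

    from tvm.contrib import clang, utils

    cc_code = """
    extern "C" int gemm_update(float *cc, float *aa, float *bb,
                               int m, int n, int k) {
      // stand-in body; the real micro-kernel would use the vendor
      // matrix-extension intrinsics here
      for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
          for (int p = 0; p < k; ++p)
            cc[i * n + j] += aa[i * k + p] * bb[j * k + p];
      return 0;
    }
    """

    def gemm_impl():
        temp = utils.tempdir()
        ll_path = temp.relpath("gemm_update.ll")
        # clang lowers the C micro-kernel to LLVM IR text ...
        return clang.create_llvm(
            cc_code, output=ll_path,
            options=["--target=riscv64-unknown-linux-gnu"])

    # ... which is then attached to the schedule so codegen links it in, e.g.:
    #   s[C].pragma(outer_axis, "import_llvm", gemm_impl())
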
The initiative for the matrix extension is very nice just as it is; my take is: let's move forward with it.

  • LLVM also has special upstream support for many kinds of T-Head extensions.
  • Thus, we could also look at calling these directly from LLVM-IR:
    $ clang --target=riscv64-unknown-elf -print-supported-extensions | grep xtheadvdot
        xtheadvdot   1.0    'xtheadvdot' (T-Head Vector Extensions for Dot)

Looking forward to more of your suggestions. Thanks!

If this is a draft that needs to be promoted, I put my +1 vote to go forward with your proposal as it is now, and I will try to help your efforts during PR review time on this topic.

Thanks again @zhupijuan_lkl for your efforts here!