Hi All,
Background
Recently, RISC-V has been developing rapidly. In particular, hardware extensions for accelerating AI applications (such as the Vector and Matrix extensions) are gradually being commercialized, in products such as the D1 (C906, Vector extension), the Lichee Pi (TH1520, C910, Vector extension), and the C907 (Matrix extension, to be released). We want to add support for the RISC-V Vector/Matrix extensions in TVM. Since their programming models differ from those of ARM's NEON and Intel's AMX that TVM already supports, and since we would like to contribute this code to the TVM community in the future, we are seeking suggestions from the community on the implementation.
Introduction
Here is a brief introduction to the RISC-V vector/matrix extensions.
RISC-V Vector Extension
Similar to ARM's SVE extension, it is a variable-length vector computing instruction set. The variable length shows up in two ways: first, the vector register length is not fixed at hardware-design time, yet the same instructions remain compatible with different vector lengths; second, in actual computation the number of elements that participate in an operation is not fixed either, and can be set through `vl`. Here we use vector addition, `C[0:n] = A[0:n] + B[0:n]`, as an example of how to implement this with the RISC-V Vector intrinsics.
```c
#include <riscv_vector.h>

int rvv_add_float32_m1(float *cc, float *aa, float *bb, int n) {
  while (n > 0) {
    // vl = min(n, VLMAX): how many elements this iteration processes.
    int vl = vsetvl_e32m1(n);
    vfloat32m1_t _in0 = vle32_v_f32m1(aa, vl);  // load vl elements of A
    vfloat32m1_t _in1 = vle32_v_f32m1(bb, vl);  // load vl elements of B
    vfloat32m1_t _sum = vfadd_vv_f32m1(_in0, _in1, vl);
    vse32_v_f32m1(cc, _sum, vl);                // store vl elements to C
    cc += vl;
    aa += vl;
    bb += vl;
    n -= vl;
  }
  return 0;
}
```
This programming model has the following characteristics:
- The bit width of the vector register is transparent to the intrinsic, and the user does not need to care about the actual vector width of the hardware. In other words, the same implementation can adapt to different hardware implementations;
- The actual number of elements to be processed (`vl`) is obtained through the `vsetvl` instruction, and `vl` is also passed to the remaining instructions to control how many elements are processed simultaneously by each vector operation.
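To make the `vl` mechanism concrete, here is a minimal plain-Python sketch of the strip-mining pattern the loop above follows; `VLMAX` stands for the hardware's per-register element capacity, and its value here is only illustrative:

```python
VLMAX = 4  # illustrative only: e.g. VLEN = 128 bits / 32-bit elements, LMUL = 1

def strip_mine_add(C, A, B, n):
    """Emulates the rvv_add_float32_m1 loop: each pass handles vl elements."""
    i = 0
    while n > 0:
        vl = min(n, VLMAX)        # what vsetvl_e32m1(n) returns
        for k in range(vl):       # one hardware vector instruction
            C[i + k] = A[i + k] + B[i + k]
        i += vl
        n -= vl
```

Because the last iteration simply gets a smaller `vl`, no scalar tail loop is needed; this is exactly the property that is hard to express in current TensorIR, as discussed below.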
RISC-V Matrix Extension
The Matrix extension is used to compute matrix-block (tile) multiplication and also adopts a variable-length design, where the M/N/K of the matrix tile can be configured. Below is a 4x4 multiply-accumulate example using the matrix intrinsics.
```c
#include <riscv_matrix.h>  // matrix intrinsics header in the T-Head toolchain

// Performs the 4x4 tile multiply-accumulate; sa/sb/sc are the row strides
// (in elements) of the three matrices.
int rvm_4x4_macc_fp32(float *cc, float *aa, float *bb, int sa, int sb, int sc) {
  mrow_t row = 4;
  mcol_t col = 4;
  long stride_a = sa * sizeof(float);
  long stride_b = sb * sizeof(float);
  long stride_c = sc * sizeof(float);
  mfloat32_t ma = __riscv_th_mld(aa, stride_a, row, col);
  // Assuming b is a constant, use msld to load
  mfloat32_t mb = __riscv_th_msld(bb, stride_b, row, col);
  mfloat32_t mc = __riscv_th_mld(cc, stride_c, row, col);
  mc = __riscv_th_fmmacc(mc, ma, mb, row, row, col);
  __riscv_th_mst(cc, stride_c, mc, row, col);
  return 0;
}
```
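For intuition, here is a plain-Python sketch of the tile update performed above, under the assumption that `fmmacc` computes `mc += ma @ mb` on 4x4 tiles (the exact operand layout of the real instruction, e.g. whether B is transposed, depends on the Matrix extension spec):

```python
def macc_4x4(cc, aa, bb, sa, sb, sc):
    """Emulates rvm_4x4_macc_fp32 on flat row-major buffers with row strides."""
    for i in range(4):
        for j in range(4):
            acc = cc[i * sc + j]
            for k in range(4):
                acc += aa[i * sa + k] * bb[k * sb + j]
            cc[i * sc + j] = acc
```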
Implementation
Possible Implementation Methods
Overall, whether for the Vector or the Matrix extension, there are two main ways to implement support in TVM:
- Extend `codegen_c` and directly generate C code with intrinsics;
- Extend `codegen_llvm`, map to LLVM IR intrinsics, and generate code directly through LLVM.
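For context, method 2) usually goes through `tensorize`: one registers a TensorIR tensor intrinsic whose implementation lowers to the target instructions. The sketch below shows the general shape, with a fixed 4-lane tile and a plain vectorized body standing in for the eventual LLVM intrinsic call; the names and shapes are illustrative assumptions, not an actual RVV mapping:

```python
import tvm
from tvm.script import tir as T
from tvm.tir import TensorIntrin

# Description: the computation pattern the intrinsic matches (a 4-wide add).
@T.prim_func
def vadd_desc(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (4,), "float32", offset_factor=1)
    B = T.match_buffer(b, (4,), "float32", offset_factor=1)
    C = T.match_buffer(c, (4,), "float32", offset_factor=1)
    with T.block("root"):
        T.reads(A[0:4], B[0:4])
        T.writes(C[0:4])
        for i in range(4):
            with T.block("add"):
                vi = T.axis.spatial(4, i)
                C[vi] = A[vi] + B[vi]

# Implementation: what the matched region is replaced with. A real RVV
# backend would emit an LLVM intrinsic call here instead of T.vectorized.
@T.prim_func
def vadd_impl(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (4,), "float32", offset_factor=1)
    B = T.match_buffer(b, (4,), "float32", offset_factor=1)
    C = T.match_buffer(c, (4,), "float32", offset_factor=1)
    with T.block("root"):
        T.reads(A[0:4], B[0:4])
        T.writes(C[0:4])
        for i in T.vectorized(4):
            C[i] = A[i] + B[i]

TensorIntrin.register("demo_vadd_4xf32", vadd_desc, vadd_impl)
```

Note that the fixed tile size (4) in both functions is exactly where the variable-length problem described below appears.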
Existing Problems
Common Problems
Regardless of which implementation method above is adopted, there are the following common problems:
- TVM currently cannot express variable-vector-length semantics. First, when the vector register width of the target hardware is itself variable, TVM needs semantics to represent that; second, in RISC-V the number of data elements actually processed by a SIMD instruction can vary at runtime, as controlled by `vl`, and current TVM cannot represent this in the generated TensorIR code. We know the community's SVE work addresses vector-length-agnostic code with predication, but that approach does not fit the RISC-V Vector extension because of `vl`.
- During scheduling, if the tensor's shape cannot be evenly divided by the maximum number of elements the SIMD register can hold, `T.where` causes vectorize or tensorize to become ineffective, for example:
```python
@T.prim_func
def myadd(A: T.Buffer((1024, 1025), "float32"), B: T.Buffer((1024, 1025), "float32"), C: T.Buffer((1024, 1025), "float32")):
    T.func_attr({"global_symbol": "main", "tir.noalias": T.bool(True)})
    # with T.block("root"):
    for i, j_0 in T.grid(1024, 257):
        for j_1 in T.vectorized(4):
            with T.block("C"):
                v_i = T.axis.spatial(1024, i)
                v_j = T.axis.spatial(1025, j_0 * 4 + j_1)
                T.where(j_0 * 4 + j_1 < 1025)
                T.reads(A[v_i, v_j], B[v_i, v_j])
                T.writes(C[v_i, v_j])
                C[v_i, v_j] = A[v_i, v_j] + B[v_i, v_j]
```
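For reference, this printed TensorIR can be reproduced with a schedule along the following lines (a sketch using TVM's TIR schedule API; since 1025 is not divisible by 4, the split introduces the `T.where` predicate):

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def myadd(A: T.Buffer((1024, 1025), "float32"),
          B: T.Buffer((1024, 1025), "float32"),
          C: T.Buffer((1024, 1025), "float32")):
    T.func_attr({"global_symbol": "main", "tir.noalias": T.bool(True)})
    for i, j in T.grid(1024, 1025):
        with T.block("C"):
            v_i, v_j = T.axis.remap("SS", [i, j])
            C[v_i, v_j] = A[v_i, v_j] + B[v_i, v_j]

sch = tvm.tir.Schedule(myadd)
i, j = sch.get_loops(sch.get_block("C"))
j_0, j_1 = sch.split(j, factors=[None, 4])  # 257 * 4 = 1028 > 1025 -> predicate
sch.vectorize(j_1)
print(sch.mod)  # prints the predicated TensorIR shown above
```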
If `sch.pad_einsum` is used to pad the tensors to a more suitable shape in advance before vectorizing/tensorizing, it adds extra copies and memory allocations for the input and output buffers, which hurts actual performance (see the sketch after this list). If `sch.loop_partition` is used to divide the loop into an evenly divisible main loop and a loop tail, there are two problems:
- The tail data can only be processed as scalars and cannot take advantage of the vector extension's ability to process variable-length data, which also costs performance;
- For multi-loop operations like matrix multiplication, loop_partition cannot solve the problem of outer loops that do not divide evenly.
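As an illustration of the first workaround, a `sch.pad_einsum` call on the schedule above might look like the following (a sketch; we assume the padding argument pads each block iteration to a multiple of the given factor, so `j` goes from 1025 to 1028):

```python
sch = tvm.tir.Schedule(myadd)   # reuses myadd from the sketch above
block = sch.get_block("C")
sch.pad_einsum(block, [1, 4])   # pad j: 1025 -> 1028, now divisible by 4
i, j = sch.get_loops(block)
j_0, j_1 = sch.split(j, factors=[None, 4])
sch.vectorize(j_1)              # no T.where, but extra pad buffers and copies
```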
Special Problems
For implementation method 1) above, the generated code is highly customized, which is not conducive to extension and reuse.
For implementation method 2) above, which calls LLVM intrinsics through tensorize, the current LLVM support for the RISC-V Matrix extension is not good enough.
Conclusion
This post mainly seeks the community's suggestions on how we should implement support for the RISC-V Vector/Matrix extensions in TVM. Any suggestions are welcome!