Hi All,
Background
Recently, RISC-V has been developing rapidly. In particular, hardware extensions for accelerating AI applications (such as the Vector and Matrix extensions) are gradually being commercialized, in products such as the D1 (C906, Vector extension), the Lichee Pi (TH1520, C910, Vector extension), and the C907 (Matrix extension, to be released). We want to add support for the RISC-V Vector/Matrix extensions in TVM. Since their programming models differ from those of ARM's NEON and Intel's AMX that TVM already supports, and since we would like to contribute this code to the TVM community in the future, we are seeking suggestions from the community on the implementation.
Introduction
Here is a brief introduction to the RISC-V vector/matrix extensions.
RISC-V Vector Extension
Similar to ARM's SVE extension, it is a variable-length vector computing instruction set. The variable length shows up in two ways: first, the vector register length is not fixed at hardware-design time, yet the same instructions remain compatible with different vector lengths; second, in actual computation the number of elements that participate in an operation is not fixed either, and can be set through `vl`. Here we use vector addition, `C[0:n] = A[0:n] + B[0:n]`, as an example of how to implement this with the RISC-V Vector intrinsics.
```c
#include <riscv_vector.h>

int rvv_add_float32_m1(float *cc, float *aa, float *bb, int n) {
  while (n > 0) {
    // vl = min(n, VLMAX): how many elements this iteration processes.
    int vl = vsetvl_e32m1(n);
    vfloat32m1_t _in0 = vle32_v_f32m1(aa, vl);  // load vl elements of A
    vfloat32m1_t _in1 = vle32_v_f32m1(bb, vl);  // load vl elements of B
    vfloat32m1_t _sum = vfadd_vv_f32m1(_in0, _in1, vl);
    vse32_v_f32m1(cc, _sum, vl);                // store vl elements to C
    cc += vl;
    aa += vl;
    bb += vl;
    n -= vl;
  }
  return 0;
}
```
This programming model has the following characteristics:
- The bit width of the vector register is transparent to the intrinsic, and the user does not need to care about the actual vector width of the hardware. In other words, the same implementation can adapt to different hardware implementations;
- The actual number of elements to be processed (`vl`) is obtained through the `vsetvl` instruction, and `vl` is also passed to the remaining instructions to control how many elements are processed simultaneously by each vector operation.
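To make the `vl` mechanism concrete, here is a minimal plain-Python sketch of the strip-mining pattern the loop above follows; `VLMAX` stands for the hardware's per-register element capacity, and its value here is only illustrative:

```python
VLMAX = 4  # illustrative only: e.g. VLEN = 128 bits / 32-bit elements, LMUL = 1

def strip_mine_add(C, A, B, n):
    """Emulates the rvv_add_float32_m1 loop: each pass handles vl elements."""
    i = 0
    while n > 0:
        vl = min(n, VLMAX)        # what vsetvl_e32m1(n) returns
        for k in range(vl):       # one hardware vector instruction
            C[i + k] = A[i + k] + B[i + k]
        i += vl
        n -= vl
```

Because the last iteration simply gets a smaller `vl`, no scalar tail loop is needed; this is exactly the property that is hard to express in current TensorIR, as discussed below.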
RISC-V Matrix Extension
The Matrix extension is used to compute matrix-block (tile) multiplication and also adopts a variable-length design, where the M/N/K of the matrix tile can be configured. Below is a 4x4 multiply-accumulate example using the matrix intrinsics.
```c
#include <riscv_matrix.h>  // matrix intrinsics header in the T-Head toolchain

// Performs the 4x4 tile multiply-accumulate; sa/sb/sc are the row strides
// (in elements) of the three matrices.
int rvm_4x4_macc_fp32(float *cc, float *aa, float *bb, int sa, int sb, int sc) {
  mrow_t row = 4;
  mcol_t col = 4;
  long stride_a = sa * sizeof(float);
  long stride_b = sb * sizeof(float);
  long stride_c = sc * sizeof(float);
  mfloat32_t ma = __riscv_th_mld(aa, stride_a, row, col);
  // Assuming b is a constant, use msld to load
  mfloat32_t mb = __riscv_th_msld(bb, stride_b, row, col);
  mfloat32_t mc = __riscv_th_mld(cc, stride_c, row, col);
  mc = __riscv_th_fmmacc(mc, ma, mb, row, row, col);
  __riscv_th_mst(cc, stride_c, mc, row, col);
  return 0;
}
```
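For intuition, here is a plain-Python sketch of the tile update performed above, under the assumption that `fmmacc` computes `mc += ma @ mb` on 4x4 tiles (the exact operand layout of the real instruction, e.g. whether B is transposed, depends on the Matrix extension spec):

```python
def macc_4x4(cc, aa, bb, sa, sb, sc):
    """Emulates rvm_4x4_macc_fp32 on flat row-major buffers with row strides."""
    for i in range(4):
        for j in range(4):
            acc = cc[i * sc + j]
            for k in range(4):
                acc += aa[i * sa + k] * bb[k * sb + j]
            cc[i * sc + j] = acc
```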
Implementation
Possible Implementation Methods
Overall, whether for the Vector or the Matrix extension, there are two main ways to implement support in TVM:
- Extend `codegen_c` and directly generate C code with intrinsics;
- Extend `codegen_llvm`, map to LLVM IR intrinsics, and generate code directly through LLVM.
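For context, method 2) usually goes through `tensorize`: one registers a TensorIR tensor intrinsic whose implementation lowers to the target instructions. The sketch below shows the general shape, with a fixed 4-lane tile and a plain vectorized body standing in for the eventual LLVM intrinsic call; the names and shapes are illustrative assumptions, not an actual RVV mapping:

```python
import tvm
from tvm.script import tir as T
from tvm.tir import TensorIntrin

# Description: the computation pattern the intrinsic matches (a 4-wide add).
@T.prim_func
def vadd_desc(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (4,), "float32", offset_factor=1)
    B = T.match_buffer(b, (4,), "float32", offset_factor=1)
    C = T.match_buffer(c, (4,), "float32", offset_factor=1)
    with T.block("root"):
        T.reads(A[0:4], B[0:4])
        T.writes(C[0:4])
        for i in range(4):
            with T.block("add"):
                vi = T.axis.spatial(4, i)
                C[vi] = A[vi] + B[vi]

# Implementation: what the matched region is replaced with. A real RVV
# backend would emit an LLVM intrinsic call here instead of T.vectorized.
@T.prim_func
def vadd_impl(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (4,), "float32", offset_factor=1)
    B = T.match_buffer(b, (4,), "float32", offset_factor=1)
    C = T.match_buffer(c, (4,), "float32", offset_factor=1)
    with T.block("root"):
        T.reads(A[0:4], B[0:4])
        T.writes(C[0:4])
        for i in T.vectorized(4):
            C[i] = A[i] + B[i]

TensorIntrin.register("demo_vadd_4xf32", vadd_desc, vadd_impl)
```

Note that the fixed tile size (4) in both functions is exactly where the variable-length problem described below appears.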
Existing Problems
Common Problems
Regardless of which implementation method above is adopted, there are the following common problems:
- TVM currently cannot express variable-vector-length semantics. First, when the vector register width of the target hardware is itself variable, TVM needs semantics to represent that; second, in RISC-V the number of data elements actually processed by a SIMD instruction can vary at runtime, as controlled by `vl`, and current TVM cannot represent this in the generated TensorIR code. We know the community's SVE work addresses vector-length-agnostic code with predication, but that approach does not fit the RISC-V Vector extension because of `vl`.
- During scheduling, if the tensor's shape cannot be evenly divided by the maximum number of elements the SIMD register can hold, `T.where` causes vectorize or tensorize to become ineffective, for example:
```python
@T.prim_func
def myadd(A: T.Buffer((1024, 1025), "float32"), B: T.Buffer((1024, 1025), "float32"), C: T.Buffer((1024, 1025), "float32")):
    T.func_attr({"global_symbol": "main", "tir.noalias": T.bool(True)})
    # with T.block("root"):
    for i, j_0 in T.grid(1024, 257):
        for j_1 in T.vectorized(4):
            with T.block("C"):
                v_i = T.axis.spatial(1024, i)
                v_j = T.axis.spatial(1025, j_0 * 4 + j_1)
                T.where(j_0 * 4 + j_1 < 1025)
                T.reads(A[v_i, v_j], B[v_i, v_j])
                T.writes(C[v_i, v_j])
                C[v_i, v_j] = A[v_i, v_j] + B[v_i, v_j]
```
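For reference, this printed TensorIR can be reproduced with a schedule along the following lines (a sketch using TVM's TIR schedule API; since 1025 is not divisible by 4, the split introduces the `T.where` predicate):

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def myadd(A: T.Buffer((1024, 1025), "float32"),
          B: T.Buffer((1024, 1025), "float32"),
          C: T.Buffer((1024, 1025), "float32")):
    T.func_attr({"global_symbol": "main", "tir.noalias": T.bool(True)})
    for i, j in T.grid(1024, 1025):
        with T.block("C"):
            v_i, v_j = T.axis.remap("SS", [i, j])
            C[v_i, v_j] = A[v_i, v_j] + B[v_i, v_j]

sch = tvm.tir.Schedule(myadd)
i, j = sch.get_loops(sch.get_block("C"))
j_0, j_1 = sch.split(j, factors=[None, 4])  # 257 * 4 = 1028 > 1025 -> predicate
sch.vectorize(j_1)
print(sch.mod)  # prints the predicated TensorIR shown above
```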
If `sch.pad_einsum` is used to pad the tensors to a more suitable shape in advance before vectorizing/tensorizing, it adds extra copies and memory allocations for the input and output buffers, which hurts actual performance (see the sketch after this list). If `sch.loop_partition` is used to divide the loop into an evenly divisible main loop and a loop tail, there are two problems:
- The tail data can only be processed as scalars and cannot take advantage of the vector extension's ability to process variable-length data, which also costs performance;
- For multi-loop operations like matrix multiplication, loop_partition cannot solve the problem of outer loops that do not divide evenly.
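As an illustration of the first workaround, a `sch.pad_einsum` call on the schedule above might look like the following (a sketch; we assume the padding argument pads each block iteration to a multiple of the given factor, so `j` goes from 1025 to 1028):

```python
sch = tvm.tir.Schedule(myadd)   # reuses myadd from the sketch above
block = sch.get_block("C")
sch.pad_einsum(block, [1, 4])   # pad j: 1025 -> 1028, now divisible by 4
i, j = sch.get_loops(block)
j_0, j_1 = sch.split(j, factors=[None, 4])
sch.vectorize(j_1)              # no T.where, but extra pad buffers and copies
```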
Special Problems
For implementation method 1) above, the generated code is highly customized, which is not conducive to extension and reuse.
For implementation method 2) above, which calls LLVM intrinsics through tensorize, the current LLVM support for the RISC-V Matrix extension is not good enough.
Conclusion
This post mainly seeks the community's suggestions on how we should implement support for the RISC-V Vector/Matrix extensions in TVM. Any suggestions are welcome!