[RFC] Accelerate quantized convolution through dot-product

Motivation

In recent RFCs we successfully boosted convolution performance on native Armv8-A architectures. When using Armv8.2-A and above ISAs, developers are provided with a richer set of instructions, among which the dot-product instruction udot (or sdot) can be particularly useful for Machine Learning applications (as a reference, see the Neoverse optimization guide).

Basic udot/sdot functioning

The instruction

udot v0.4s, v1.16b, v2.16b

subdivides the registers v1 and v2 into blocks of 4 uint8 elements and accumulates the dot-product of each pair of blocks into the corresponding 32-bit word of v0. You can see this operation depicted in the following picture:
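
To make the semantics concrete, here is a small scalar model of the instruction (illustrative Python only, not the actual hardware or TVM code; the function name is made up for this sketch):

def udot(v0, v1, v2):
    # v0: 4 uint32 accumulators; v1, v2: 16 uint8 lanes each.
    # Each 32-bit word of v0 accumulates the dot-product of the
    # corresponding 4-byte blocks of v1 and v2.
    return [v0[i] + sum(v1[4*i + j] * v2[4*i + j] for j in range(4))
            for i in range(4)]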

Another less known version of this instruction is the indexed dot-product:

udot v0.4s, v1.16b, v2.16b[0]

This instruction takes the first 4 uint8 elements of vector v2 and computes their dot-product with each group of 4 elements from vector v1, accumulating the results into v0. This is depicted in the following picture:

This last version is the one we will use throughout the remainder of this RFC.
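
Again as an illustrative scalar model (the name udot_indexed is made up for this sketch), the indexed form reuses a single 4-byte word of v2 against every 4-byte group of v1:

def udot_indexed(v0, v1, v2, lane):
    # v0: 4 uint32 accumulators; v1, v2: 16 uint8 lanes; lane in 0..3.
    # The lane-th 4-byte word of v2 is dotted with each 4-byte group of v1.
    word = v2[4*lane:4*lane + 4]
    return [v0[i] + sum(v1[4*i + j] * word[j] for j in range(4))
            for i in range(4)]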

Implementation strategy

We decided to add dot-product support through two intrinsics and to exploit those intrinsics through tensorization. Unlike the previous intrinsic for Armv8-A (which was written in inline assembly), we were able to write these entirely with TIR/LLVM instructions. The main difference is that, given two tiles tile_A and tile_B, the output tile_C produced with the dot-product is partial but correct. In the case of Armv8-A, instead, we needed some additional assembly magic (i.e., addp instructions) to produce the correct partial tile.

Strategy #1: 8x12 output tile, A interleaved and B transposed and interleaved

In this case the approach is very similar to the one described in the Armv8-A RFC.

Interleave A: We interleave (and pad if necessary) the rows of A in blocks of 8x4. This means that each tile will contain 4 consecutive elements of 8 rows of A.

Interleave and transpose B: We block-transpose B as in the Armv8-A RFC. In this case, though, we use blocks of 12x4. Each tile of the reshaped B will contain 4 consecutive elements of 12 columns of B.
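
For illustration only, the two layout transforms can be sketched in NumPy as follows (a sketch, not the actual TOPI code; it assumes M and K are already padded to multiples of 8 and 4, and N to a multiple of 12):

import numpy as np

M, K, N = 16, 8, 24
A = np.random.randint(0, 255, size=(M, K), dtype=np.uint8)
B = np.random.randint(0, 255, size=(K, N), dtype=np.uint8)

# A_interleaved[mo, ko] is an 8x4 tile: 4 consecutive K-elements of 8 rows of A
A_interleaved = A.reshape(M // 8, 8, K // 4, 4).transpose(0, 2, 1, 3)

# B_interleaved_t[no, ko] is a 12x4 tile: 4 consecutive K-elements of 12 columns of B
B_interleaved_t = B.T.reshape(N // 12, 12, K // 4, 4).transpose(0, 2, 1, 3)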

Computation through dot-product: We use an mmla4x4 intrinsic to produce a 4x4 (interleaved) output tile given 4x4 tiles from A and B. Note that we unroll it by a factor of two in order to produce the correct 8x4 output tile.

This is the rule we are using:

vec_a = ins[0].vload([0, 0], dtype_vec)  # vec_a contains 4 rows of A (4 elements each)
vec_aa = [select_word(vec_a, i, dtype_vec) for i in range(0, 4)]  # Select (broadcast) the i-th row
vec_b = ins[1].vload([0, 0], dtype_vec)  # vec_b contains the 4 columns of B (4 elements each)

# Execute the matrix multiplication
for i in range(0, 4):
    vec_c = outs[0].vload([i, 0], 'int32x4')
    vdot = tvm.tir.call_llvm_intrin(
        'int32x4',
        'llvm.aarch64.neon.sdot',
        tvm.tir.const(3, 'uint32'),
        vec_c, vec_b, vec_aa[i])

    # Store the result
    ib.emit(outs[0].vstore([i, 0], vdot))

We will give some more information about select_word later in this RFC.

Strategy #2: 4x16 output tile, A native and B transposed and interleaved

This strategy is different from the one we previously adopted, and deserves some more explanation.

A is in native form: We don’t interleave A, but we do pad it if necessary. Now the i-th load instruction loads 16 elements from the i-th row of A.

Interleave and transpose B: For B, nothing changes conceptually. We tile it the same way we did previously, but with a 16x4 tile shape. Each tile of the reshaped B will contain 4 consecutive elements of 16 columns of B.

Computation through dot-product: We use an mmla16x4 intrinsic. The inputs are an Rx16 tile of A (R is the number of resulting rows) and a 16x16 block of B_interleaved_t, i.e., four of the 16x4 tiles described above. Before showing any code, we illustrate the tiled computation in the following picture, where R is set to 4. The idea is the following:

  1. A single load reads 16 consecutive elements from matrix A (which is in its native form): 4 of them are green, 4 of them are blue, and so on.
  2. The first output row C[0,0:4] is produced in the following way:
  `C[0, 0:4] = A[0, 0:4] * B_interleaved_t[0:4, 0:4]`
  `C[0, 0:4] += A[0, 4:8] * B_interleaved_t[4:8, 0:4]`
  `C[0, 0:4] += A[0, 8:12] * B_interleaved_t[8:12, 0:4]`
  `C[0, 0:4] += A[0, 12:16] * B_interleaved_t[12:16, 0:4]`
  3. Repeat the same operation for each of the R rows of C (see the NumPy sketch after this list).
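
As a reference for the tiled computation above, here is a NumPy sketch (logical layout only, ignoring how B_interleaved_t is actually stored in memory; names and shapes are illustrative):

import numpy as np

R, K, N = 4, 16, 16   # one output tile: R rows, 16 columns, depth 16
A_tile = np.random.randint(0, 255, size=(R, K), dtype=np.uint8)
B_tile = np.random.randint(0, 255, size=(K, N), dtype=np.uint8)
C = np.zeros((R, N), dtype=np.int32)

for k in range(R):           # output rows
    for j in range(4):       # blocks of 4 output columns
        for i in range(4):   # depth blocks of 4 elements (one dot-product each)
            C[k, 4*j:4*j+4] += (
                A_tile[k, 4*i:4*i+4].astype(np.int32)
                @ B_tile[4*i:4*i+4, 4*j:4*j+4].astype(np.int32)
            )

# The tile is just a plain matrix multiplication
assert np.array_equal(C, A_tile.astype(np.int32) @ B_tile.astype(np.int32))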

A few things worth underlining:

  • In the picture we tried to render the algorithm with different colors: multiplications only happen between tiles of the same color
  • The tiles of B_interleaved_t in the picture do not represent the real memory layout. Basically, tile [0,0] is stored by rows, followed by tiles [1,0], [2,0], [3,0], [0,1], etc. (this reinforces the fact that B_interleaved_t is a block-transposed version of B)
  • Very importantly, the output C is already in its native form. We thus don’t need to unpack it.

For completeness we write down the tensorization node we use to implement the above tiled computation:

for k in range(0, rows):
    vec_a = ins[0].vload([k, 0], dtype_vec)  # 16 elements of the k-th row of A

    for j in range(0, 4):      # blocks of 4 output columns
        for i in range(0, 4):  # depth blocks of 4 elements
            vec_aa = select_word(vec_a, i, dtype_vec)
            vec_b = ins[1].vload([i, 4*j, 0], dtype_vec)
            vec_c = outs[0].vload([k, 4*j], 'int32x4')
            vdot = tvm.tir.call_llvm_intrin(
                'int32x4',
                'llvm.aarch64.neon.sdot',
                tvm.tir.const(3, 'uint32'),
                vec_c, vec_b, vec_aa)

            # Store the partial result
            ib.emit(outs[0].vstore([k, 4*j], vdot))

How to produce the correct indexed dot-product: the select_word() function

The indexed dot-product is not available as an LLVM intrinsic. It is instead produced by an LLVM IR optimization when we write:

# Reinterpret vec (uint8x16) as 4 int32 words
vec_int32 = tvm.tir.call_intrin('int32x4', 'tir.reinterpret', vec)
# Broadcast the lane-th word
vec_int32_shuffled = tvm.tir.Shuffle([vec_int32], [lane, lane, lane, lane])
# Convert back to uint8x16
vec_int8_broadcast = tvm.tir.call_intrin(dtype_vec, 'tir.reinterpret', vec_int32_shuffled)

udot(vec_c, vec_b, vec_int8_broadcast)

The first 3 instructions are implemented in a utility function named select_word in topi/arm_cpu/tensor_intrin.py.
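
For reference, a minimal sketch of how such a helper can be packaged (the actual implementation in topi/arm_cpu/tensor_intrin.py may differ in details):

import tvm

def select_word(vec, lane, dtype_vec):
    # Broadcast the lane-th 32-bit word of a (u)int8x16 vector to all 4 words,
    # so the subsequent sdot/udot behaves like its indexed variant.
    vec_int32 = tvm.tir.call_intrin('int32x4', 'tir.reinterpret', vec)
    vec_int32_shuffled = tvm.tir.Shuffle([vec_int32], [lane, lane, lane, lane])
    return tvm.tir.call_intrin(dtype_vec, 'tir.reinterpret', vec_int32_shuffled)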

Why implement both strategies?

If we do the math, the number of memory accesses for the interleaved approach is slightly smaller than for the hybrid approach. However, the hybrid kernels don’t need interleaving of the data or un-interleaving of the output. Since we try to fuse those transformations, it is not entirely clear which one is best. The best approach is to let the tuner decide the winner.

Performance improvements

To get an initial measure of the performance improvements, we again consider inception_v3 (which is a good benchmark, given its variety of shapes) running on a Neoverse N1 machine.

The results we measured are as follows:

  • 2.41x improvement compared to the Armv8-A implementation
  • About 5% slower than ArmNN (which uses ACL under the hood)

These are encouraging results, which is why we will submit this improvement as is before venturing into more exotic optimizations.

Next steps

Comparing performance across different networks

While the results for inception_v3 are satisfactory, we will compare performance against ArmNN for other networks as well, to understand whether there are significant gaps that need to be addressed.

Improving performance further: padding and fusion

The hybrid strategy aims at avoiding memory-bound operations (like packing/unpacking) and gives us the possibility of fusing the requantization directly into the main computation. However, we ran into the following issues:

  • Since we are applying the mmla16x4 intrinsic through tensorization, we need to pad A beforehand, which is itself a memory-bound operation, defeating the benefits of this approach. Simple approaches to remove the padding seem ineffective:

    • If we don’t pad and run tensorize over variable-dimension tiles, it simply fails (see this discuss post)
    • If we don’t pad and run tensorize only over fixed-dimension tiles, @tir.likely statements appear, hurting performance.
  • For the same reason, we cannot fuse the requantization into the computation. In addition to the inability to compute_at within tensorize, we are also blocked by the inability to compute_at on a fused/split axis (as mentioned in this post)

We are currently working on a well-designed solution to address both issues. Possible approaches are still being evaluated, and every suggestion is welcome!

PR

The PR for this RFC is available here: https://github.com/apache/incubator-tvm/pull/6445


cc @anijain2305, @FrozenGene, @ramana-arm