Introduction and motivations
In the past few weeks, we introduced quite a few optimizations for AArch64 targets:
- We introduced an intrinsic in order to use efficient AArch64 instructions to run the inner loop of the quantized convolution
- We overloaded fixed point multiplication to use the relevant LLVM intrinsics for AArch64
The aim of those optimizations was to achieve “good enough” out-of-the-box performance. We will use this RFC to track auto-tuner optimizations related to quantized convolution for AArch64 targets (with NHWC layout).
Tuning entities
In the following paragraphs we provide a list of the tuning entities we use in our conv2d schedule.
Unrolling and vectorizing matrix transform
After the im2col operation we need to interleave the input matrix into a `[rows/4, cols/16, 4, 16]` shape. The interior loops over the 4 and 16 elements are a perfect candidate for an annotation with a `try_unroll_vec` policy, as sketched below.
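The following is a minimal sketch of how this could be expressed with AutoTVM; the tensor and knob names (e.g. `A_interleaved`) are illustrative rather than the exact ones used in the PR:

```python
# A_interleaved is assumed to be the [rows/4, cols/16, 4, 16] tensor
# produced after im2col; xi and yi iterate over the 4 and 16 elements.
xo, yo, xi, yi = s[A_interleaved].op.axis

# Let the auto-tuner decide whether to unroll and/or vectorize the two
# innermost loops.
cfg.define_annotate("A_interleaved_unroll_vec", [xi, yi],
                    policy="try_unroll_vec")
cfg["A_interleaved_unroll_vec"].apply(s, A_interleaved, [xi, yi])
```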
Reordering gemm
When we calculate gemm we run the computation over a shape of `[M//4, N//4, 4, 4]` and then parallelize over the outer dimension, which by default is `M//4`. However, `M` might be too small to offer enough parallelism. The idea is to reorder the outer dimensions `[M//4, N//4]` through a `(0,1)` or `(1,0)` reordering (the default is `(0,1)`). We then parallelize over the outermost dimension.
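A sketch of the corresponding knob, assuming `C_interleaved` is the `[M//4, N//4, 4, 4]` gemm output (the knob and tensor names are illustrative):

```python
# Let the auto-tuner pick the order of the two outer gemm dimensions.
cfg.define_knob("reorder_gemm", ["xy", "yx"])

xo, yo, xi, yi = s[C_interleaved].op.axis
if cfg["reorder_gemm"].val == "yx":
    # (1,0) reordering: N//4 becomes the outermost dimension.
    s[C_interleaved].reorder(yo, xo, xi, yi)
    s[C_interleaved].parallel(yo)
else:
    # (0,1) reordering (default): parallelize over M//4.
    s[C_interleaved].parallel(xo)
```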
Unrolling the `gemm_quantized` intrinsic
The inner loop of the GEMM is implemented through a `gemm_quantized_4_4` intrinsic, which is a hand-written piece of AArch64 assembly. In order to introduce loop unrolling we add a boolean `unroll` knob and use it within the implementation. The following snippet shows how the `unroll` knob is used:
```python
if unroll:
    # Fully unroll the reduction loop: emit the assembly body K//16 times.
    k = int(K // 16)
    for l in range(0, k):
        cc_code += main_loop
else:
    # Keep the reduction loop in assembly: emit the body once, followed by
    # the decrement-and-branch-back instructions.
    cc_code += main_loop
    cc_code += """ "subs %w[k], %w[k], #1\\n"
        "cbnz %w[k], 1b\\n" """
```
In the future, we might either use more sophisticated policies within the intrinsic implementation (easy) or try to move the loop outside the intrinsic (i.e., using normal TIR IterVars) and use standard TVM annotation entities (this would be hard because of the final accumulation “after” the loop, see [RFC] Improve quantized convolution performance for armv8 architectures).
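As a hypothetical sketch (the knob name and the way the flag is forwarded to the intrinsic generator are assumptions, not necessarily the code in the PR), the knob could be wired up like this:

```python
# Declare the boolean knob in the schedule configuration...
cfg.define_knob("gemm_quantized_unroll", [True, False])

# ...and forward its value to the intrinsic generator, which emits the
# assembly string and is then tensorized over the inner gemm axes.
gemm = gemm_quantized_4_4(M, N, K, unroll=cfg["gemm_quantized_unroll"].val)
s[C_interleaved].tensorize(xi, gemm)
```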
Parallel pipelines
As described in the software optimization guides of many recent Arm processors (e.g., the Neoverse-N1), instructions like `uadalp` and `umull` can be issued to different functional units, so different micro-architectures benefit from slightly different instruction scheduling.
If we have a look at the original intrinsic implementation (for the higher part of the first half), we see the following:
```
// Higher part of a0 * {b0,b1,b2,b3}
"umull v8.8h, v0.8b, v4.8b\\n"
"umull v9.8h, v0.8b, v5.8b\\n"
"umull v10.8h, v0.8b, v6.8b\\n"
"umull v11.8h, v0.8b, v7.8b\\n"
// Higher part of a1 * {b0,b1,b2,b3}
"umull v12.8h, v1.8b, v4.8b\\n"
"umull v13.8h, v1.8b, v5.8b\\n"
"umull v14.8h, v1.8b, v6.8b\\n"
"umull v15.8h, v1.8b, v7.8b\\n"
// Accumulate
"uadalp v16.4s, v8.8h\\n"
"uadalp v17.4s, v9.8h\\n"
"uadalp v18.4s, v10.8h\\n"
"uadalp v19.4s, v11.8h\\n"
"uadalp v20.4s, v12.8h\\n"
"uadalp v21.4s, v13.8h\\n"
"uadalp v22.4s, v14.8h\\n"
"uadalp v23.4s, v15.8h\\n"
```
So the `uadalp` and `umull` instructions are batched together (first a round of `umull`s and then a round of `uadalp`s). Depending on the instruction latencies and the micro-architecture in question, different orderings of those instructions will behave differently. An alternative implementation interleaves each `umull` with the `uadalp` that accumulates its result:
```
// First half
// Higher part of a0 * {b0,b1,b2,b3} and accumulate
"umull v8.8h, v0.8b, v4.8b\\n"
"uadalp v16.4s, v8.8h\\n"
"umull v9.8h, v0.8b, v5.8b\\n"
"uadalp v17.4s, v9.8h\\n"
"umull v10.8h, v0.8b, v6.8b\\n"
"uadalp v18.4s, v10.8h\\n"
"umull v11.8h, v0.8b, v7.8b\\n"
"uadalp v19.4s, v11.8h\\n"
// Higher part of a1 * {b0,b1,b2,b3} and accumulate
"umull v12.8h, v1.8b, v4.8b\\n"
"uadalp v20.4s, v12.8h\\n"
"umull v13.8h, v1.8b, v5.8b\\n"
"uadalp v21.4s, v13.8h\\n"
"umull v14.8h, v1.8b, v6.8b\\n"
"uadalp v22.4s, v14.8h\\n"
"umull v15.8h, v1.8b, v7.8b\\n"
"uadalp v23.4s, v15.8h\\n"
```
Interleaving those instructions improves pipeline utilization and the speed of the convolution (since the `umull` and `uadalp` instructions will be able to execute in parallel). Instead of choosing between one implementation or the other, we introduce a boolean `interleave` knob to switch between the batched and the interleaved intrinsic, as sketched below. This allows the auto-tuner to choose the best implementation for the underlying micro-architecture.
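A possible way to wire this up (again, the knob and helper names, as well as the `INTERLEAVED_BODY`/`BATCHED_BODY` placeholders, are illustrative assumptions):

```python
# Let the auto-tuner choose between the two hand-written assembly bodies.
cfg.define_knob("gemm_quantized_interleave", [True, False])

def select_main_loop(interleave):
    """Return the assembly body for one block of the quantized gemm."""
    if interleave:
        # umull/uadalp pairs interleaved: better for cores that can issue
        # the two instructions to different functional units in parallel.
        return INTERLEAVED_BODY
    # Batched variant: a round of umulls followed by a round of uadalps.
    return BATCHED_BODY

main_loop = select_main_loop(cfg["gemm_quantized_interleave"].val)
```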
Results
We tested these changes against a few well-known networks (instead of focusing on a single one). The following table shows the results as the ratio of TFLite runtime to TVM runtime (higher means TVM is faster), measured on a Neoverse-N1 device with 4 threads:
| Network | TFLite/TVM runtime ratio |
|---|---|
| Inception V3 | 1.176 |
| Inception V4 | 1.128 |
| ResNet-50 | 1.338 |
| SqueezeNet | 1.476 |
| VGG-16 | 1.103 |
A few things to note:
- We are now between 10% and 47% faster than TFLite on AArch64.
- Tuning time is at most 10 minutes (for the very big networks).
- We didn’t evaluate MobileNet networks, as those rely on depthwise convolution optimizations. We are carrying out those optimizations in this RFC and will try to evaluate them at a later stage.
PR
The PR for this RFC is available here