[RFC] Improve depthwise convolution for NHWC layouts on AArch64 targets

Introduction and motivation

Depthwise convolution is a lightweight convolution operation used in mobile networks like mobilenet.

The operation is similar to a regular convolution, but there is no reduction along the channel dimension: a separate 2D convolution is applied to each input channel.
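For reference, here is a minimal TE sketch of the computation for an NHWC layout (stride 1, no padding, depth_multiplier == 1); the shapes and names below are purely illustrative:

import tvm
from tvm import te

# Illustrative shapes: NHWC input, KH x KW x C kernel (depth_multiplier == 1).
N, H, W, C = 1, 56, 56, 64
KH, KW = 3, 3

data = te.placeholder((N, H, W, C), name="data", dtype="int16")
kernel = te.placeholder((KH, KW, C), name="kernel", dtype="int16")

r_h = te.reduce_axis((0, KH), name="r_h")
r_w = te.reduce_axis((0, KW), name="r_w")

# No reduction over channels: each channel c is convolved independently.
conv = te.compute(
    (N, H - KH + 1, W - KW + 1, C),
    lambda n, h, w, c: te.sum(
        data[n, h + r_h, w + r_w, c].astype("int32")
        * kernel[r_h, r_w, c].astype("int32"),
        axis=[r_h, r_w],
    ),
    name="depthwise_conv2d",
)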

While parameters like dilation and/or depth_multiplier are supported (in general) by this operation, they aren’t frequently used, so we start by optimizing for the common case of depth_multiplier==1 and dilation_{w,h}==1 (although the other cases are still supported).

Current status and proposal

In the arm_cpu strategy it explicitly says:

logger.warning( "depthwise_conv2d with layout NHWC is not optimized for arm cpu." )

This basically means that the depthwise convolution will be lowered through the generic compute and that no schedule transformations will be applied to the resulting (default) schedule.

This implies two things:

  • Compilation time is very long, because injective operations are not inlined, which leads to a lot of code duplication in the final binary
  • Run time is very long, because no optimization is applied to the default schedule

In this RFC we propose a default schedule (no autotuning knobs) to heavily improve the baseline compilation time and runtime of networks like mobilenet (with an NHWC layout).

While this work mostly focuses on quantized performance on AArch64 targets, it should also improve performance for fp32 and/or AArch32 targets.

Depthwise convolution schedule

In the following snippet we provide the depthwise convolution schedule we came up with:

n, w, h, c = conv.op.axis
r_h, r_w = conv.op.reduce_axis
co, ci = s[conv].split(c, 8)
wo, wi = s[conv].split(w, 2)
ho, hi = s[conv].split(h, 2)
s[conv].reorder(n, wo, ho, co, wi, hi, r_h, r_w, ci)
s[conv].parallel(wo)
s[conv].vectorize(ci)
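For completeness, here is a sketch of how the schedule above can be applied to the TE compute sketched in the introduction and lowered for inspection. The snippet above unpacks the two spatial axes as (w, h); since both are tiled by 2 the resulting structure is the same.

s = te.create_schedule(conv.op)

# Axes of the NHWC output and of the kernel reduction.
n, h, w, c = conv.op.axis
r_h, r_w = conv.op.reduce_axis

co, ci = s[conv].split(c, 8)   # vectorize over blocks of 8 channels
ho, hi = s[conv].split(h, 2)   # compute a 2x2 block of output pixels per tile
wo, wi = s[conv].split(w, 2)

s[conv].reorder(n, ho, wo, co, hi, wi, r_h, r_w, ci)
s[conv].parallel(ho)           # parallelize the outermost spatial loop
s[conv].vectorize(ci)

# Print the lowered loop nest to check the resulting structure.
print(tvm.lower(s, [data, kernel, conv], simple_mode=True))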

A few notes on the schedule:

  • We split the channels by 8. This is because we were hoping for the inner loop to be lowered to a sequence of smlal/smlal2 instructions. However, as we understood from this discuss post, this cannot be achieved without tensorization. While we hope to tensorize the inner loop(s) at a later stage, we kept the 8-factor split, as it still seems to give better performance than splitting by a bigger or smaller factor.
  • We split the width and the height by 2. This basically means that in the inner loops we are computing 4 outputs at a time, so we have: 4x4 int16x8_t inputs + 3x3 int16x8_t weights + 2x2x2 int32x4_t outputs = 33 registers, with a single register spill on AArch64 targets. Of course, since we are not tensorizing, the compiler might not come up with the optimal register allocation. However, we found this configuration to perform best in our experiments.
  • While different parallelization policies could be put in place, we decided to keep things simple and parallelize on the outer dimension.
  • The last transformation (s[conv].vectorize(ci)) should really be replaced by a tensorize transformation.

Important note on Requantization

Please note that in quantized networks every convolution is followed by a requantization. Since we are not packing/unpacking the data, we can fuse the requantization directly into the main computation loop!

In other words (when quantization is used), we can write:

s[conv].compute_at(s[out], hi)

This means that the requantization happens immediately after the reduction along the kernel width/height axes.
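As a rough illustration, reusing the conv sketch from the introduction and a simplified stand-in for the real qnn.requantize computation (the shift/offset below are made up), the fusion looks like this:

# Simplified stand-in for requantization: rescale the int32 accumulator
# back to int8 (the real QNN requantize also handles zero points, scales
# and rounding).
out = te.compute(
    conv.shape,
    lambda n, h, w, c: ((conv[n, h, w, c] + 64) >> 7).astype("int8"),
    name="requantized",
)

s = te.create_schedule(out.op)

# Tile the requantized output the same way as the convolution schedule above.
n, h, w, c = out.op.axis
co, ci = s[out].split(c, 8)
ho, hi = s[out].split(h, 2)
wo, wi = s[out].split(w, 2)
s[out].reorder(n, ho, wo, co, hi, wi, ci)
s[out].parallel(ho)
s[out].vectorize(ci)

# Fuse: compute the convolution tile right where it is requantized.
s[conv].compute_at(s[out], hi)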

Results

Using this vanilla TVM kernel we achieved two results:

  • Compilation time of networks like mobilenet is reduced by about a factor of 10
  • We obtained a nice 44% runtime improvement on both quantized mobilenet and mobilenet_v2 networks (for NHWC layouts)

PR

The PR for this RFC has been submitted here: https://github.com/apache/incubator-tvm/pull/6095


cc @ramana-arm @anijain2305 @FrozenGene

Thanks for this work. How does this template compare to ACL?

Hi @kevinthesun, Thanks for your comment!

If I write a simple depthwise schedule (without going through Relay), it looks like we are pretty competitive with ACL, and even faster, even without the smlal/smlal2 trick.

However, when I compile it through Relay, additional time (up to 50%) is spent doing the reduction to calculate the offset contribution.

Reductions in ACL are 10x faster than the ones in TVM. This wasn’t problematic for float32, but in the quantized world reductions are basically everywhere (to calculate the offset contributions of the inputs).


Hi @kevinthesun,

After a bit of investigation, I found out that we were not legalizing the depthwise convolution, and that is why we were introducing the overhead with pooling, reductions, etc.

Once I introduced int16 legalization (i.e., subtracting the offset before the convolution), performance got a 2x boost!
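For anyone curious, the idea behind the legalization is roughly the following TE sketch (in the codebase this happens at the Relay/QNN legalization level; the shapes and zero points below are made up):

# Hypothetical shapes and zero points, purely for illustration.
N, H, W, C, KH, KW = 1, 56, 56, 64, 3, 3
input_zp, kernel_zp = 128, 120

data_u8 = te.placeholder((N, H, W, C), name="data_u8", dtype="uint8")
kernel_u8 = te.placeholder((KH, KW, C), name="kernel_u8", dtype="uint8")

# Subtract the zero points up front and widen to int16, so the depthwise
# convolution runs on zero-centred int16 data and no separate
# offset-contribution reductions are needed afterwards.
data_i16 = te.compute(
    data_u8.shape,
    lambda n, h, w, c: data_u8[n, h, w, c].astype("int16") - input_zp,
    name="data_i16",
)
kernel_i16 = te.compute(
    kernel_u8.shape,
    lambda kh, kw, c: kernel_u8[kh, kw, c].astype("int16") - kernel_zp,
    name="kernel_i16",
)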

While I don’t have a direct comparison with ACL (yet), I can tell you that, for instance, mobilenet V2 is now on par with TFLite.


Cool! Thanks for your great work!