Introduction and motivation
Depthwise convolution is a lightweight convolution operation used in mobile networks like mobilenet.
The operation is similar to a regular convolution, but there is no reduction along the channel dimension (i.e., it applies a separate 2D convolution to each input channel).
While parameters like dilation and/or depth_multiplier are supported (in general) by this operation, they aren't frequently used, so we will start by optimizing for the common case of depth_multiplier==1 and dilation_{w,h}==1 (although we still support the other cases).
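To make the discussion concrete, below is a minimal TE sketch (our own example with made-up shapes, not the exact TOPI compute) of an NHWC depthwise convolution for the common case above: depth_multiplier==1, unit stride, no padding and no dilation. The schedule snippets later in this RFC can be read against a compute of this shape.
import tvm
from tvm import te

# Hypothetical shapes; int16 inputs to mirror the quantized case discussed below.
N, H, W, C = 1, 56, 56, 64
KH, KW = 3, 3

data = te.placeholder((N, H + KH - 1, W + KW - 1, C), name="data", dtype="int16")
kernel = te.placeholder((KH, KW, C), name="kernel", dtype="int16")

r_h = te.reduce_axis((0, KH), name="r_h")
r_w = te.reduce_axis((0, KW), name="r_w")

# One 2D convolution per channel: note that there is no reduction over c.
conv = te.compute(
    (N, H, W, C),
    lambda n, h, w, c: te.sum(
        data[n, h + r_h, w + r_w, c].astype("int32")
        * kernel[r_h, r_w, c].astype("int32"),
        axis=[r_h, r_w],
    ),
    name="depthwise_conv2d",
)

s = te.create_schedule(conv.op)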
Current status and proposal
In the arm_cpu strategy it explicitly says:
logger.warning(
"depthwise_conv2d with layout NHWC is not optimized for arm cpu."
)
This basically means that the depthwise convolution will be lowered through the generic compute and that no schedule transformations will be applied to the resulting (default) schedule.
This implies two things:
- Compilation time is very long. This is because injective operations are not inlined, which results in a lot of code duplication in the final binary
- Run time is very long: since no optimizations are applied to the schedule, execution is very slow
In this RFC we propose a default schedule (no autotuning knobs) to substantially improve the baseline compilation time and runtime of networks like mobilenet (with an NHWC layout).
While this work mostly focuses on quantized performance on AArch64 targets, it should also improve performance for fp32 and/or AArch32 targets.
Depthwise convolution schedule
In the following snippet we provide the depthwise convolution schedule we came up with:
# conv is the NHWC depthwise convolution stage; s is its schedule
n, w, h, c = conv.op.axis
r_h, r_w = conv.op.reduce_axis
# Split the channels by 8 (one int16x8 vector) and the width/height by 2
co, ci = s[conv].split(c, 8)
wo, wi = s[conv].split(w, 2)
ho, hi = s[conv].split(h, 2)
# Compute a 2x2x8 output tile in the inner loops, reducing over the kernel
s[conv].reorder(n, wo, ho, co, wi, hi, r_h, r_w, ci)
# Parallelize the outer spatial loop and vectorize over the 8 channels
s[conv].parallel(wo)
s[conv].vectorize(ci)
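For completeness, the transformed loop nest can be inspected by lowering the schedule (assuming the data, kernel, conv and s names from the sketch above): wo should be marked parallel and ci should become an 8-lane vector loop.
# Lower to TIR to check the loop structure produced by the schedule above.
print(tvm.lower(s, [data, kernel, conv], simple_mode=True))

# Hypothetical AArch64 target, used here only to inspect the generated assembly.
target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mattr=+neon")
lib = tvm.build(s, [data, kernel, conv], target=target, name="depthwise_conv2d")
print(lib.get_source("asm"))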
A few notes on the schedule:
- We split the channels by 8. This is because we were hoping for the inner loop to be lowered to a sequence of smlal→smlal2 instructions. However, as we understood from this discuss post, this cannot be achieved without tensorization. While we hope to tensorize the inner loop(s) at a later stage, we kept the 8-factor split, as it still seems to give better performance than splitting by a bigger or smaller factor.
- We split the width and the height by 2. This basically means that in the inner loops we are computing 4 outputs at a time. So we have: 4x4 int16x8_t inputs + 3x3 int16x8_t weights + 2x2x2 int32x4_t outputs = 33 registers, i.e., a single spill on AArch64 targets, which provide 32 NEON registers (see the quick check after this list). Of course, since we are not tensorizing, the compiler toolchain might not come up with the optimal register allocation. However, we found this configuration to perform best in our experiments.
- While different parallelization policies could be put in place, we decided to keep things simple and parallelize on the outer dimension.
- The last transformation (s[conv].vectorize(ci)) should really be substituted by a tensorize transformation.
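As a quick sanity check of the register count quoted in the second note (an illustration we added, assuming each int16x8_t/int32x4_t value occupies one 128-bit NEON register and that AArch64 exposes 32 of them):
inputs  = 4 * 4      # 4x4 int16x8_t input vectors for a 2x2 output tile with a 3x3 kernel
weights = 3 * 3      # 3x3 int16x8_t weight vectors
accs    = 2 * 2 * 2  # 2x2 outputs, each accumulated into 2 int32x4_t registers
live = inputs + weights + accs
print(live, "live values ->", max(0, live - 32), "spill(s)")  # 33 -> 1 spill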
Important note on Requantization
Please note that in quantized networks every convolution is followed by a requantization step. Since we are not packing/unpacking the data, we can fuse the requantization directly into the main computation loop!
In other words (when quantization is used), we can write:
s[conv].compute_at(s[out], hi)
This means that the requantization happens immediately after the reduction over the kernel width/height axes (r_h, r_w).
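A hedged sketch of how this can look, continuing from the compute sketch above: requant below is a stand-in name we use for the requantization stage that consumes conv (in the real flow it comes from the QNN lowering), the splits/reorder from the schedule above are applied to this output stage, and the convolution is computed at its hi axis.
# Stand-in requantization (a fixed shift back to int8), only for illustration.
requant = te.compute(
    conv.shape,
    lambda n, h, w, c: (conv[n, h, w, c] >> 8).astype("int8"),
    name="requantize",
)

s = te.create_schedule(requant.op)
n, h, w, c = requant.op.axis
co, ci = s[requant].split(c, 8)
wo, wi = s[requant].split(w, 2)
ho, hi = s[requant].split(h, 2)
s[requant].reorder(n, wo, ho, co, wi, hi, ci)
s[requant].parallel(wo)
s[requant].vectorize(ci)

# Fuse: each output tile is requantized right after its r_h/r_w reduction.
s[conv].compute_at(s[requant], hi)

print(tvm.lower(s, [data, kernel, requant], simple_mode=True))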
Results
Using this vanilla TVM kernel we achieved two results:
- Compilation time of networks like mobilenet was reduced by about a factor of 10
- We obtained a nice 44% runtime improvement on both quantized mobilenet and mobilenet_v2 networks (for NHWC layouts)
PR
The PR for this RFC has been submitted here: https://github.com/apache/incubator-tvm/pull/6095