To bring MobileNet inference support, computation schedules for group conv2d and depthwise conv2d need to be implemented. We currently have group conv2d support in #4421, but we don't have depthwise conv2d support yet.

Take the first depthwise conv2d layer in MobileNetV1-1.0 as an example: it takes input_shape=(1, 32, 112, 112) and weight_shape=(32, 1, 3, 3), and expects output_shape=(1, 32, 112, 112).
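To make the shapes concrete, here is a small sketch of the output-shape arithmetic. The helper name `depthwise_conv2d_output_shape` is made up for illustration, and stride 1 with padding 1 is an assumption inferred from the fact that the 112x112 spatial size is preserved by a 3x3 kernel; the post itself does not state these values.

```python
def depthwise_conv2d_output_shape(input_shape, weight_shape, stride=1, padding=1):
    """Output shape of an NCHW depthwise conv2d (hypothetical helper).

    weight_shape is (channels, channel_multiplier, kernel_h, kernel_w).
    stride=1 and padding=1 are assumptions, not values given in the post.
    """
    n, c, h, w = input_shape
    w_c, multiplier, kh, kw = weight_shape
    # Depthwise: each input channel has its own filter(s), so the channel
    # counts of input and weight must match.
    assert w_c == c, "weight channels must match input channels"
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (w + 2 * padding - kw) // stride + 1
    return (n, c * multiplier, out_h, out_w)


shape = depthwise_conv2d_output_shape((1, 32, 112, 112), (32, 1, 3, 3))
print(shape)
```

Note that there is no reduction over channels here: each of the 32 output channels depends on exactly one input channel.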

### Challenges

First, in depthwise conv2d, the accumulation ONLY takes place over the spatial axes. The GEMM instruction performs 16 fused multiply-add (FMA) operations, while the workload requires only 9 of them (one per tap of the 3x3 kernel). An easy workaround is to transform the input with im2col, so that the depthwise conv2d operator becomes a matrix multiplication problem. A drawback of this approach is that it consumes a large amount of memory if we leave the im2col operation entirely in a separate operator.
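The im2col workaround and its memory cost can be sketched in plain Python as follows. This is only an illustration under stated assumptions: stride 1, padding 1, a single image in `[C][H][W]` list layout, and the 9 kernel taps zero-padded to a reduction length of 16 (`k_pad`) to match the 16 FMAs mentioned above. All function names are hypothetical and this is not the actual schedule.

```python
def depthwise_conv2d_direct(x, w, pad=1):
    """Reference depthwise conv2d, stride 1. x: [C][H][W], w: [C][KH][KW]."""
    c, h, wd = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(w[0]), len(w[0][0])
    oh, ow = h + 2 * pad - kh + 1, wd + 2 * pad - kw + 1
    out = [[[0] * ow for _ in range(oh)] for _ in range(c)]
    for ch in range(c):
        for i in range(oh):
            for j in range(ow):
                acc = 0
                # Accumulate over the spatial taps only, never over channels.
                for di in range(kh):
                    for dj in range(kw):
                        y, xx = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= xx < wd:
                            acc += x[ch][y][xx] * w[ch][di][dj]
                out[ch][i][j] = acc
    return out


def im2col_channel(plane, kh, kw, pad=1, k_pad=16):
    """One row per output pixel: kh*kw taps, zero-padded to k_pad entries."""
    h, wd = len(plane), len(plane[0])
    oh, ow = h + 2 * pad - kh + 1, wd + 2 * pad - kw + 1
    rows = []
    for i in range(oh):
        for j in range(ow):
            row = []
            for di in range(kh):
                for dj in range(kw):
                    y, xx = i + di - pad, j + dj - pad
                    row.append(plane[y][xx] if 0 <= y < h and 0 <= xx < wd else 0)
            rows.append(row + [0] * (k_pad - len(row)))
    return rows


def depthwise_conv2d_im2col(x, w, pad=1, k_pad=16):
    """Depthwise conv2d as one matrix-vector product per channel."""
    kh, kw = len(w[0]), len(w[0][0])
    out = []
    for plane, kernel in zip(x, w):
        cols = im2col_channel(plane, kh, kw, pad, k_pad)
        taps = [v for row in kernel for v in row]
        taps += [0] * (k_pad - len(taps))
        flat = [sum(a * b for a, b in zip(row, taps)) for row in cols]
        ow = len(plane[0]) + 2 * pad - kw + 1
        out.append([flat[r * ow:(r + 1) * ow] for r in range(len(flat) // ow)])
    return out
```

Under these assumptions each input element of a channel expands into a 16-entry row (9 useful taps plus 7 zeros), so a materialized im2col buffer is 16 times the size of the input plane. That is the memory cost the paragraph above refers to when im2col lives in a separate operator.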

Second, to maximize the utilization of the GEMM compute unit, it would be better to load inputs into `local.wgt_buffer` and weights into `local.inp_buffer`. Specifically, if we could load the 3x3 spatial data as a 1x16 vector (serving as the weight buffer) into `local.inp_buffer` and load 16x16 inputs into `local.wgt_buffer`, the spatial data could be reused to multiply with all inputs. However, exchanging the input and weight buffers might cause additional problems, and it is unclear whether it is worth doing.
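The buffer exchange can be pictured with a toy model of one GEMM step. The sketch below assumes the step computes `out[o] = sum_k inp[k] * wgt[o][k]` for a 1x16 vector and a 16x16 tile; that semantics, the name `gemm_1x16`, and the sample values are all assumptions for illustration, not the actual hardware intrinsic.

```python
BLOCK = 16  # assumed GEMM tile size: a 1x16 vector times a 16x16 tile


def gemm_1x16(inp_vec, wgt_tile):
    """Toy model of one GEMM step: out[o] = sum_k inp_vec[k] * wgt_tile[o][k]."""
    assert len(inp_vec) == BLOCK and len(wgt_tile) == BLOCK
    return [sum(a * b for a, b in zip(inp_vec, row)) for row in wgt_tile]


def pad_to_block(values):
    """Zero-pad a short vector (e.g. 9 kernel taps) to BLOCK entries."""
    values = list(values)
    return values + [0] * (BLOCK - len(values))


# The 3x3 kernel of one channel goes where the *input* vector normally sits.
kernel_taps = pad_to_block([1, 0, -1, 2, 0, -2, 1, 0, -1])

# Sixteen 3x3 input patches of the same channel (one per output pixel) go
# where the *weight* tile normally sits. The values are arbitrary samples.
patches = [pad_to_block([p + t for t in range(9)]) for p in range(BLOCK)]

# One step yields 16 output pixels while reusing the same kernel vector.
outputs = gemm_1x16(kernel_taps, patches)
print(outputs)
```

In this arrangement a single step produces 16 output pixels of one channel, but 7 of the 16 reduction lanes still multiply zeros, so the utilization of the reduction axis is 9/16 at best.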

Please share your thoughts on supporting tensorized computation in depthwise convolution operators.