An Alternative Bitpacking Procedure for ARM CPUs

The current bitpacking implementation for ARM packs along the 3rd dimension (pack_axis=2), which is a spatial dimension. This leads to noticeably higher runtimes for inputs whose height and width are smaller than 32, i.e. N = M < 32. This is likely because the implementation maps poorly onto the available ARM SIMD instructions. Is it possible to change the dimension along which bitpacking is performed from a spatial dimension to the channel dimension?
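
To make "packing along the channel dimension" concrete, here is a minimal pure-Python sketch. It is not TVM's implementation. The helper names (pack_bits, pack_channels), the H x W x C nested-list layout, and the most-significant-bit-first ordering are all assumptions made for illustration, and TVM's actual bit order and layout may differ.

```python
def pack_bits(bits, word_size=32):
    """Pack a flat list of 0/1 values into unsigned integers of word_size bits.

    Hypothetical helper: the first value lands in the most significant bit,
    and a short final chunk is padded with zeros on the right.
    """
    words = []
    for start in range(0, len(bits), word_size):
        chunk = bits[start:start + word_size]
        chunk = chunk + [0] * (word_size - len(chunk))  # zero-pad the tail
        word = 0
        for b in chunk:
            word = (word << 1) | (b & 1)
        words.append(word)
    return words


def pack_channels(pixels, word_size=32):
    """Pack an H x W x C nested list of 0/1 values along the channel axis.

    The result has shape H x W x ceil(C / word_size); the spatial
    dimensions are left untouched.
    """
    return [[pack_bits(px, word_size) for px in row] for row in pixels]


# A 1 x 2 "image" with 8 binary channels, packed into 8-bit words.
image = [[[1, 0, 0, 0, 0, 0, 0, 1], [1, 1, 1, 1, 0, 0, 0, 0]]]
print(pack_channels(image, word_size=8))  # [[[129], [240]]]
```

In this sketch the packed word count depends only on the channel count, so small height and width values do not change how fully each word is used.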

I have already tried changing the pack_axis argument from the value above to the index of the channel dimension, but doing so produces the following error, whose cause I have not been able to trace:

tvm._ffi.base.TVMError: Traceback (most recent call last):
  4: TVMFuncCall
  3: _ZNSt17_Function_handlerIFvN3tvm7runtime7TVMArgsEPNS1_11TVMR
  2: tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
  1: tvm::runtime::RPCClientSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)
  0: tvm::runtime::RPCEndpoint::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)>)
  File "/home/bsparks/tvm/src/runtime/rpc/rpc_endpoint.cc", line 797
TVMError: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (code == RPCCode::kReturn) is false: code=kShutdown