Questions on the ARM Cortex-M Microkernels in TVM

I recently tried out the mprofile schedules, which use tensorization to replace the innermost loops of convolution, dense, and pooling layers with hardcoded C microkernels (see https://github.com/apache/tvm/pull/9233). A few questions came up, and I would be happy if someone could answer them.

  • Is there a specific reason why support for depthwise convolutions was not implemented? It might be a good idea to add schedules for these as well.
  • The microkernels contain threshold values for falling back to a loop-based implementation if some input dimensions are too small. What approach was used to find the optimal values for these thresholds across devices?
  • I was wondering why the DSP-optimized variant of the “nn.dense” operator was never invoked for several TFLite models. After having a look at the schedule definition (https://github.com/apache/tvm/blob/fc419df32f052e21f614c8940699c10a2d696689/python/tvm/topi/arm_cpu/mprofile/dsp/dense.py#L41) and the microkernel (https://github.com/apache/tvm/blob/fc419df32f052e21f614c8940699c10a2d696689/python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/gemm.py#L474), it turned out that a batch size (M) greater than 2 (int16) or 16 (int8) is required to make use of the optimized code. Is there a specific reason for this? If a GEMM microkernel does not perform well for the inference of a fully connected layer, wouldn’t it make more sense to use a simple dot product instead?
  • When commenting out the threshold check for the int8 GEMM microkernel (see https://github.com/apache/tvm/blob/fc419df32f052e21f614c8940699c10a2d696689/python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/gemm.py#L316) to enable accelerated “nn.dense” support for a batch size of 1, I ran into a major issue: the stack got corrupted when using toycar.tflite, because the microkernels allocate a local scratchpad buffer int16_t bb_pad[{bb_pad_size}]; of size N x K = 640 x 128, which exceeds the configured stack size of 16 kB (see the arithmetic sketch below this list). I could have fixed this by increasing the stack size of my simulator, but that seems a bit unrealistic.
  • The DSP-optimized schedules for the supported operators have specific requirements on the data and kernel layouts used in the model, so the “ConvertLayout” pass might be required to actually use these schedules. Unfortunately, this cannot be achieved via the TVMC command line, as “--desired-layout” is quite limited in its current form: (A) it only affects convolutions, so pooling layouts cannot be changed; (B) non-default kernel layouts (like HWOI for NHWC conv2d, as required by the SIMD microkernels) cannot be enforced via the CLI; (C) a single data layout is used for the complete graph, so you cannot enforce NCHW for nn.avg_pool2d and NHWC for nn.conv2d at the same time.
  • Currently, during the legalization pass of “qnn.conv2d” for the arm_cpu device, the 8-bit inputs of a TFLite pre-quantized model are cast to 16-bit integers unless is_fast_int8_on_arm or is_aarch64_arm holds. This reduces the achievable speedup of the implemented microkernels. Can we avoid this by checking for the availability of the DSP extensions inside https://github.com/apache/tvm/blob/97b3076c3532f73a9d9eeba26a3f329f8e0f803d/python/tvm/relay/qnn/op/legalizations.py#L405?
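For reference, here is the back-of-the-envelope arithmetic behind the stack corruption mentioned above (dimensions from toycar.tflite; the 16 kB stack size is specific to my simulator setup):

```python
# Footprint of the bb_pad scratchpad that the int8 GEMM microkernel
# places on the stack: int16_t bb_pad[N * K] with N x K = 640 x 128.
N, K = 640, 128
elem_size = 2                      # sizeof(int16_t)
bb_pad_bytes = N * K * elem_size   # 163840 bytes = 160 kB
stack_bytes = 16 * 1024            # 16 kB stack in my simulator setup
print(bb_pad_bytes > stack_bytes)  # True: roughly 10x over budget
```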

I would really appreciate any answers and look forward to improving the usability of this feature in the future.

CCing those who have been involved in the PR: @areusch @Mousius @ilyag-grovety @mehrdadh (+ sergey-grovety, GermanTretiakov, Alex-grovety, u99127, whom I could not find on Discuss)

Hi @PhilippvK, thanks for all these questions! I’ll do my best to answer them below.

I don’t believe so, and we should add these.

I’ll defer to @ilyag-grovety on these two. I believe they did some testing and found that those thresholds gave the best performance. Given that they are fairly low, I’d expect this to be relatively cache-independent, but I could be convinced otherwise. Additionally, vectorization may produce different results here.

I believe that should’ve gotten converted into a dynamic allocation (if not using USMP) or an access into a memory pool (if using USMP). Could you say more about which option you had selected? cc @manupa-arm in case he knows more here.

In general, TVM’s builtin layout transformations tend to happen at the importer level, since layout transforms inserted into the model during compilation can be expensive. However, Relax will enable TVM to do more graph-level exploration here. Could you capture this in a GH issue so we can have a better look at (C) as Relax lands?
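For completeness: outside of TVMC, ConvertLayout already accepts per-operator layouts from Python, which covers (A) and (B) and part of (C). A minimal sketch, assuming `mod` is an imported Relay module and that layout conversion is registered for the pooling op:

```python
import tvm
from tvm import relay

# Per-operator desired layouts: [data_layout, kernel_layout] for ops with
# weights, [data_layout] otherwise. This expresses what --desired-layout
# cannot today: HWOI kernels for NHWC conv2d, and NCHW only for pooling.
desired_layouts = {
    "nn.conv2d": ["NHWC", "HWOI"],
    "nn.avg_pool2d": ["NCHW"],
}
seq = tvm.transform.Sequential(
    [relay.transform.ConvertLayout(desired_layouts)]
)
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)  # `mod` is the imported Relay module (assumption)
```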

That seems reasonable at first glance, but IIRC is_fast_int8_on_arm is related to Cortex-A. I think we could use the DSP feature check mentioned in the Target Features RFC for this.
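Something like this rough sketch, assuming the RFC lands with a `has_dsp` flag on the target (the feature name and the surrounding legalization logic are assumptions, not current API):

```python
import tvm

def is_dsp_enabled_arm():
    """Sketch of a DSP-extension check for the qnn.conv2d legalization,
    assuming the Target Features RFC exposes a `has_dsp` flag."""
    target = tvm.target.Target.current(allow_none=False)
    return bool(getattr(target.features, "has_dsp", False))

# In legalizations.py the idea would be (existing helpers elided):
#   if is_fast_int8_on_arm() or is_aarch64_arm() or is_dsp_enabled_arm():
#       return None  # keep the 8-bit inputs for the DSP microkernels
```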

Really appreciate you trying this out, and we’d love any help you’re able to provide here. It’d be great to coordinate via the microTVM roadmap, as I think OctoML will be working on some of this in the next couple of months.

https://github.com/apache/tvm/issues/9022 – I think it might be related, but I would expect USMP to capture the allocate nodes and mutate them away, if USMP is enabled.

For non-USMP flows, we might need to progress on https://github.com/apache/tvm/pull/9950.

Hi @PhilippvK! In general @areusch is right; I’ll just add some details.

In the gemm function the computation time grows quadratically with the matrix dimensions (roughly K*M*N operations). In order to use the optimized kernels (with vector operations), we first have to do data preparation, whose cost grows only linearly with the size (roughly K*(M+N) operations). Thus, a gain is obtained only for sufficiently large N. By measurement we found that for N less than 16 it is more profitable to use a simple loop. We also fall back to loops if we don’t have enough data for the vector instructions or if an address is not aligned for them, and loops are used to compute the “tails” whose size is not a multiple of the desired vector width.
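The break-even point of this model can be checked directly; a simplified Python sketch that ignores alignment and tail handling:

```python
def vectorization_pays_off(M, N, K):
    """Simplified cost model for the int8 GEMM microkernel: a plain loop
    needs ~K*M*N MACs; SIMD halves the MACs (two 8-bit multiplies per
    smlad) but first spends ~K*(M+N) ops widening/packing the inputs."""
    loop_ops = K * M * N
    simd_ops = K * M * N / 2 + K * (M + N)
    return simd_ops < loop_ops

# Break-even requires M*N/2 > M+N, e.g. square tiles win from M=N=5 in
# this idealized model; measured overheads push the threshold up to 16.
print(vectorization_pays_off(1, 640, 128))   # False: M=1 never pays off
print(vectorization_pays_off(16, 16, 128))   # True
```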

AFAIK the microkernels cannot be used together with the USMP feature at the moment. Here is the printed error message:

TVMError:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (it != var_idmap_.end()) is false: Find undefined Variable conv2d

ok!

The overflowing scratch buffer is defined in the hardcoded C microkernels (see https://github.com/apache/tvm/blob/5ad27ef6506b5e50b82ee97f1a0a6aaa5fe0dbbf/python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/gemm.py#L207), so TVM does not know about this allocation at all. But yes, this should use the global workspace buffer instead of the stack.

Thank you very much for these details. However, I still wonder when the DSP-optimized dense implementation would ever be used during inference. At least for TFLite models, the batch size M is always 1, so the condition if ( {M} < 16 || {N} < 16 ) { would always evaluate to true, falling back to the loop implementation.

In the case of M=1 it won’t be used, because it would be inefficient. Roughly, the plain loop needs K*M*N operations. Vectorization (two 8-bit ops in one 16-bit operation) can cut that in half, but it needs K*(M+N) operations to prepare the data for the intrinsic (smlad here). So for M=1 the saving is about K*N/2 while the preparation costs about K*N, and all the profit from vectorization is consumed by the preparation.
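Plugging the dense shape from the earlier posts into that model makes this concrete (taking N=640, K=128 as in the stack-overflow post above; a rough estimate, not a measurement):

```python
# Dense layer with M=1 batch and N x K = 640 x 128 weights.
M, N, K = 1, 640, 128
loop_ops = K * M * N           # 81920 MACs in the plain loop
simd_macs = K * M * N // 2     # 40960 smlad-style MACs...
prep_ops = K * (M + N)         # ...plus 82048 data-preparation ops
print(simd_macs + prep_ops)    # 123008 > 81920: slower than the loop
```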