I recently tried out the mprofile schedules, which use tensorization to replace the innermost loops of convolution, dense, and pooling layers with hand-written C microkernels (see https://github.com/apache/tvm/pull/9233). A few questions came up, and I would be happy if someone could answer them.
- Is there a specific reason why support for depthwise convolutions was not implemented? Maybe it would be a good idea to add schedules for this as well.
- The microkernels contain threshold values that trigger a fallback to a loop-based implementation if some input dimensions are too small. How were the optimal values for these thresholds determined, and do they hold across devices?
- I was wondering why the DSP-optimized variant of the “nn.dense” operator was never invoked for several TFLite models. After a look at the schedule definition (https://github.com/apache/tvm/blob/fc419df32f052e21f614c8940699c10a2d696689/python/tvm/topi/arm_cpu/mprofile/dsp/dense.py#L41) and the microkernel (https://github.com/apache/tvm/blob/fc419df32f052e21f614c8940699c10a2d696689/python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/gemm.py#L474), it turned out that a batch size (M) greater than 2 (int16) or 16 (int8) is required to make use of the optimized code. Is there a specific reason for this? If a GEMM microkernel does not perform well for the inference of a fully connected layer, wouldn’t it make more sense to use a simple dot product instead?
- When commenting out the threshold check for the int8 GEMM microkernel (see https://github.com/apache/tvm/blob/fc419df32f052e21f614c8940699c10a2d696689/python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/gemm.py#L316) to enable accelerated “nn.dense” support for a batch size of 1, I ran into a major issue: the stack got corrupted when using the toycar.tflite model, because the microkernels allocate a local scratch-pad buffer `int16_t bb_pad[{bb_pad_size}];` of size N × K = 640 × 128, i.e. 163,840 bytes of int16 data, which far exceeds the 16 kB stack I use. I could have fixed this by increasing the stack size of my simulator, but that seems a bit unrealistic.
- I can see that the DSP-optimized schedules for the supported operators have specific requirements on the data and kernel layouts used in the model, so the “ConvertLayout” pass might be required to actually use these schedules. Unfortunately this cannot be achieved using the TVMC command line, as the “--desired-layout” option is quite limited in its current form: (A) it only affects convolutions, so pooling layouts cannot be changed; (B) non-default kernel layouts (like HWOI for NHWC conv2d, as required by the SIMD microkernels) cannot be enforced via the CLI; (C) a single data layout is applied to the complete graph, so you cannot enforce NCHW for nn.avg_pool2d and NHWC for nn.conv2d at the same time.
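For reference, the Python API already allows per-operator layouts, which is what I would like the CLI to express as well. A sketch of what I currently do instead of TVMC (assuming the standard `relay.transform.ConvertLayout` pass; the concrete layout choices are from my use case):

```python
# Per-operator desired layouts that --desired-layout cannot currently
# express. Keys are Relay op names; for conv2d the value is
# [data_layout, kernel_layout].
desired_layouts = {
    "nn.conv2d": ["NHWC", "HWOI"],  # HWOI kernels for the SIMD microkernels
    "nn.avg_pool2d": ["NCHW"],      # a different data layout for pooling
}

# Applied to a Relay module `mod` roughly like this (sketch):
#   from tvm import relay
#   import tvm
#   seq = tvm.transform.Sequential(
#       [relay.transform.ConvertLayout(desired_layouts)])
#   with tvm.transform.PassContext(opt_level=3):
#       mod = seq(mod)
```

Exposing something like this dict through TVMC would resolve all three limitations (A)–(C) at once.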
- Currently, during the legalization pass of “qnn.conv2d” for the arm_cpu target, the 8-bit inputs of a TFLite pre-quantized model are cast to 16-bit integers unless `is_fast_int8_on_arm` or `is_aarch64_arm` holds. This reduces the achievable speedup of the implemented microkernels. Can we avoid this by also checking for the availability of the DSP extensions inside https://github.com/apache/tvm/blob/97b3076c3532f73a9d9eeba26a3f329f8e0f803d/python/tvm/relay/qnn/op/legalizations.py#L405?
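To make the suggestion concrete, such a check could derive DSP availability from the target’s `-mcpu`/`-mattr` values. The helper below is purely hypothetical (the function name and attribute handling are my assumptions, not existing TVM API); it encodes that Cortex-M4/M7 always implement the DSP extension, while on Cortex-M33/M35P it is optional:

```python
# Hypothetical sketch of a DSP-availability check for the legalization
# pass; the function name and the "+dsp" attribute convention are
# assumptions, not existing TVM API.
def has_dsp_extension(mcpu, mattr=()):
    """Return True if the target CPU provides the Armv7E-M/Armv8-M DSP
    instructions (SMLAD and friends) used by the mprofile microkernels."""
    # Cortex-M4 and Cortex-M7 always implement the DSP extension.
    if mcpu in ("cortex-m4", "cortex-m7"):
        return True
    # On Cortex-M33 and Cortex-M35P the extension is optional, so it
    # would have to be requested explicitly via a target attribute.
    if mcpu in ("cortex-m33", "cortex-m35p"):
        return "+dsp" in mattr
    return False
```

With something like this in place, `qnn.conv2d` legalization could keep int8 inputs whenever the DSP path is actually usable, analogous to the existing `is_fast_int8_on_arm` check.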
I would really appreciate any answers and look forward to improving the usability of this feature in the future.
CCing those who have been involved in the PR: @areusch @Mousius @ilyag-grovety @mehrdadh (+ sergey-grovety, GermanTretiakov, Alex-grovety, u99127, whom I could not find on Discuss)