Hello! I found that TVM doesn’t vectorize a loop when the axis length is not divisible by the split factor. I discussed it in the following post with @FrozenGene.
This problem can significantly hurt runtime performance. It also leaves users very few choices for, say, the smallest block size in GEMM, because only the few block sizes that divide the axis length evenly result in vectorized code after code generation.
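To make the situation concrete, here is a small plain-Python sketch of the loop structure I have in mind. It is not TVM code and does not call any TVM API; the function names `split_with_guard` and `split_with_peeled_tail` are made up for illustration. My understanding (an assumption on my side) is that a non-divisible split leaves a bounds check inside the inner loop, and that this check is what prevents vectorization, whereas peeling the remainder into a separate tail loop would keep the main blocks guard-free.

```python
def split_with_guard(n, factor):
    """Visit 0..n-1 via an outer/inner split with a bounds check in the body."""
    outer_extent = (n + factor - 1) // factor  # ceiling division
    visited = []
    for outer in range(outer_extent):
        for inner in range(factor):
            idx = outer * factor + inner
            # When factor does not divide n, the last block overruns the axis,
            # so every inner iteration has to carry this guard.
            if idx < n:
                visited.append(idx)
    return visited


def split_with_peeled_tail(n, factor):
    """Same iteration space, but full blocks and the remainder are separated."""
    full_block_count = n // factor
    # Full blocks need no guard: each one covers exactly `factor` elements.
    full_blocks = [
        list(range(outer * factor, (outer + 1) * factor))
        for outer in range(full_block_count)
    ]
    # Remainder elements are handled by a short scalar tail loop.
    tail = list(range(full_block_count * factor, n))
    return full_blocks, tail


if __name__ == "__main__":
    print(split_with_guard(10, 4))
    print(split_with_peeled_tail(10, 4))
```

With an axis of length 10 and a factor of 4, both versions visit the same ten indices, but only the second one has inner blocks of constant width with no condition in them, which is the shape a vectorizer wants.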
Here’s an article discussing GEMM optimization on an AVX2 machine, “Efficient matrix multiplication” (on GitHub), which mentions that
sometimes the best smallest block size for GEMM implemented with instruction sets like AVX2 is somewhat “uncommon”, e.g. 2x5, 3x4, etc.
Does TVM have any plans for a feature that avoids this situation? Any advice is appreciated!
Thank you for your reply! Is this gonna be a fix that involves a lot of changes? In the above post, @FrozenGene suggested a solution in Halide with a similar idea: