Hello! I found that TVM doesn’t vectorize a loop when the axis length is not divisible by the split factor. I discussed it with @FrozenGene in the following post.
This problem can significantly hurt runtime performance, and it leaves users very few choices for, say, the smallest block size in GEMM, because only a limited number of block sizes (those that divide the axis length evenly) result in vectorized code after code generation.
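To make the issue concrete, here is a minimal plain-Python sketch of the loop structure involved. It does not call any TVM API, and the function names `split_with_guard` and `split_with_peeled_tail` are made up for illustration. The first function mimics a split whose factor does not divide the axis length, so every inner iteration needs a bounds check. The second shows one possible alternative, assuming the loop is peeled into guard-free full blocks plus a short tail; I am not claiming this is what TVM does or plans to do.

```python
def split_with_guard(n, factor):
    """Visit 0..n-1 through an outer/inner split whose factor may not divide n.

    Each inner iteration carries a bounds check; a per-iteration guard of this
    kind is what gets in the way of turning the inner loop into one vector op.
    """
    visited = []
    outer_extent = (n + factor - 1) // factor  # ceil(n / factor)
    for outer in range(outer_extent):
        for inner in range(factor):
            idx = outer * factor + inner
            if idx < n:  # only needed because factor may not divide n
                visited.append(idx)
    return visited


def split_with_peeled_tail(n, factor):
    """Cover the same indices as full blocks (no guard) plus a leftover tail."""
    main_extent = (n // factor) * factor
    # Every block here has exactly `factor` elements, so it could map to a vector op.
    full_blocks = [
        list(range(start, start + factor))
        for start in range(0, main_extent, factor)
    ]
    # The remaining n % factor elements are handled separately.
    tail = list(range(main_extent, n))
    return full_blocks, tail
```

For example, with an axis length of 10 and a factor of 4, the peeled version yields two full blocks of 4 and a tail of 2, while the guarded version checks the bound on all 12 inner iterations.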
Here’s an article discussing GEMM optimization on an AVX2 machine: https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0. It mentions that the best smallest block size for GEMM implemented with instruction sets like AVX2 is sometimes somewhat “uncommon”, e.g. 2x5, 3x4, etc.
I wonder if TVM has any plans to add a feature that avoids this situation? Any advice is appreciated!