OK, I see. One case I can think of where unrolling might outperform vectorization is when the innermost axis length is not divisible by 8 (the number of FP32 lanes) on an AVX2 machine like mine. I've heard that mixing AVX and non-AVX instructions incurs a penalty. I suppose that's close to what you're saying about try_unroll_vec, right?
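To make that case concrete, here is a rough sketch in plain Python (not TVM code; `split_for_vectorization` and `AVX2_FP32_LANES` are names I made up for illustration) of how an innermost extent breaks down into full 8-lane FP32 vectors plus a scalar tail, assuming 256-bit AVX2 registers:

```python
# Assumption: AVX2 registers are 256 bits wide, so they hold 8 FP32 values.
AVX2_FP32_LANES = 256 // 32


def split_for_vectorization(extent, lanes=AVX2_FP32_LANES):
    """Return (full vector iterations, leftover scalar elements).

    Hypothetical helper for illustration only; it is not part of TVM.
    """
    full_vectors, tail = divmod(extent, lanes)
    return full_vectors, tail


for extent in (64, 30, 7):
    full_vectors, tail = split_for_vectorization(extent)
    print(f"extent={extent}: {full_vectors} full vectors, {tail} tail elements")
```

With an extent of 30, for example, only 3 iterations fill a whole vector and 6 elements are left over for a scalar tail, which is the kind of shape where I'd guess unrolling could come out ahead.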
Speaking of which, I recently posted another vectorization-related question. It's about TVM not vectorizing the innermost axis when the split factor of the second-innermost axis does not evenly divide the original axis length. Could you take a look at that as well?
Sorry for bugging you with so many questions. I do appreciate your time looking into them!