Why reorder and try_unroll_vec on Skylake: any details?

moderato · February 12, 2020, 8:06pm

Hello! I see in the following tutorial:

https://zhuanlan.zhihu.com/p/75203171

it says that reorder and try_unroll_vec are needed for architectures like Skylake. Is there a detailed explanation of this decision? In this example, reorder and try_unroll_vec are applied to the local accumulation of a small block of output. Is this the only place where these functions should be applied, or are there other situations where we should apply them?

Thanks in advance!

moderato · February 15, 2020, 8:19am

@vinx13 Can you take a look at this question? Thanks!

vinx13 · February 16, 2020, 1:30am

Usually we try to unroll the innermost loop, and reorder can be applied to loops to improve locality. But these are not necessarily the only choices: both can be applied in other places if you think that may improve performance. Usually making the search space larger is helpful (except that it may increase search time). Maybe @FrozenGene can provide more details.
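To make the search-time trade-off concrete, here is a back-of-the-envelope sketch in plain Python. The knob names and candidate counts are made up for illustration and are not taken from the tutorial; the only assumption is that every combination of knob values counts as one configuration a tuner could measure.

```python
import math
from itertools import permutations

# Hypothetical loop axes of a small output block (names are illustrative).
axes = ["vh", "vw", "vc"]

# Trying every loop order of 3 axes gives 3! candidates.
reorder_options = list(permutations(axes))

# Hypothetical candidate counts for the knobs of one schedule.
knob_sizes = {
    "reorder": len(reorder_options),
    "annotate": 7,   # assumed number of unroll/vectorize choices
    "tile_ow": 6,    # assumed number of split factors for one axis
}

# Every combination of knob values is one configuration to measure.
space_size = math.prod(knob_sizes.values())
print(space_size)
```

Each extra knob multiplies the total, which is why a larger space can find better schedules but also takes longer to search.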

FrozenGene · February 21, 2020, 4:58am

Thanks for the interest in this article (I am the author). @vinx13 answered this correctly. It is not restricted to Skylake. try_reorder and try_unroll_vec let AutoTVM tune and try to find the best loop order / schedule primitives so that we can improve loop locality and performance. try_unroll_vec decides whether we unroll / vectorize an axis. For example, unrolling is not always better than leaving the loop as it is, because unrolling produces more instructions, which may no longer fit in the instruction cache, so we get worse performance. So letting the tuner decide with try_unroll_vec is a better solution.
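In AutoTVM this kind of knob is declared with `cfg.define_annotate(...)` and `policy="try_unroll_vec"`. The sketch below is a simplified pure-Python model of what such a knob could enumerate, not the library's actual code. The function name `unroll_vec_candidates` and the exact candidate set are assumptions for illustration, and the real policy may generate a different list.

```python
def unroll_vec_candidates(axes):
    """Hypothetical model of per-axis annotation choices a tuner could try."""
    n = len(axes)
    # Candidate 0: leave every loop untouched.
    candidates = [tuple(["none"] * n)]
    # Unroll a growing suffix of the loop nest, starting from the innermost axis.
    for start in reversed(range(n)):
        ann = ["none"] * n
        for j in range(start, n):
            ann[j] = "unroll"
        candidates.append(tuple(ann))
    # Same idea, but vectorize the innermost axis and unroll the axes above it.
    for start in reversed(range(n)):
        ann = ["none"] * n
        for j in range(start, n - 1):
            ann[j] = "unroll"
        ann[n - 1] = "vec"
        candidates.append(tuple(ann))
    return candidates


# Illustrative axis names for a small output block.
for ann in unroll_vec_candidates(["vh", "vw", "vc"]):
    print(ann)
```

The tuner measures each candidate, so "no unrolling at all" competes on equal terms with "unroll everything" instead of being ruled out up front.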

moderato · February 21, 2020, 9:16am

@vinx13 @FrozenGene Thank you both for the answers! In fact, the size of the search space is one of my concerns. I totally understand that we always want to try more possibilities, but I’m also curious whether architectural information, e.g. Skylake’s microarchitecture, can help us make a few higher-level decisions about the schedule so as to keep the search space size reasonable. Any thoughts here?

FrozenGene · February 21, 2020, 10:02am

I don’t think we can assume that. We will meet different split factors, so the information about the inner axes differs from case to case. This is why we need try_unroll_vec. For vectorize, we could assume that we always vectorize the innermost axis; I agree with that. But for unroll we can’t make such an assumption. You could try_unroll the axis.
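A small plain-Python sketch of why the split factor matters here. It is not TVM code; the helper name `split_candidates` and the axis length 28 are assumptions chosen for illustration. Each split factor gives the inner loop a different extent, and fully unrolling that loop would emit that many copies of the body.

```python
def split_candidates(extent):
    """Return every (outer, inner) pair with outer * inner == extent."""
    return [(extent // f, f) for f in range(1, extent + 1) if extent % f == 0]


# Hypothetical axis of length 28.
for outer, inner in split_candidates(28):
    # Fully unrolling the inner loop emits `inner` copies of the loop body.
    print(f"outer={outer:2d} inner={inner:2d} -> {inner} unrolled copies")
```

An inner extent of 2 and an inner extent of 28 lead to very different code sizes, so a fixed "always unroll" rule cannot suit every split.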

moderato · February 21, 2020, 7:38pm

OK, I see. I can think of an example where unrolling might outperform vectorization: say, when the length of the innermost axis is not divisible by 8 (the number of FP32 lanes) on an AVX2 machine like mine. It’s said that mixing AVX and non-AVX instructions is penalized. I suppose that’s close to what you’re saying about try_unroll_vec, right?
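The divisibility point can be checked with a few lines of plain Python. This is only lane arithmetic under the assumption of 256-bit vectors and 32-bit floats; the helper name `vector_body_and_tail` is made up, and the sketch says nothing about what code TVM or LLVM actually emits for the leftover elements.

```python
def vector_body_and_tail(extent, vector_bits=256, elem_bits=32):
    """Split a loop extent into full vectors plus leftover scalar elements."""
    lanes = vector_bits // elem_bits          # 8 FP32 lanes for 256-bit vectors
    full_vectors, tail = divmod(extent, lanes)
    return lanes, full_vectors, tail


# Hypothetical innermost-axis lengths.
for extent in (16, 20, 7):
    lanes, full, tail = vector_body_and_tail(extent)
    print(f"extent={extent}: {full} full vectors of {lanes} lanes, {tail} leftover")
```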

Speaking of this, I recently posted another vectorization-related question. It’s about TVM not vectorizing the innermost axis if the split factor of the second innermost axis does not divide the original axis length. Could you take a look as well?

Sorry for bugging you with so many questions. I do appreciate your time looking into them!