Hello! Suppose my CPU supports AVX2, which operates on 256-bit registers (8 FP32 lanes). Does that mean that in AutoTVM we can always write a schedule like
# (suppose the length of x is 32)
xo, xi = s[A].split(x, factor=8)
s[A].unroll(xo)
s[A].vectorize(xi)
so that we can avoid searching over split sizes for the x axis? Does a direct vectorization of x, like
s[A].vectorize(x)
generate different assembly code and perform differently from the example above?
If you know the optimal split size (e.g., from the register width), you can split directly without searching.
s[A].vectorize(x), on the other hand, asks to vectorize the whole loop, which is impossible in many cases. On CPU, LLVM ultimately decides how to handle such a vectorization request.
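To make the difference concrete, here is a hedged sketch (plain Python, no TVM required; the constants `N` and `FACTOR` are taken from the question) of the loop nest that the split schedule produces: the single length-32 loop becomes an outer loop of 4 iterations (the unrolled `xo`) around an inner loop of 8 lanes (the vectorized `xi`):

```python
# Model of the loop nest after s[A].split(x, factor=8) for a length-32 axis.
# This is an illustration of the iteration-space transformation only, not
# actual TVM-generated code.

N = 32          # length of x (from the question)
FACTOR = 8      # AVX2: 256 bits / 32-bit float = 8 lanes

original = list(range(N))

split = []
for xo in range(N // FACTOR):      # 4 iterations; s[A].unroll(xo) would unroll these
    for xi in range(FACTOR):       # 8 lanes; s[A].vectorize(xi) maps these to one AVX2 op
        split.append(xo * FACTOR + xi)

# The split only restructures the loop bookkeeping; every element is still
# visited exactly once, in the same order.
assert split == original
```

Because the inner extent exactly matches the vector width, each `xi` loop corresponds to one full 256-bit operation, which is why no search over split sizes is needed when the register width is known.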
Just to make it clear: are you saying that LLVM might produce better-performing code for s[A].vectorize(x) than for splitting x with factor equal to the register width? Or will LLVM automatically generate the same code as the split version?
If the vectorization is impossible due to hardware constraints, in the worst case it may generate an ordinary scalar loop (even if the loop is marked as vectorized in the TVM IR).
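As a hedged illustration of that fallback (plain Python, no TVM; the trip count 30 is my own example, not from the thread): when the loop length is not a multiple of the vector width, a compiler typically emits a vectorized main loop plus a scalar remainder loop, and in the worst case the whole loop stays scalar:

```python
# Model of the "main vector loop + scalar tail" pattern a compiler may emit
# when full vectorization is impossible. Illustration only.

N = 30          # hypothetical length, not a multiple of the 8-lane width
FACTOR = 8

main_iters = N // FACTOR          # 3 full 8-lane vector iterations
tail_iters = N % FACTOR           # 6 leftover scalar iterations

processed = []
for xo in range(main_iters):      # vectorized main body
    for xi in range(FACTOR):
        processed.append(xo * FACTOR + xi)
for x in range(main_iters * FACTOR, N):  # scalar tail loop
    processed.append(x)

# All N elements are still processed exactly once.
assert processed == list(range(N))
```

So marking a loop as vectorized in the TVM IR is a request to the backend; the generated assembly can range from one vector instruction per 8 elements down to a fully scalar loop, depending on what the hardware and LLVM can actually do.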