I am working with the Intel LLVM team to support code generation for VNNI instructions. A midpoint toward this goal is supporting the vpmaddwd instruction.
The motivating code is:
#include <cstdint>

static const int N = 128;
int16_t A[2*N];
int16_t B[2*N];
int C[N];
for (int i = 0; i != N; ++i)
  C[i] = A[2*i]*B[2*i] + A[2*i+1]*B[2*i+1];
// Each iteration translates to a vpmaddwd instruction, which
// takes two pairs of 16-bit values - |a0|a1| and |b0|b1| - and computes |a0*b0 + a1*b1|,
// with the computation performed in 32 bits in hardware.
// Command -> clang++ exp.cpp -mavx512bw -O3 -S (trunk LLVM)
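For reference, here is a minimal NumPy sketch of the same pairwise-product semantics (the names and the N = 128 size mirror the C++ snippet above; the int32 casts model the 32-bit widening that vpmaddwd performs internally):

import numpy as np

N = 128
A = np.random.randint(-32768, 32768, size=2*N, dtype=np.int16)
B = np.random.randint(-32768, 32768, size=2*N, dtype=np.int16)

# Widen to 32 bits before multiplying, as vpmaddwd does internally,
# then add adjacent products: C[i] = A[2i]*B[2i] + A[2i+1]*B[2i+1].
prod = A.astype(np.int32) * B.astype(np.int32)
C = prod[0::2] + prod[1::2]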
The Intel LLVM team supports this code generation by IR pattern matching (https://reviews.llvm.org/D49636).
However, the IR generated by TVM+LLVM, though semantically the same, has a totally different shape, which causes the pattern matching to fail.
The relevant TVM code is:
import tvm
import numpy as np

N = 128  # example size, matching the C++ snippet above

A = tvm.placeholder((N,), name='A', dtype='int16')
B = tvm.placeholder((N,), name='B', dtype='int16')
C = tvm.compute((N // 2,),
                lambda i: (A[2*i].astype('int32') * B[2*i].astype('int32'))
                        + (A[2*i + 1].astype('int32') * B[2*i + 1].astype('int32')),
                name='C')
s = tvm.create_schedule(C.op)
oi, ii = s[C].split(s[C].op.axis[0], factor=16)
s[C].vectorize(ii)
print(tvm.lower(s, [A, B, C], simple_mode=True))

target = 'llvm -mcpu=skylake-avx512'
ctx = tvm.context(target, 0)
a = tvm.nd.array(np.ones((N,), dtype='int16'), ctx)
b = tvm.nd.array(np.ones((N,), dtype='int16'), ctx)
c = tvm.nd.array(np.zeros((N // 2,), dtype='int32'), ctx)  # C is int32, not int16
func = tvm.build(s, [A, B, C], target, name='mmult')
func.save("baseline.s")
func.save("baseline.ll")
The key difference between the clang-generated IR and the TVM-generated IR is that the clang IR is more optimized, using vector loads plus shuffle instructions, whereas the TVM IR loads scalars one by one. Intel's LLVM pattern matching does not handle this form.
So, the question is: should the IR be optimized in TVM to perform vector loads + shuffles?
Or should the LLVM backend support pattern matching for all possible combinations?
Or, more broadly, where do we draw the line?