Motivation
TVM has shown satisfactory performance on MLP models on CPU. However, there are still deficiencies in the assembly code generated by LLVM that prevent AutoTVM/AutoScheduler from reaching optimal performance on GEMM.
LIBXSMM is an open-source library developed by Intel Labs for accelerating small matrix multiplications. It leverages a JIT code generator to produce highly efficient GEMM kernels for x86 CPUs, whose performance can come very close to the hardware roofline. According to our evaluation, on "small" GEMMs (cube_root(m, n, k) <= 256), LIBXSMM shows superior performance over the well-known BLAS library Intel MKL:
This evaluation was run on Intel clx-8255c, which has a single-core peak performance of 153 GFLOPS. We can see that LIBXSMM surpasses MKL on all shapes and already comes very close to peak.
Moreover, since LIBXSMM can generate quite efficient GEMM kernel implementations, it is also an ideal substitute for the inner kernel of normal-sized GEMM. According to our experiments, the AutoTVM templates we wrote with LIBXSMM for register-block generation achieve much higher performance than MKL or the existing TOPI implementation:
This result was collected on a real-world model running on Intel clx-8255c. Each instance is assigned 6 cores, so the peak performance would be 153 GFLOPS x 6 = 918 GFLOPS. We can see that the LIBXSMM implementation outperforms MKL and the existing AutoTVM implementation by roughly 2-3x on almost all shapes. Overall, the improvement for the model is 2.3x.
Proposal
This proposal aims to integrate LIBXSMM into TVM, both to accelerate small GEMM directly and to serve as the inner kernel for accelerating normal-sized GEMM.
We propose to integrate LIBXSMM with TVM through the following three components:
- Add an extern call "tvm.contrib.libxsmm.gemm" in the "contrib" directory;
- Use BYOC to accelerate small GEMM (cube_root(m, n, k) <= 256);
- Integrate our AutoTVM template into TOPI as a GEMM implementation candidate.
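To make the BYOC offload criterion concrete, here is a minimal sketch of the "small GEMM" predicate a partitioner could apply to each dense/matmul op. The function name `is_small_gemm` is hypothetical, and it assumes `cube_root(m, n, k)` denotes the cube root of the product `m * n * k`; neither is fixed by this proposal.

```python
# Hypothetical sketch of the small-GEMM check used to decide whether a
# GEMM of shape (m x k) @ (k x n) should be offloaded to LIBXSMM via BYOC.
# Assumption: cube_root(m, n, k) means (m * n * k) ** (1/3).

def is_small_gemm(m: int, n: int, k: int, threshold: int = 256) -> bool:
    """Return True if the GEMM falls within LIBXSMM's small-GEMM sweet spot."""
    # Compare against threshold**3 instead of taking a floating-point cube
    # root, which avoids rounding issues at the boundary.
    return m * n * k <= threshold ** 3

# A 256x256x256 GEMM sits exactly on the boundary and would be offloaded;
# a 512x512x512 GEMM would fall back to the regular TVM/MKL path.
```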