[RFC][TOP][BYOC] Intel LIBXSMM Integration

Motivation

TVM has shown satisfactory performance on MLP models on CPU. However, there are still defects in the assembly code generated by LLVM that block AutoTVM/AutoScheduler from achieving optimal performance on GEMM.

LIBXSMM is an open-source library developed by Intel Labs for accelerating small matrix multiplications. It uses a JIT code generator to produce highly efficient GEMM kernels for x86 CPUs, which can come very close to the hardware roofline. According to our evaluation on “small” GEMM (cube_root(M * N * K) <= 256), LIBXSMM shows superior performance over the well-known BLAS library Intel MKL:

This evaluation was run on Intel clx-8255c, which has a single-core peak performance of 153 GFLOPS. LIBXSMM surpasses MKL on all shapes and comes very close to that peak.

Moreover, given that LIBXSMM generates very efficient GEMM kernels, it is also an ideal substitute for the inner kernel of normal-sized GEMM. According to our experiments, the AutoTVM template we wrote, which uses LIBXSMM for register-block code generation, achieves much higher performance than MKL or the existing TOPI implementation:

This result was collected on a real-world model running on Intel clx-8255c. Each instance is assigned 6 cores, so the peak performance is 153 GFLOPS x 6 = 918 GFLOPS. The LIBXSMM implementation outperforms MKL and the existing AutoTVM implementation by roughly 2-3x on almost all shapes, and the overall improvement for the model is 2.3x.

Proposal

This proposal aims to integrate LIBXSMM into TVM, both to accelerate small GEMM and to serve as an inner kernel for normal-sized GEMM.

We propose to integrate LIBXSMM with TVM through the following three components:

  1. Add an extern call “tvm.contrib.libxsmm.gemm” under the “contrib” directory (a sketch follows this list);
  2. Use BYOC to accelerate small GEMM (cube_root(M * N * K) <= 256);
  3. Integrate our AutoTVM template into TOPI as a GEMM implementation candidate.
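
To illustrate item 1, here is a minimal sketch of the Python-side wrapper, modeled on the existing tvm.contrib.cblas wrapper. The module path and function name are assumptions, not the final design:

```python
# python/tvm/contrib/libxsmm.py -- hypothetical module, modeled on tvm.contrib.cblas
import tvm
from tvm import te


def matmul(lhs, rhs, transa=False, transb=False):
    """Create a te.extern op that dispatches to the (assumed) packed
    function "tvm.contrib.libxsmm.gemm" registered by the C++ runtime."""
    n = lhs.shape[1] if transa else lhs.shape[0]
    m = rhs.shape[0] if transb else rhs.shape[1]
    return te.extern(
        (n, m),
        [lhs, rhs],
        lambda ins, outs: tvm.tir.call_packed(
            "tvm.contrib.libxsmm.gemm", ins[0], ins[1], outs[0], transa, transb
        ),
        name="C",
    )
```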

Thanks for the proposal! It would definitely be super helpful to integrate LIBXSMM into TVM!

To make sure it’s accurate, by

Add an extern call “tvm.contrib.libxsmm.gemm” under the “contrib” directory;

we are referring to src/runtime/contrib. Is that correct?

Exactly. Specifically, we’d like to add a Python interface in python/tvm/contrib and the C++ implementation in src/runtime/contrib. Thank you, Junru.
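
For context, usage would then mirror the other contrib BLAS bindings. A minimal sketch, assuming the hypothetical wrapper above and a packed function "tvm.contrib.libxsmm.gemm" registered via TVM_REGISTER_GLOBAL in src/runtime/contrib:

```python
import numpy as np
import tvm
from tvm import te
from tvm.contrib import libxsmm  # hypothetical module from the sketch above

n, k, m = 64, 64, 64
A = te.placeholder((n, k), name="A")
B = te.placeholder((k, m), name="B")
C = libxsmm.matmul(A, B)  # emits the extern call

s = te.create_schedule(C.op)
f = tvm.build(s, [A, B, C], target="llvm")

dev = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(n, k).astype("float32"), dev)
b = tvm.nd.array(np.random.rand(k, m).astype("float32"), dev)
c = tvm.nd.array(np.zeros((n, m), dtype="float32"), dev)
f(a, b, c)  # C = A x B computed by the LIBXSMM-backed packed function
```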

Thanks for the proposal; it does look useful. One question I have: does LIBXSMM support an epilogue (e.g., ReLU, bias add) after GEMM, or does it focus on a single GEMM operator only? If it only covers a single operator, then BYOC seems unnecessary; we could simply register it as another extern schedule, as with CBLAS.

Yes, LIBXSMM does have epilogue support now, including bias, ReLU, softmax, tanh, and GELU fusion. So I think BYOC would help. Thank you for the suggestion!
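
For reference, the BYOC integration could expose these fused epilogues through a pattern table, in the same style as the existing DNNL BYOC integration. The "libxsmm" compiler name and pattern labels below are assumptions for illustration:

```python
from tvm.relay.dataflow_pattern import is_op, wildcard
from tvm.relay.op.contrib.register import register_pattern_table


def make_gemm_pattern(with_bias=False, act=None):
    """Match nn.dense, optionally followed by bias_add and an activation."""
    gemm = is_op("nn.dense")(wildcard(), wildcard())
    if with_bias:
        gemm = is_op("nn.bias_add")(gemm, wildcard())
    if act is not None:
        gemm = is_op(act)(gemm)
    return gemm


@register_pattern_table("libxsmm")  # hypothetical compiler name
def libxsmm_pattern_table():
    # Ordered from most to least specific so larger fusions match first.
    return [
        ("libxsmm.dense_bias_relu", make_gemm_pattern(with_bias=True, act="nn.relu")),
        ("libxsmm.dense_bias", make_gemm_pattern(with_bias=True)),
        ("libxsmm.dense", make_gemm_pattern()),
    ]
```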


Thanks for the clarification. Yes, in this case BYOC would be a more reasonable solution. Meanwhile, could you file an official RFC here for your proposal? It would definitely make your integration process much smoother.

@zhuwenxi Are you going to add quantized kernels as well? AVX-512 + VNNI? I wonder how they compare against oneDNN kernels.

CC @denise if you are interested 🙂


I am very interested! For those who don’t already know: I have experience working with LIBXSMM from my time at Intel, and some of that work landed in an SC21 paper that I coauthored.

The paper comes from Intel itself and presents data, citing many sources, showing that the LIBXSMM approach beats oneDNN in many cases. It’s exciting that the community is considering adding LIBXSMM as a BYOC target; I think it will bring great benefits to TVM’s Intel CPU performance!


Sure, I’ll file an official RFC soon.

LIBXSMM does have quantized kernels using VNNI, but I haven’t evaluated them against oneDNN’s int8 kernels. I think it’s a good idea to add quantized kernels as well; let’s put that in our future plan.

Thank you for the information, Denise! I remember you mentioned that you had already tried to integrate LIBXSMM into LLVM; is that work going well?

Which version of oneDNN was used for testing? Version 2.5 was improved for the AVX-512 case.

We’re actually comparing against MKL rather than oneDNN. The MKL version we used is from the latest oneAPI package.

Shall we conclude this pre-RFC and send a formal RFC to https://github.com/apache/tvm-rfcs/?

The RFC has been merged: rfcs/0046-Intel-LIBXSMM-integration.md


Great. Next time it would be good to link the official RFC when it is opened, so we have good context.

Yeah, I was on vacation and didn’t track this closely. Sorry for the confusion!