TVM and BLAS libraries

If I build TVM with OpenBLAS or ATLAS and specify -libs=cblas, inference gets ~3% slower than the baseline, while I don’t see any change at all when building with MKL or MKL-DNN.

You might want to experiment with environment variables that impact MKL. For example, OMP_NUM_THREADS, KMP_BLOCKTIME, OMP_NESTED, etc.
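For reference, a minimal sketch of setting those variables from Python; they have to be in place before MKL/OpenMP spins up its thread pool, and the values below are placeholders to tune per machine:

```python
import os

# Placeholder values only; tune per machine. Set these before the first
# MKL/OpenMP parallel region runs (i.e., before inference starts).
os.environ["OMP_NUM_THREADS"] = "8"    # threads available to MKL's OpenMP runtime
os.environ["KMP_BLOCKTIME"] = "1"      # ms a worker spins before sleeping
os.environ["OMP_NESTED"] = "FALSE"     # disable nested parallel regions

import tvm  # import TVM (and run inference) only after the variables are set
```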

@jonso My inference runs single-threaded, so I don’t think those env variables would help, right?

@haichen is it normal to get exactly the same performance w/ and w/o specifying -libs=cblas when TVM is built with MKL support? Maybe MKL isn’t used at all?

@comaniac if no external library is specified in the target, which backend does TVM use?

I don’t think you will see much difference with a single thread. Libraries like MKL are built to exploit parallelism.

Personally, I see a significant difference with and without -libs=cblas when using multiple threads.


It depends. What’s your workload? Is it just a dense op or an entire network? If you profile an entire network, dense may not be the performance bottleneck, so the impact of using CBLAS would be diluted. A few months ago I ran an experiment performing a dense op on 512x512 matrices with and without CBLAS; the CBLAS version was ~1.25x faster.

If no external library is specified (e.g., the target is just llvm), TVM generates LLVM IR and lowers it to machine code directly. In that case, it’s suggested to at least specify the CPU model so TVM can use AVX instructions, e.g. llvm -mcpu=skylake-avx512.
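For what it’s worth, here is a minimal sketch of that kind of single-op comparison using the current graph_executor API. The shapes and target strings are only illustrative, and the -libs=cblas build assumes TVM was compiled against a CBLAS implementation (OpenBLAS, ATLAS, or MKL):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# A single 512x512 dense layer, roughly matching the experiment above.
data = relay.var("data", shape=(512, 512), dtype="float32")
weight = relay.var("weight", shape=(512, 512), dtype="float32")
mod = tvm.IRModule.from_expr(
    relay.Function([data, weight], relay.nn.dense(data, weight)))

dev = tvm.cpu(0)
x = np.random.rand(512, 512).astype("float32")
w = np.random.rand(512, 512).astype("float32")

for target in ("llvm -mcpu=skylake-avx512",              # pure TVM codegen
               "llvm -mcpu=skylake-avx512 -libs=cblas"):  # offload dense to BLAS
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target)
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input("data", x)
    m.set_input("weight", w)
    timer = m.module.time_evaluator("run", dev, number=100, repeat=3)
    print(target, "->", timer().mean * 1e3, "ms")
```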

@comaniac The workload is a complete, ResNet-like network; there’s only one dense layer at the very top. When I build TVM with OpenBLAS/ATLAS and enable -libs=cblas, performance gets ~3% worse. When I build TVM with MKL, there’s no difference in performance w/ or w/o -libs=cblas. This behaviour is kind of unexpected imho.

By the way, I’m already specifying -mcpu in the target.

For most neural-network cases we would expect TVM to do better than the BLAS path if the workload is not typical, as the generated code can benefit from things like operator fusion and shape-specific tuning.


@tqchen Got it, thank you! In my case I’d still expect to see different performance with BLAS compared to the TVM baseline, but that doesn’t happen with MKL or MKL-DNN, almost as if those BLAS libraries aren’t used and the computation falls back to the TVM implementation.

Any other comment on this?
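One way to sanity-check that suspicion is to probe for the packed function that TVM registers when it is actually built against a CBLAS implementation (a hedged sketch; the function name is the one registered by TVM’s CBLAS contrib code):

```python
import tvm

# If TVM was built with USE_BLAS set to openblas/atlas/mkl, the CBLAS contrib
# code registers this packed function. If this prints None, the external
# library never made it into the libtvm build.
f = tvm.get_global_func("tvm.contrib.cblas.matmul", allow_missing=True)
print("cblas matmul registered:", f is not None)
```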

@haichen can you share what kind of performance improvement you obtained by using MKL-DNN backend?

From my experience with the BERT base model on an EC2 c5.4xlarge instance, I can reduce the latency from 93ms to 52ms just by switching the dense ops from the TOPI implementation (tuned with AutoTVM) to MKL-DNN with the OpenMP thread pool. In particular, the total latency of all dense ops in BERT drops from 65.7ms to 29.2ms.


Is this just from enabling the MKL-DNN backend when building TVM and adding -libs=cblas to the target?

Yes, that’s correct. Also set USE_OPENMP to gnu.


@haichen, could you kindly share your procedure to reproduce the BERT base model results?

I plan to write a blog post about how to reproduce the BERT base model performance using TVM. I’ll let you know after I post it.


@haichen @gasgallo I have the same situation as @gasgallo: there is no performance improvement with MKL-DNN. The model is a UNet CNN, and here are the relevant options in my config.cmake:

set(USE_BLAS mkl)
set(USE_MKL_PATH /home/abc/sdk/intel/mkl)
set(USE_MKLDNN /home/abc/sdk/dnnl_lnx_1.1.1_cpu_gomp)
set(USE_OPENMP gnu)
1. When running with llvm, the inference time is about 400ms.
2. When running with llvm -libs=cblas, the inference time is about 400ms (no improvement).
3. When running with llvm -mcpu=skylake, the inference time is about 200ms (a large improvement).

It seems MKL-DNN doesn’t work here; however, when I use the MXNet framework with MKL-DNN, it does bring a big improvement.

Currently USE_MKLDNN can only be ON or OFF; it doesn’t support a customized library path and relies on CMake to find the MKLDNN library location. See here.

If MKLDNN is enabled, you should find the following line in the CMake output:

Use MKLDNN library /path/to/mkldnn

@haichen Thanks for the info. My MKLDNN path is /home/abc/sdk/dnnl_lnx_1.1.1_cpu_gomp, but when I set USE_MKLDNN to ON there is no “Use MKLDNN library /path/to/mkldnn” line in the CMake output. It seems MKLDNN is not found. How can I set the MKLDNN path using CMake parameters?

@7oud Sorry about the late response. I pushed an update on this, and now you can specify a customized location for the MKLDNN library.

Any updates? I’m getting similar results here.