TVM and BLAS libraries

If I build TVM with OpenBLAS or ATLAS and specify -libs=cblas, inference gets ~3% slower than the baseline, while I don’t see any change at all when building with MKL or MKL-DNN.

You might want to experiment with environment variables that impact MKL. For example, OMP_NUM_THREADS, KMP_BLOCKTIME, OMP_NESTED, etc.
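For reference, a minimal sketch of setting those variables from Python; they have to be in place before MKL/OpenMP spins up its thread pool, and the values below are placeholders to tune per machine:

```python
import os

# Placeholder values only; tune per machine. Set these before the first
# MKL/OpenMP parallel region runs (i.e., before inference starts).
os.environ["OMP_NUM_THREADS"] = "8"    # threads available to MKL's OpenMP runtime
os.environ["KMP_BLOCKTIME"] = "1"      # ms a worker spins before sleeping
os.environ["OMP_NESTED"] = "FALSE"     # disable nested parallel regions

import tvm  # import TVM (and run inference) only after the variables are set
```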

@jonso My inference runs single-threaded, so I don’t think those env variables would help, right?

@haichen is it normal to get exactly the same performance w/ and w/o specifying -libs=cblas when TVM is built with MKL support? Maybe MKL isn’t used at all?

@comaniac if no external library is specified in the target, which backend does TVM use?

I don’t think you will see much difference with a single thread. Libraries like MKL are built to exploit parallelism.

Personally, I see a significant difference with and without -libs=cblas when using multiple threads.


It depends. What’s your workload? Is it just a dense op or an entire network? If you profile an entire network, dense may not be the performance bottleneck, so the impact of using CBLAS would be diluted. A few months ago I ran an experiment performing a dense op on 512x512 matrices with and without CBLAS; the CBLAS version was ~1.25x faster.

If no external library is specified (e.g., the target is just llvm), TVM generates LLVM IR and lowers it to machine code directly. In that case, it’s suggested to at least specify the CPU model so TVM can use AVX instructions, e.g. llvm -mcpu=skylake-avx512.
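For what it’s worth, here is a minimal sketch of that kind of single-op comparison using the current graph_executor API. The shapes and target strings are only illustrative, and the -libs=cblas build assumes TVM was compiled against a CBLAS implementation (OpenBLAS, ATLAS, or MKL):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# A single 512x512 dense layer, roughly matching the experiment above.
data = relay.var("data", shape=(512, 512), dtype="float32")
weight = relay.var("weight", shape=(512, 512), dtype="float32")
mod = tvm.IRModule.from_expr(
    relay.Function([data, weight], relay.nn.dense(data, weight)))

dev = tvm.cpu(0)
x = np.random.rand(512, 512).astype("float32")
w = np.random.rand(512, 512).astype("float32")

for target in ("llvm -mcpu=skylake-avx512",              # pure TVM codegen
               "llvm -mcpu=skylake-avx512 -libs=cblas"):  # offload dense to BLAS
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target)
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input("data", x)
    m.set_input("weight", w)
    timer = m.module.time_evaluator("run", dev, number=100, repeat=3)
    print(target, "->", timer().mean * 1e3, "ms")
```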

@comaniac The workload is a complete, ResNet-like network; there’s only one dense layer at the very top. When I build TVM with OpenBLAS/ATLAS and enable -libs=cblas, performance gets ~3% worse. When I build TVM with MKL, there’s no difference in performance w/ or w/o -libs=cblas. This behaviour is kind of unexpected imho.

By the way, I’m already specifying -mcpu in the target.

For most neural-network cases we would expect TVM to do better than the BLAS path if the workload is not typical, as the generated code can benefit from things like operator fusion and shape-specific tuning.


@tqchen Got it, thank you! In my case I’d still expect to see different performance with BLAS compared to the TVM baseline, but that doesn’t happen with MKL or MKL-DNN, almost as if those BLAS libraries aren’t used and the computation falls back to the TVM implementation.

Any other comment on this?
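One way to sanity-check that suspicion is to probe for the packed function that TVM registers when it is actually built against a CBLAS implementation (a hedged sketch; the function name is the one registered by TVM’s CBLAS contrib code):

```python
import tvm

# If TVM was built with USE_BLAS set to openblas/atlas/mkl, the CBLAS contrib
# code registers this packed function. If this prints None, the external
# library never made it into the libtvm build.
f = tvm.get_global_func("tvm.contrib.cblas.matmul", allow_missing=True)
print("cblas matmul registered:", f is not None)
```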

@haichen can you share what kind of performance improvement you obtained by using MKL-DNN backend?

From my experience with the BERT base model on an EC2 c5.4xlarge instance, I can reduce the latency from 93ms to 52ms just by switching the dense ops from the TOPI implementation (tuned with AutoTVM) to MKL-DNN with the OpenMP thread pool. In particular, the total latency of all dense ops in BERT drops from 65.7ms to 29.2ms.


Is this just from enabling the MKL-DNN backend when building TVM and adding -libs=cblas to the target?

Yes, that’s correct. Also set USE_OPENMP to gnu.


@haichen, could you kindly share your procedure to reproduce the BERT base model results?

I plan to write a blog post about how to reproduce the BERT base model performance using TVM. I’ll let you know after I post it.


@haichen @gasgallo I have the same situation as @gasgallo: there is no performance improvement with MKL-DNN. The model is a UNet CNN, and here are the relevant options in my config.cmake:

set(USE_BLAS mkl)
set(USE_MKL_PATH /home/abc/sdk/intel/mkl)
set(USE_MKLDNN /home/abc/sdk/dnnl_lnx_1.1.1_cpu_gomp)
set(USE_OPENMP gnu)
1. When running with llvm, the inference time is about 400ms.
2. When running with llvm -libs=cblas, the inference time is about 400ms (no improvement).
3. When running with llvm -mcpu=skylake, the inference time is about 200ms (a large improvement).

It seems MKL-DNN doesn’t work here; however, when I use the MXNet framework with MKL-DNN, it does bring a big improvement.

Currently USE_MKLDNN can only be ON or OFF; it doesn’t support a customized library path and relies on CMake to find the MKLDNN library location. See here.

If MKLDNN is enabled, you should find the following line in the CMake output:

Use MKLDNN library /path/to/mkldnn

@haichen Thanks for the info. My MKLDNN path is /home/abc/sdk/dnnl_lnx_1.1.1_cpu_gomp, but when I set USE_MKLDNN to ON there is no “Use MKLDNN library /path/to/mkldnn” line in the CMake output. It seems MKLDNN is not found. How can I set the MKLDNN path using CMake parameters?

@7oud Sorry about the late response. I pushed an update on this, and now you can specify a customized location for the MKLDNN library.

Any updates? I’m getting similar results here.