[RFC][BYOC] Intel(R) oneDNN Integration

Summary

This RFC proposes to integrate DNNL into TVM via the BYOC framework. The drawbacks of the current “Bring DNNL to TVM via DNNL JSON codegen/runtime” approach are analysed and addressed in an enhanced implementation. Performance benefits are observed when comparing against either MXNet-oneDNN or the TVM auto-scheduler on several popular workloads.

Motivation

TVM has shown good performance on many CV models. One of its major advantages is high throughput, which benefits from its small runtime overhead. However, tuning is needed for each new shape, and it usually takes a long time.

oneDNN is an open-source, cross-platform performance library of basic building blocks for deep learning applications. The library is optimized for Intel(R) Architecture Processors, Intel(R) Processor Graphics and Xe Architecture graphics. Given a new shape and the environment configuration, oneDNN is able to infer the optimal data format immediately. To take advantage of the small overhead of TVM and achieve the best performance on CPU in a short time, we propose to integrate oneDNN into TVM via the BYOC framework.

Currently, the BYOC homepage provides a simple example of integrating DNNL (now named oneDNN) into TVM, but its performance is far from both the TVM auto-scheduler and MXNet-oneDNN, for the following main reasons:

  • Non-optimal layout was used in dnnl ops.

  • Insufficient subgraph partitioning.

  • Unnecessary overhead due to memory copy from tensor to dnnl memory buffer or vice versa.

Status

We have solved the above issues and observed performance benefits over both MXNet-oneDNN and the TVM auto-scheduler on several popular workloads such as ResNet50_v1b, InceptionV3 and VGG11_bn, in several scenarios including latency (Figure 1, single instance with 28 cores and bs=1), throughput (Figure 2, single instance with 28 cores and bs=32) and real-time (Figure 3, 7 instances with 4 cores each and bs=1) modes.

*Note

Hardware config

  • Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz

Compilation config

  • g++ 7

  • ‘llvm -mcpu=cascadelake -model=platinum-8280’

  • TVM commitID: 19b23b9

  • MXNet version: V1.8.0

  • oneDNN version: v1.7 / v2.4

Runtime config

  • 20 warm-up batches and 100 measured batches

Proposal

This proposal aims to provide a new approach to integrate oneDNN into TVM via DNNL JSON codegen/runtime by applying the following adjustments to tackle the aforementioned issues:

  • Register a new “alter_op_layout” function for dnnl to obtain the optimal layouts for dnnl ops, backed by a new layout auto-query function in Relay.

  • Add a custom pass to rewrite the “Conv-Add-Add-ReLU” pattern into “Conv-Add-ReLU”, to better handle the pattern that comes from BatchNorm folding (“Conv-bias_add-BN-ReLU”).

  • Add a new pattern “Conv-Add-Sum-ReLU” for fusion (see the sketch after this list).

  • Remove the unnecessary memory copies in “dnnl_json_runtime.cc” by using pointer assignment only.
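As an illustration of the fusion items above, here is a minimal sketch, assuming TVM’s dataflow pattern API, of how a “conv2d_bias_sum_relu” pattern could be declared and registered in the DNNL pattern table. The helper and pattern names are illustrative, not the exact implementation:

  # Sketch: declare conv2d -> add(bias) -> add(residual) -> relu as a DNNL fusion pattern.
  from tvm.relay.dataflow_pattern import is_op, wildcard
  from tvm.relay.op.contrib.register import register_pattern_table

  def make_conv_bias_sum_relu_pattern():
      data, weight, bias, residual = wildcard(), wildcard(), wildcard(), wildcard()
      conv = is_op("nn.conv2d")(data, weight)
      conv_bias = is_op("add")(conv, bias)          # bias add
      conv_sum = is_op("add")(conv_bias, residual)  # sum with the residual input
      return is_op("nn.relu")(conv_sum)

  @register_pattern_table("dnnl")
  def dnnl_pattern_table():
      # Only the new pattern is shown; the full table would also list
      # conv2d_bias_relu, conv2d_bias, dense_bias_relu and dense_bias.
      return [("dnnl.conv2d_bias_sum_relu", make_conv_bias_sum_relu_pattern())]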

We have enhanced and updated the support. The following ops, post-op fusions and datatypes have been added or enhanced, and several CV models have been verified with the new oneDNN backend. We are going to cover more ops, datatypes and models (denoted with *) in the next step.

Ops

  • nn.conv2d

  • nn.dense

  • nn.relu

  • nn.max_pool2d

  • nn.avg_pool2d

  • matrix multiplication *

  • nn.conv1d *

  • nn.conv3d *

  • depthwise conv *

Post-Op Fusions

  • conv2d_bias_sum_relu

  • conv2d_bias_relu

  • conv2d_bias

  • dense_bias_relu

  • dense_bias

  • Eltwise Post-op *

  • Depthwise *

  • Binary *

  • PReLu *

Datatype

  • Float32

  • BF16 *

  • INT8 *

Verified CV Models (from gluoncv)

  • ResNet 18, 32, 50, 101, 152

  • VGG 11, 13, 16, 19; VGG_BN 11, 13, 16, 19

  • InceptionV3

  • Mobilenet *

  • Bert *

Thanks! Any ideas or suggestions are welcome!


Thanks for the RFC! The proposal makes lots of sense to me. Some questions:

  • From Figure 1 and Figure 2, it seems like v1.7 performs better with batch size 1 while v2.4 is better with batch size 32. Why do oneDNN v1.7 and oneDNN v2.4 have such an obvious performance gap, and which one should we recommend to users?
  • For data types other than float32, are you planning to follow the data type defined in Relay, or could you support partial quantization? For example, when running a model in float32, are you somehow able to use INT8 or BF16 in the partitioned functions?

Thanks for the RFC! I like this proposal! Would love to see a more fine-grained ablation study as well 🙂

Thanks for the RFC!

The benchmark results disagree with our tests on 8255c; our AutoScheduler didn’t perform that well compared with oneDNN. Is it possible to write a reproduction guide or ablation study as @junrushao suggested?

Thank you for your comment.

  • We notice the performance gap between oneDNN v1.7 and v2.4 as well. The experiment results show that v1.7 performs better than v2.4 only under the latency scenario, while v2.4 achieves the best performance under both the throughput and real-time scenarios.

  • We consider it to be caused by the difference between the preferred layouts of the two versions. v1.7 prefers the “NCHWxc” and “OIHWxixo” layouts for the data and weight of convolution. v2.4 uses the channel-last layout “NHWC” and “OHWIxo” for more convenient use.

  • We are now working on enabling BF16 model. @yangulei will give some details.

We are working on other data types; the support for bfloat16 is almost done and under further testing. We first convert the Relay graph to bfloat16 using AMP, then parts of the graph can be consumed by the oneDNN BYOC. We have tested bfloat16 mode with ResNet50_v1b and got comparable results with respect to float32 mode.
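For reference, here is a minimal sketch of the flow described above. The partition_for_dnnl helper and the exact pass ordering are assumptions for illustration, not the exact code in our branch:

  # Sketch: convert a Relay module to bfloat16 with AMP, then offload supported
  # subgraphs to the DNNL BYOC backend.
  import tvm
  from tvm import relay
  from tvm.relay.op.contrib.dnnl import partition_for_dnnl  # assumed partition helper

  def build_bf16_dnnl(mod, params, target="llvm -mcpu=cascadelake"):
      mod = relay.transform.InferType()(mod)
      mod = relay.transform.ToMixedPrecision("bfloat16")(mod)  # AMP conversion
      mod = partition_for_dnnl(mod)  # hand supported subgraphs to oneDNN
      with tvm.transform.PassContext(opt_level=3):
          return relay.build(mod, target=target, params=params)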

Thank you for your comment.

The tuning and benchmark code is released here: https://github.com/crazydemo/TLCBench/tree/cascadelake.

You can run tune_autoscheduler.sh and benchmark_autoscheduler.sh with the latest TVM master. Our test is based on TVM commitID: 19b23b9.

Thank you for RFC!

I absolutely agree that the current state of the DNNL BYOC integration is a proof of concept and far from the potential performance offered by DNNL. I am also trying to improve some of the aspects you mention above (like zero-copy memory handling and enhanced fusing patterns). My primary goal is to enable int8 scoring through DNNL, but a lot of accompanying improvements relevant not only for int8 are required.

I took a look at your code and have some questions.

  1. You introduced a new fusing pattern Conv+Bias+Sum+Relu. That’s an advanced pattern because it requires proper tensor handling. I mean src[3] and dst[0] should be the same external memory buffer (if you would like to keep the zero-copy benefits). How did you achieve that?
    I also tried to do that, but I had to introduce changes at the TVM memory allocation level. However, as far as I can see, your code changes nothing outside of the DNNL BYOC part.

  2. Layout selection via the ‘LayoutQuery’ func is a good idea. But unfortunately that will only work if you build and run on the same machine (or a similar one). Successful cross compilation will be difficult to achieve with that approach.
    There is also a problem with the current implementation of the query API. There is an inconvenience in the DNNL API: the lists of available primitive descriptors for convolution and convolution+attributes are not identical, so additional attributes with post-ops may change the preferred layout.
    I guess an alternative way is to use some kind of opaque tensors and move layout selection to the runtime stage.

  3. About the special BN merge pass: wow, I would be very surprised if TVM has no pass to merge sequential linear operators into one. That is definitely a good point and should be available out of the box.

If you are interested in my changes, you may take a look at them here:

Here is a PR with int8 support for the DNNL runtime. I guess it may be helpful in the context of this RFC.

Thank you for your comment and suggestions!

  • Currently, I only considered the case where the post-op is in-placeable. I just bind the entryID of src[3] and dst[0] to the same memory; the corresponding code can be found in dnnl_json_runtime.cc:282-286. This solution is not robust enough. I originally planned to check in run() whether the tensor is in-placeable and then do the memory binding, but this may not be able to keep zero copy. I think making some modifications at the TVM memory allocation level could be a general solution.

  • I did not notice that the “list of available primitive descriptors for convolution and convolution+attributes is not identical”. This finding can change our solution. I will look into it.

  • I am also confused that there is no pass to merge two linear operators. The pattern looks like

  %0 = nn.conv2d(%data, meta[relay.Constant][0] /* ty=Tensor[(64, 3, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %1 = add(%0, meta[relay.Constant][1] /* ty=Tensor[(64, 1, 1), float32] */) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %2 = add(%1, meta[relay.Constant][2] /* ty=Tensor[(64, 1, 1), float32] */) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %3 = nn.relu(%2) /* ty=Tensor[(1, 64, 224, 224), float32] */;

Only when I convert the pattern into

  %0 = nn.conv2d(%data, meta[relay.Constant][0] /* ty=Tensor[(64, 3, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %1 = add(meta[relay.Constant][1] /* ty=Tensor[(64, 1, 1), float32] */, meta[relay.Constant][2] /* ty=Tensor[(64, 1, 1), float32] */) /* ty=Tensor[(64, 1, 1), float32] */;
  %2 = add(%0, %1) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %3 = nn.relu(%2) /* ty=Tensor[(1, 64, 224, 224), float32] */;

Then I can apply constant folding to remove the %1 add, and the result is:

  %0 = nn.conv2d(%data, meta[relay.Constant][0] /* ty=Tensor[(64, 3, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %1 = add(%0, meta[relay.Constant][1] /* ty=Tensor[(64, 1, 1), float32] */) /* ty=Tensor[(1, 64, 224, 224), float32] */;
  %2 = nn.relu(%1) /* ty=Tensor[(1, 64, 224, 224), float32] */;

If you find any pass that can handle this case, please share it with me!

BTW, this case only happens in VGG_BN series models.

I don’t think we have a pass that does such rewrite. Can you send a PR to add that pass separately from the main one (oneDNN integration)? I think it is very useful. I’m also working on another BYOC which only allows one “bias-like” tensor in a post op.

This looks like the job for SimplifyExpr or FoldConstant. @crazydemo you could either add a pattern to SimplifyExpr to deal with a series of add with constants, or improve FoldConstant to fold this case.

Thank you for your suggestion. I will add a pattern in SimplifyExpr and use FoldConstant to tackle this case.
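For illustration, here is a minimal sketch of how such a rewrite could be expressed with Relay’s dataflow pattern callback so that FoldConstant can then fold the combined constant; the class and function names are hypothetical:

  # Sketch: rewrite add(add(x, c1), c2) into add(x, add(c1, c2)) so that
  # FoldConstant can evaluate the constant add afterwards.
  from tvm import relay
  from tvm.relay.dataflow_pattern import (DFPatternCallback, is_constant, is_op,
                                          rewrite, wildcard)

  class SimplifyConsecutiveAdd(DFPatternCallback):
      def __init__(self):
          super().__init__()
          self.x = wildcard()
          self.c1 = is_constant()
          self.c2 = is_constant()
          self.pattern = is_op("add")(is_op("add")(self.x, self.c1), self.c2)

      def callback(self, pre, post, node_map):
          x = node_map[self.x][0]
          c1 = node_map[self.c1][0]
          c2 = node_map[self.c2][0]
          # Reassociate so the two constants are added together first.
          return relay.add(x, relay.add(c1, c2))

  def simplify_consecutive_add(mod):
      mod["main"] = rewrite(SimplifyConsecutiveAdd(), mod["main"])
      return relay.transform.FoldConstant()(mod)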

Thank you for your approval, we will submit the PR very soon.

The related PR has been submitted: “Add pattern in pass SimplifyExpr”.