[RFC][Tensor Core] Optimization of CNNs on Tensor Core


These optimizations build on the existing work of @Hzfengsy; see #4052 for details. We added Tensor Core enabled conv2d and dense schedules, along with Tensor Core intrinsics, to TOPI, and modified Relay so that AutoTVM can tune the Tensor Core related parameters.

The following major functions were added:

1. `conv2d_nhwc_tensorcore`: a schedule that optimizes conv2d on Tensor Cores with the NHWC layout. Both FP16 and FP32 inputs and outputs are supported.

2. `conv2d_nhwc_direct`: a schedule implementing conv2d with the NHWC layout. Tensor Core conv2d currently supports only specific shapes of batch size, input channels, and output channels; conv2d falls back to this schedule when those shapes do not meet the Tensor Core requirements.

3. `dense_tensorcore`: a schedule that optimizes the dense operation on Tensor Cores.

4. `tensorcore_common`: Tensor Core intrinsics for conv2d and dense, including loading and storing data between shared memory and registers. Three wmma (Tensor Core instruction) input shapes are supported: 8x16x32, 16x16x16, and 32x16x8.

We thank Siyuan Feng @Hzfengsy for his kindness and advice on these optimizations.

Optimization tricks on Tensor Core

Conv2d computation on Tensor Cores is memory-bandwidth bound. Vectorized loads and stores are used to accelerate data movement between global and shared memory.

The NHWC layout was chosen to facilitate coalesced memory access on the GPU, which is key to CUDA performance.
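As an illustration of the layout choice, here is a standalone sketch (plain Python, not TVM code; the address helpers and the thread-to-channel mapping are hypothetical) showing why NHWC keeps channel-adjacent accesses contiguous while NCHW strides them by H*W:

```python
# Sketch: why NHWC gives coalesced global-memory access. If adjacent
# CUDA threads are mapped to adjacent channels, NHWC puts their
# elements at consecutive addresses, while NCHW separates them by H*W.

N, H, W, C = 1, 14, 14, 1024  # example feature map from the benchmarks

def addr_nhwc(n, h, w, c):
    # Linear offset of element (n, h, w, c) in an NHWC tensor.
    return ((n * H + h) * W + w) * C + c

def addr_nchw(n, c, h, w):
    # Linear offset of element (n, c, h, w) in an NCHW tensor.
    return ((n * C + c) * H + h) * W + w

# Thread i loads channel i at a fixed (n, h, w) position:
nhwc_stride = addr_nhwc(0, 3, 5, 1) - addr_nhwc(0, 3, 5, 0)
nchw_stride = addr_nchw(0, 1, 3, 5) - addr_nchw(0, 0, 3, 5)
print(nhwc_stride)  # -> 1   (consecutive: one coalesced transaction)
print(nchw_stride)  # -> 196 (H*W apart: scattered accesses)
```

With NHWC, a warp reading consecutive channels touches one contiguous block of memory, which the hardware can service with a minimal number of transactions.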

To support fused operations in Relay, a temporary buffer in shared memory is added to stage results from registers before they are written to global memory.

Offsets used when reading data from and storing data into shared memory are auto-tuned by AutoTVM to avoid shared-memory bank conflicts.
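A minimal sketch of the idea (plain Python, not TVM; `max_bank_conflict` is a hypothetical helper): padding the shared-memory row pitch by a small tunable offset changes which bank each thread of a warp hits, and a good offset can remove conflicts entirely:

```python
# Shared memory on NVIDIA GPUs is divided into 32 banks of 4-byte
# words; threads of a warp that hit the same bank at different
# addresses are serialized.

NUM_BANKS = 32

def max_bank_conflict(row_pitch_words, num_threads=32):
    """Worst-case serialization when a warp performs a column access,
    i.e. thread i touches word i * row_pitch_words."""
    banks = [(i * row_pitch_words) % NUM_BANKS for i in range(num_threads)]
    return max(banks.count(b) for b in set(banks))

# A 32-word row pitch maps every thread to bank 0: a 32-way conflict.
print(max_bank_conflict(32))  # -> 32
# Padding the pitch by one word (an AutoTVM-tunable offset) spreads
# the accesses across all 32 banks: conflict-free.
print(max_bank_conflict(33))  # -> 1
```

In the actual schedules the offset is a tuning knob, since the best value depends on the tile sizes chosen for each workload.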


The benchmarks below were run on a V100 GPU (32 GB, 350 W) and a T4 GPU (16 GB, 70 W). Latency is reported in milliseconds.

| batch size | FP16 (NCHW) | Tensor Core (NHWC) | SpeedUp |
| --- | --- | --- | --- |
| 32 | 0.37 | 0.12 | 3.08 |
| 64 | 0.65 | 0.17 | 3.82 |
| 256 | 2.36 | 0.43 | 5.49 |
| 512 | 4.23 | 0.8 | 5.29 |

Table 1. 1x1 convolution with shape 1024x256 on V100. The input feature map shape is 14x14x1024.

| batch size | FP16 (NCHW) | Tensor Core (NHWC) | SpeedUp |
| --- | --- | --- | --- |
| 32 | 1.11 | 0.29 | 3.83 |
| 64 | 1.95 | 0.49 | 3.98 |
| 256 | 7.03 | 1.64 | 4.29 |
| 512 | 13.34 | 3.46 | 3.86 |

Table 2. 1x1 convolution with shape 1024x256 on T4. The input feature map shape is 14x14x1024.

| batch size | FP16 (NCHW) | Tensor Core (NHWC) | SpeedUp |
| --- | --- | --- | --- |
| 32 | 0.77 | 0.19 | 4.05 |
| 64 | 1.34 | 0.27 | 4.96 |
| 256 | 4.45 | 0.89 | 5.00 |
| 512 | 9.08 | 2.11 | 4.30 |

Table 3. 3x3 convolution with shape 3x3x256x256 on V100. The input feature map shape is 14x14x256.

| batch size | FP16 (NCHW) | Tensor Core (NHWC) | SpeedUp |
| --- | --- | --- | --- |
| 32 | 1.9 | 0.56 | 3.39 |
| 64 | 3.24 | 0.7 | 4.63 |
| 256 | 13.61 | 2.96 | 4.60 |
| 512 | 28.24 | 6.03 | 4.68 |

Table 4. 3x3 convolution with shape 3x3x256x256 on T4. The input feature map shape is 14x14x256.

| batch size | FP16 | Tensor Core | SpeedUp |
| --- | --- | --- | --- |
| 32 | 0.1 | 0.03 | 3.33 |
| 256 | 0.31 | 0.06 | 5.17 |

Table 5. Dense on V100. The shape is 1024x2048 (OC, IC).

| batch size | FP16 | Tensor Core | SpeedUp |
| --- | --- | --- | --- |
| 32 | 0.21 | 0.06 | 3.50 |
| 256 | 0.65 | 0.1 | 6.50 |

Table 6. Dense on T4. The shape is 1024x2048 (OC, IC).

| batch size | FP16_TVM | FP16_TensorFlow | FP16_XLA_TensorFlow* | FP16_TensorCore_TVM |
| --- | --- | --- | --- | --- |
| 32 | 21.19 | 16.40 | 10.45 | 8.17 |
| 256 | 148.03 | 93.75 | 51.65 | 43.13 |

Table 7. ResNet-50 on V100. *FP16_XLA_TensorFlow: TensorFlow benchmark with XLA and FP16 on Tensor Cores.

| batch size | FP16_TVM | FP16_TensorFlow | FP16_XLA_TensorFlow | FP16_TensorCore_TVM |
| --- | --- | --- | --- | --- |
| 32 | 65.04 | 42.67 | 25.29 | 22.84 |
| 256 | 508.87 | 325.10 | 185.65 | 148.85 |

Table 8. ResNet-50 on T4.

As the tables above show, the performance improvements for both the unit tests and the ResNet-50 benchmark are substantial. For ResNet-50, TVM with Tensor Core optimizations achieves speedups of up to 3.43x over TVM FP16 and 1.28x over TensorFlow with XLA.


Limitations:

1. Only the NHWC layout is supported for Tensor Cores.

2. The loops of convolution and dense are split into blocks that are consumed by Tensor Cores. The dimensions of these blocks correspond to batch size, input channels, and output channels. For FP16, the Tensor Core input shape must be (8, 16, 32), (16, 16, 16), or (32, 16, 8), hereafter denoted (t1, t2, t3). The current implementation requires that the batch size, input channels, and output channels be divisible by t1, t2, and t3, respectively.
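To make the constraint concrete, here is a small sketch (a hypothetical helper, not part of the patch) that checks a workload against the three fp16 wmma shapes:

```python
# The three wmma input shapes supported for fp16, as (t1, t2, t3).
WMMA_SHAPES = [(8, 16, 32), (16, 16, 16), (32, 16, 8)]

def tensorcore_compatible(batch, in_channels, out_channels):
    """Return the wmma shapes usable for this workload, if any.
    Batch, input channels, and output channels must be divisible
    by t1, t2, and t3, respectively."""
    return [
        (t1, t2, t3)
        for (t1, t2, t3) in WMMA_SHAPES
        if batch % t1 == 0 and in_channels % t2 == 0 and out_channels % t3 == 0
    ]

# batch 32, IC 1024, OC 256: all three wmma shapes qualify.
print(tensorcore_compatible(32, 1024, 256))
# batch 20 is not divisible by 8, 16, or 32, so the workload falls
# back to conv2d_nhwc_direct.
print(tensorcore_compatible(20, 1024, 256))  # -> []
```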

Open questions:

1. Feature maps and weights must be converted to FP16 when Tensor Cores are used with FP32 inputs. The conversion between FP16 and FP32 degrades performance.

To address this, we propose adding a pass in Relay that automatically detects conversions between FP16 and FP32 and eliminates redundant pairs, for example a cast from FP16 to FP32 after one operation followed immediately by a cast back from FP32 to FP16 before the next.
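A toy sketch of the proposed pass (plain Python over a linear op list, not actual Relay IR; `eliminate_redundant_casts` and the string encoding are hypothetical):

```python
# Scan a linear sequence of ops and drop adjacent cast pairs that
# cancel out, such as cast(fp16->fp32) immediately followed by
# cast(fp32->fp16). A real Relay pass would rewrite the dataflow
# graph instead of a flat list.

def eliminate_redundant_casts(ops):
    """ops is a list of strings; casts are written 'cast:SRC->DST'."""
    out = []
    for op in ops:
        if out and op.startswith("cast:") and out[-1].startswith("cast:"):
            src1, dst1 = out[-1][len("cast:"):].split("->")
            src2, dst2 = op[len("cast:"):].split("->")
            if dst1 == src2 and src1 == dst2:  # the pair cancels
                out.pop()
                continue
        out.append(op)
    return out

prog = ["conv2d", "cast:fp16->fp32", "cast:fp32->fp16", "dense"]
print(eliminate_redundant_casts(prog))  # -> ['conv2d', 'dense']
```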

2. Tensor Cores currently support only FP16, INT8, INT4, INT2, and INT1 computation, which requires feature maps and weights to be quantized before computing. Should we place weight quantization, such as FP32 to FP16 or INT8, into the quantization module?

Future Plans:

1. All shapes of batch size, input channels, and output channels will be supported.

2. Winograd on Tensor Cores with FP16 and INT8.

3. conv2d on Tensor Cores with INT8.

4. dense on Tensor Cores with INT8.


Thank you @Shawn_Inspur. This is very impressive and welcome work!

Together with @Shawn_Inspur and his teammates, we have brought Tensor Core support into TOPI and Relay. The current performance is much better than the existing TVM workloads (without Tensor Cores) and TensorFlow (with Tensor Cores), but still slower than TensorRT. Optimization is still in progress. The code will be ready in 1-2 weeks, and Shawn will publish the PR then.

It would be great to see discussion of the open questions. Any other comments are welcome too!

cc @Laurawly @vinx13 @masahi

Does XLA use cuDNN? I am surprised to see such high performance.

Is Tensor Core supported when cuBLAS is enabled as an external library? Also, is there a tutorial showing how to make sure these optimizations are enabled?

Have you compared with TensorRT INT8 Tensor Core performance?

Hi @Hzfengsy, we are currently working on very similar things on the TOPI and Relay side, but mostly targeting int4 CNN Tensor Core optimizations. Would it be possible to share the code with us before sending the PR, so that we avoid duplicated commits?

It would be interesting to see the performance difference if the layout is NHWC.

After PR #4353, TVM supports Tensor Cores through cuBLAS and cuDNN. In this PR, we introduce an approach that allows users to use TVM schedules and AutoTVM to run Tensor Cores in conv2d and dense ops.

Users can run the module through Relay without any modification. Tensor Cores will be enabled as long as the shapes satisfy the constraints and the layout is NHWC.

As far as I know, XLA invokes cuDNN for basic operations like matmuls and convolutions. XLA also generates optimized PTX code for some fused ops.

Currently, only FP16 Tensor Core is supported; we have not compared it with TensorRT INT8 performance.

We are pleased to share the code. It is currently under internal review; I will send it to you once the review is finished.


cuDNN uses a special layout (NC/32HW32) for Tensor Core INT8 operations. Do you think Tensor Core INT8 operations on the NHWC layout can achieve performance comparable with cuDNN?

If a special layout is needed to optimize Tensor Core INT8 operations, does this mean a graph tuner is also needed to determine each layer's tensor format?

Thanks, please keep me posted on the code.

Here is the link: https://github.com/Shawn-Inspur/incubator-tvm/tree/tensorcore. For any questions, please feel free to let me know.


Currently we are working on Tensor Core INT8 with NHWC. The optimizations used for FP16 will be ported to INT8, and the results will be published in the near future.

Yes, I think so. A graph tuner would help explore layout-related performance improvements. However, it is also possible to apply a single layout to the whole network if all the ops are implemented for that layout.


Could you share your code for conv2d without Tensor Cores? I would like to study the schedule for conv2d without Tensor Cores.

Thank you very much.

We are pleased to share the code. Please check the link below:


This is the code that uses the same layout as the Tensor Core conv2d.

For any questions, please feel free to let me know.

Hi Shawn,

I'm curious why AS_align = chunk * wmma_k + offset, but WS_align = warp_col_tiles * block_col_warps * wmma_k + offset. (AS_align does not include warp_row_tiles and block_row_warps.)

Thank you