Introduction
The optimizations build on the existing work of @Hzfengsy; see #4052 for details. We added Tensor Core enabled conv2d and dense schedules, along with Tensor Core intrinsics, to TOPI, and modified Relay so that AutoTVM can tune the Tensor Core related parameters.
The following major functions were added:
1. Conv2d_nhwc_tensorcore: a module that optimizes conv2d on Tensor Core with the NHWC layout. Both FP16 and FP32 inputs and outputs are supported (a usage sketch follows this list).
2. Conv2d_nhwc_direct: a module that implements conv2d with the NHWC layout without Tensor Core. Conv2d on Tensor Core currently supports only specific combinations of batch size, input channel, and output channel; conv2d falls back to this module when those shapes do not meet the Tensor Core requirements.
3. Dense_tensorcore: a module that optimizes the dense operation on Tensor Core.
4. TensorCore_common: Tensor Core intrinsics for conv2d and dense, including loading and storing data between shared memory and registers. Three wmma (Tensor Core instruction) input shapes are supported: 8x16x32, 16x16x16, and 32x16x8.
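As a rough illustration (the shapes and the build flow below are ours, not prescribed by the patch), an FP16 NHWC conv2d can be expressed in Relay as follows; when the batch size and channel counts meet the Tensor Core shape constraints, the conv2d_nhwc_tensorcore schedule is picked up, otherwise conv2d_nhwc_direct is used:

```python
import tvm
from tvm import relay

# FP16 NHWC conv2d, mirroring the 1x1 convolution used in the benchmarks below.
# The shapes (32, 14, 14, 1024) and (1, 1, 1024, 256) satisfy the Tensor Core
# divisibility requirements, so the tensorcore schedule can be selected.
data = relay.var("data", shape=(32, 14, 14, 1024), dtype="float16")     # NHWC
weight = relay.var("weight", shape=(1, 1, 1024, 256), dtype="float16")  # HWIO
out = relay.nn.conv2d(
    data, weight,
    channels=256, kernel_size=(1, 1),
    data_layout="NHWC", kernel_layout="HWIO",
    out_dtype="float16",
)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

# Build for a Tensor Core capable GPU (e.g. V100 / T4).
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target="cuda")
```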
We thank Siyuan Feng (@Hzfengsy) for his kindness and advice on these optimizations.
Optimization tricks on Tensor Core
Conv2d computation on Tensor Core is memory-bandwidth bound. Vectorized data loading and storing are used to accelerate data movement between global and shared memory.
The NHWC layout was chosen to facilitate coalesced memory access on the GPU, which is key to CUDA performance.
To support fused operations in Relay, a temporary buffer in shared memory is added to stage results on their way from registers to global memory.
The offsets used when reading data from and writing data into shared memory are tuned by AutoTVM to avoid shared memory bank conflicts.
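A minimal schedule sketch of the vectorization and offset tricks, using a toy FP16 matmul rather than the actual conv2d/dense schedules (the tile size, vector width, and offset value below are placeholders for what AutoTVM would normally tune):

```python
import tvm
from tvm import te

# Toy FP16 matmul; the real conv2d/dense Tensor Core schedules apply the same ideas.
M, N, K = 256, 256, 256
A = te.placeholder((M, K), dtype="float16", name="A")
B = te.placeholder((K, N), dtype="float16", name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute(
    (M, N),
    lambda i, j: te.sum(A[i, k].astype("float32") * B[k, j].astype("float32"), axis=k),
    name="C",
)

s = te.create_schedule(C.op)
AS = s.cache_read(A, "shared", [C])  # stage A through shared memory

# Trick 1: vectorized copy from global to shared memory.
vec_width = 8                                  # would be an AutoTVM knob
ao, ai = s[AS].split(s[AS].op.axis[1], factor=vec_width)
s[AS].vectorize(ai)

# Trick 2: pad the shared-memory row stride with a tunable offset
# so that consecutive rows fall into different banks.
offset = 8                                     # would be an AutoTVM knob
s[AS].storage_align(s[AS].op.axis[0], 32, offset)
```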
Performance
The benchmarks below were run on a V100 GPU (32 GB, 350 W) and a T4 GPU (16 GB, 70 W). Latency is reported in milliseconds (ms).
batch size | FP16(NCHW) | Tensor Core(NHWC) | SpeedUp |
---|---|---|---|
32 | 0.37 | 0.12 | 3.08 |
64 | 0.65 | 0.17 | 3.82 |
256 | 2.36 | 0.43 | 5.49 |
512 | 4.23 | 0.8 | 5.29 |
Table 1. 1x1 convolution with weight shape 1024x256 (IC, OC) on V100. The input feature map shape is 14x14x1024.
batch size | FP16(NCHW) | Tensor Core(NHWC) | SpeedUp |
---|---|---|---|
32 | 1.11 | 0.29 | 3.83 |
64 | 1.95 | 0.49 | 3.98 |
256 | 7.03 | 1.64 | 4.29 |
512 | 13.34 | 3.46 | 3.86 |
Table 2. 1x1 convolution with weight shape 1024x256 (IC, OC) on T4. The input feature map shape is 14x14x1024.
batch size | FP16(NCHW) | Tensor Core(NHWC) | SpeedUp |
---|---|---|---|
32 | 0.77 | 0.19 | 4.05 |
64 | 1.34 | 0.27 | 4.96 |
256 | 4.45 | 0.89 | 5.00 |
512 | 9.08 | 2.11 | 4.30 |
Table 3. 3x3 convolution with weight shape 3x3x256x256 (KH, KW, IC, OC) on V100. The input feature map shape is 14x14x256.
batch size | FP16(NCHW) | Tensor Core(NHWC) | SpeedUp |
---|---|---|---|
32 | 1.9 | 0.56 | 3.39 |
64 | 3.24 | 0.7 | 4.63 |
256 | 13.61 | 2.96 | 4.60 |
512 | 28.24 | 6.03 | 4.68 |
Table 4. 3x3 convolution with weight shape 3x3x256x256 (KH, KW, IC, OC) on T4. The input feature map shape is 14x14x256.
batch size | FP16 | Tensor Core | SpeedUp |
---|---|---|---|
32 | 0.1 | 0.03 | 3.33 |
256 | 0.31 | 0.06 | 5.17 |
Table 5. Dense on V100. The weight shape is 1024x2048 (OC, IC).
batch size | FP16 | Tensor Core | SpeedUp |
---|---|---|---|
32 | 0.21 | 0.06 | 3.50 |
256 | 0.65 | 0.1 | 6.50 |
Table 6. Dense on T4. The weight shape is 1024x2048 (OC, IC).
batch size | FP16_TVM | FP16_TensorFlow | FP16_XLA_TensorFlow* | FP16_TensorCore_TVM |
---|---|---|---|---|
32 | 21.19 | 16.40 | 10.45 | 8.17 |
256 | 148.03 | 93.75 | 51.65 | 43.13 |
Table 7. ResNet-50 on V100. *FP16_XLA_TensorFlow: TensorFlow benchmark with XLA and FP16 on Tensor Core.
batch size | FP16_TVM | FP16_TensorFlow | FP16_XLA_TensorFlow | FP16_TensorCore_TVM |
---|---|---|---|---|
32 | 65.04 | 42.67 | 25.29 | 22.84 |
256 | 508.87 | 325.10 | 185.65 | 148.85 |
Table 8. ResNet-50 on T4.
As the tables above show, the performance improvements for both the unit tests and the ResNet-50 benchmark are substantial. For ResNet-50, the speedup of the Tensor Core optimizations in TVM is up to 3.43x over TVM FP16 and 1.28x over TensorFlow with XLA.
Limitations
1. Only the NHWC layout is supported for Tensor Core.
2. The loops of convolution and dense are split into blocks that are consumed by Tensor Core. The dimensions of these blocks are tied to the batch size, input channel, and output channel. For FP16, the Tensor Core input shapes must be (8, 16, 32), (16, 16, 16), or (32, 16, 8), hereafter denoted (t1, t2, t3). The current implementation requires that the batch size, input channel, and output channel be divisible by t1, t2, and t3, respectively (see the dispatch sketch after this list).
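For clarity, the dispatch rule can be summarized by the small helper below (hypothetical, not part of the codebase):

```python
# The three wmma shapes supported for FP16, written as (t1, t2, t3).
WMMA_SHAPES = [(8, 16, 32), (16, 16, 16), (32, 16, 8)]

def tensorcore_applicable(batch, in_channel, out_channel):
    """True if (batch, IC, OC) is divisible by (t1, t2, t3) for some wmma shape."""
    return any(
        batch % t1 == 0 and in_channel % t2 == 0 and out_channel % t3 == 0
        for t1, t2, t3 in WMMA_SHAPES
    )

# e.g. the benchmark shape (batch=32, IC=1024, OC=256) qualifies, so
# conv2d_nhwc_tensorcore is used; otherwise conv2d falls back to conv2d_nhwc_direct.
assert tensorcore_applicable(32, 1024, 256)
```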
Open questions:
1. Feature maps and weights have to be converted to FP16 when Tensor Core is used with FP32 inputs. The conversions between FP16 and FP32 degrade performance.
To fix this, we would prefer to add a Relay pass that automatically detects conversions between FP16 and FP32 and eliminates coupled pairs, such as a cast from FP16 to FP32 emitted after one operation immediately followed by a cast back from FP32 to FP16 for the next.
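A minimal sketch of that idea, assuming type inference has already been run (this pass does not exist yet; the class name and the exact matching rule are ours):

```python
from tvm import relay
from tvm.relay.expr_functor import ExprMutator

class EliminateRedundantCast(ExprMutator):
    """Fold cast(cast(x, float32), float16) back to x when x is already float16.

    Assumes relay.transform.InferType() has been run so checked_type is available.
    """

    def visit_call(self, call):
        new_call = super().visit_call(call)
        if (isinstance(new_call, relay.Call)
                and new_call.op == relay.op.get("cast")
                and str(new_call.attrs.dtype) == "float16"):
            inner = new_call.args[0]
            if (isinstance(inner, relay.Call)
                    and inner.op == relay.op.get("cast")
                    and str(inner.attrs.dtype) == "float32"
                    and inner.args[0].checked_type.dtype == "float16"):
                return inner.args[0]  # drop the fp16 -> fp32 -> fp16 round trip
        return new_call
```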
2. Tensor Core currently only supports computation in FP16, INT8, INT4, INT2, and INT1, which requires feature maps and weights to be quantized before computing. Should weight quantization, such as FP32 to FP16 or INT8, be placed in the quantization module?
Future Plans:
1. Support all shapes of batch size, input channel, and output channel.
2. Winograd on Tensor Core with FP16 and INT8.
3. Conv2d on Tensor Core with INT8.
4. Dense on Tensor Core with INT8.