Currently, the GEMM schedules found by the TVM auto-scheduler on NVIDIA GPUs still have significant performance gaps compared with the NVIDIA CUTLASS library (benchmark table shown below). For each new shape, TVM needs to tune for some time to find the best schedule, which is very inefficient for dynamic-shape models. Another drawback of the current TVM auto-scheduler is that it lacks good support for NVIDIA Tensor Core instructions across different data types.

To bridge the gap between TVM and the SOTA libraries, cuBLAS for GEMM and cuDNN for convolution, I propose to bring CUTLASS into TVM codegen and take advantage of its ability to do operation fusion, so that models can potentially match or outperform their cuBLAS-based counterparts. NVIDIA CUTLASS is an open-source collection of CUDA C++ template abstractions for implementing high-performance matrix multiplication (GEMM) and convolution at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS. Based on NVIDIA's official performance benchmarks, CUTLASS reaches above 80% of cuBLAS performance on all workloads and outperforms cuBLAS on some workloads (figure from the CUTLASS GitHub shown below).

By integrating CUTLASS into TVM, we get the following benefits:
- For GEMM/convolution kernels alone, we will speed up the current best TVM schedules tuned by the auto-scheduler to above 80% of cuBLAS performance.
- We have the potential to match TensorRT performance, because integrating CUTLASS into TVM gives us op fusion, which cuBLAS does not support.
- We will support Tensor Core instructions for various data types.
- Currently TVM needs a TopHub database to store tuned schedules for all kinds of shapes, and when given a shape that hasn't been searched before, it either takes several hours to search or falls back to the default schedule, which is very slow. With CUTLASS, we have a pre-compiled library (~7 GB) ready that supports all shapes and data types with stable, near-optimal performance. This will solve the dynamic-shape issue of TVM.
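To illustrate the shape-agnostic dispatch idea above, here is a minimal pure-Python sketch. All names here (`KERNELS`, `get_kernel`, `run_gemm`) are hypothetical, and plain `numpy` matmul stands in for the pre-compiled CUTLASS kernels; the point is that dispatch is a cheap table lookup keyed on the problem description, with no per-shape tuning:

```python
import numpy as np

# Hypothetical kernel table: in the real integration this would map a
# (M, N, K, dtype) problem description to a kernel from the pre-compiled
# CUTLASS library. Here a numpy matmul stands in for each kernel.
KERNELS = {}

def get_kernel(m, n, k, dtype):
    """Return a kernel for this problem size, creating it on first use."""
    key = (m, n, k, dtype)
    if key not in KERNELS:
        KERNELS[key] = lambda a, b: a @ b  # stand-in for a CUTLASS GEMM
    return KERNELS[key]

def run_gemm(a, b):
    """Dispatch a GEMM by shape/dtype -- no search, just a table lookup."""
    m, k = a.shape
    _, n = b.shape
    return get_kernel(m, n, k, str(a.dtype))(a, b)
```

Because every shape hits a ready-made kernel, a dynamic-shape model pays only this lookup cost at runtime instead of hours of schedule search.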
GEMM TFLOPS on RTX 3090:

| M=N=K | Ansor-fp32 | CUT-fp32-simt | CUT-tf32 | CUT-f16 | CUT-int8 | CUT-int4 |
|-------|-----------|---------------|----------|---------|----------|----------|
| 512   | 7.047     | 6.902         | 16.193   | 25.678  | 40.721   | 51.501   |
| 1024  | 12.278    | 17.622        | 23.958   | 47.353  | 129.420  | 182.857  |
| 2048  | 11.936    | 20.037        | 32.975   | 71.100  | 209.294  | 388.731  |
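To put the table in perspective, a quick back-of-the-envelope calculation (numbers copied from the table) of the speedup of CUTLASS fp16 Tensor Core GEMMs over the Ansor fp32 schedules:

```python
# TFLOPS from the benchmark table above (RTX 3090, square M=N=K GEMMs).
ansor_fp32 = {512: 7.047, 1024: 12.278, 2048: 11.936}
cut_f16 = {512: 25.678, 1024: 47.353, 2048: 71.100}

# Speedup of CUTLASS fp16 over the auto-scheduler's fp32 schedule.
speedup = {n: cut_f16[n] / ansor_fp32[n] for n in ansor_fp32}
```

The gap widens with problem size: roughly 3.6x at 512 and close to 6x at 2048, which is where the missing Tensor Core support in the current auto-scheduler hurts most.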
This proposal aims to solve static- and dynamic-shape schedule performance issues in TVM for dense and convolution kernels on NVIDIA GPUs.
We are currently at the initial stage of the integration; here is an overview of the workflow:
- Use BYOC to support CUTLASS epilogue code generation for fusion. Then we can leverage TVM to fuse the remaining element-wise and broadcast ops.
- Benchmark popular dense workloads from BERT/Transformer and WDL models against the schedules found by the auto-scheduler.
- Update the RFC with the results.
- Rewrite the element-wise fusion strategy to offload codegen to CUTLASS, so the fusion optimization supports dynamic shapes.
- Rewrite the broadcast fusion strategy to offload codegen to CUTLASS, so the fusion optimization supports dynamic shapes.
- Send out PRs to TVM upstream.
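To make the epilogue-fusion steps above concrete, here is a numpy sketch (not the CUTLASS API; `gemm_with_epilogue` is a hypothetical name) of the computation a GEMM with a fused epilogue performs in a single kernel, instead of launching separate bias-add and activation kernels:

```python
import numpy as np

def gemm_with_epilogue(A, B, bias, alpha=1.0, beta=0.0, C=None, activation=None):
    """Sketch of a fused GEMM epilogue: D = act(alpha * A@B + beta * C + bias).

    In CUTLASS the epilogue (bias broadcast, activation, etc.) is applied to
    the accumulator tile while it is still in registers, avoiding extra
    round-trips to global memory; numpy here only models the math.
    """
    acc = alpha * (A @ B)
    if C is not None:
        acc = acc + beta * C
    acc = acc + bias            # bias broadcast across rows
    if activation is not None:
        acc = activation(acc)   # e.g. ReLU for dense + bias_add + relu
    return acc
```

This is the pattern (dense followed by element-wise/broadcast ops) that the BYOC partitioner would hand to CUTLASS as one fused unit.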
Thanks! Any thoughts or comments are appreciated.