ResNet50 compiled with cuDNN and cuBLAS is significantly slower than with CUTLASS

Hello! I am currently testing the ResNet50 model on NVIDIA's 40GB A100 platform using TVM. After utilizing CUTLASS, the compiled ResNet50 runs 2.9 times faster than when it is compiled with cuDNN and cuBLAS. However, I find it strange that CUTLASS can achieve such a significant speed-up. Is there anything I am overlooking? Below are the specifics of my testing process:

To run the resnet50/ example with TVM's main-branch API, I made some modifications to the testing scripts: Update to run on main branch of tvm. · umiswing/tvm-cutlass-eval@3b4bd37 · GitHub

Link to reproduce the testing:
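For context, the two builds being compared follow the standard TVM flow: partition for the CUTLASS BYOC backend in one case, and dispatch to the vendor libraries via `-libs=cudnn,cublas` in the other. A minimal sketch is below; it assumes a TVM build with USE_CUDA, USE_CUDNN, USE_CUBLAS, and USE_CUTLASS enabled, and the exact entry points can vary between TVM versions, so treat it as an outline rather than the actual eval script:

```python
import tvm
from tvm import relay
from tvm.relay.testing import resnet
from tvm.relay.op.contrib.cutlass import partition_for_cutlass
from tvm.contrib.cutlass import finalize_modules

# A stand-in ResNet50 workload; the eval repo uses its own model import.
mod, params = resnet.get_workload(num_layers=50, batch_size=1)

# CUTLASS path: offload supported patterns to CUTLASS via BYOC,
# then compile the generated CUTLASS kernels.
cutlass_mod = partition_for_cutlass(mod)
with tvm.transform.PassContext(opt_level=3):
    lib_cutlass = relay.build(
        cutlass_mod, target=tvm.target.Target("cuda", host="llvm"), params=params
    )
lib_cutlass = finalize_modules(lib_cutlass, "compile.so", "./tmp")

# cuDNN + cuBLAS path: let the CUDA target dispatch to the vendor libraries.
with tvm.transform.PassContext(opt_level=3):
    lib_cudnn = relay.build(mod, target="cuda -libs=cudnn,cublas", params=params)
```

Each resulting lib can then be loaded into a `tvm.contrib.graph_executor` module and timed with `module.benchmark(dev)`, which produces the execution-time summaries shown below.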

Performance using cutlass:

Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
    1.7029       1.5734       1.8596       1.5677       0.1418

Performance using cudnn+cublas:

Execution time summary:
  mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
    4.7330       4.6858       5.0248       4.6459       0.0859
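For reference, the speed-up implied by the two summaries above can be checked with a few lines of Python (numbers copied from the tables):

```python
# Speedup of the CUTLASS build over the cuDNN+cuBLAS build,
# computed from the execution-time summaries above (times in ms).
cutlass = {"mean": 1.7029, "median": 1.5734}
cudnn = {"mean": 4.7330, "median": 4.6858}

speedup = {k: cudnn[k] / cutlass[k] for k in cutlass}
print(f"mean speedup:   {speedup['mean']:.2f}x")    # prints 2.78x
print(f"median speedup: {speedup['median']:.2f}x")  # prints 2.98x
```

The median-based ratio (~2.98x) is what the "2.9 times faster" figure corresponds to; the mean-based ratio is ~2.78x.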

CUDA version and hardware:



TVM version:

commit 1d145f112115ca20a0cd2e37a726b1d1519cac4b


Changes to config.cmake:

@@ -46,7 +46,7 @@
 # - ON: enable CUDA with cmake's auto search
 # - OFF: disable CUDA
 # - /path/to/cuda: use specific path to cuda toolkit
-set(USE_CUDA OFF)
+set(USE_CUDA ON)

 # Whether enable ROCM runtime
@@ -142,7 +142,7 @@ set(USE_MICRO_STANDALONE_RUNTIME OFF)
 # - OFF: disable llvm, note this will disable CPU codegen
 #        which is needed for most cases
 # - /path/to/llvm-config: enable specific LLVM when multiple llvm-dev is available.
-set(USE_LLVM OFF)
+set(USE_LLVM /usr/bin/llvm-config-11)

 # Contrib libraries
@@ -217,10 +217,10 @@ set(USE_EDGETPU OFF)
 # - ON: enable cuDNN with cmake's auto search in CUDA directory
 # - OFF: disable cuDNN
 # - /path/to/cudnn: use specific path to cuDNN path
-set(USE_CUDNN OFF)
+set(USE_CUDNN ON)

 # Whether use cuBLAS
-set(USE_CUBLAS OFF)
+set(USE_CUBLAS ON)

 # Whether use MIOpen
@@ -416,7 +416,7 @@ set(USE_GTEST AUTO)

 # Enable using CUTLASS as a BYOC backend
 # Need to have USE_CUDA=ON
-set(USE_CUTLASS OFF)
+set(USE_CUTLASS ON)

Indeed, something seems off. Even on an RTX 3070 I got a better result than this. Have you tried nvprof?

Hi @masahi. I just ran profiling on an A100 and noticed that the GPU kernel times were similar. However, when the model is compiled with cuDNN and cuBLAS, it spends a lot of time in cudaFree() and cudaStreamCreateWithFlags(), which are synchronous operations. This might be the reason why ResNet50 compiled with cuDNN and cuBLAS is running slowly.

Links to profiling results:

Hi @masahi. Is this a performance issue in TVM, or is the acceleration expected because CUTLASS supports more fusion patterns and thereby avoids allocating and freeing intermediate memory?

Not sure, but in general cuDNN shouldn’t be allocating a lot of intermediate memory. CUTLASS doesn’t allocate any temp storage for resnet at all.

Hi @masahi. Some updates on this issue:

According to the timeline, when TVM compiles ResNet50 with cuDNN, the sum of the kernel durations is similar to that of the CUTLASS build, but the cuDNN build seems to spend a lot of time waiting for something while executing the kernels, whereas the CUTLASS build does not. This waiting behavior is strange, and it could be another reason why ResNet50 compiled with cuDNN is so slow.
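If the kernel sums really are about equal, the idle ("waiting") time per inference in the cuDNN build can be estimated directly from the numbers reported earlier. This is a rough back-of-the-envelope, assuming the CUTLASS run's mean time is almost entirely kernel execution:

```python
# Rough estimate of per-inference idle time in the cuDNN build, assuming
# both builds spend a similar total time inside GPU kernels and the
# CUTLASS run is dominated by kernel execution (times in ms, means).
wall_cudnn = 4.7330    # cuDNN+cuBLAS build
wall_cutlass = 1.7029  # CUTLASS build (~= total kernel time)

idle = wall_cudnn - wall_cutlass
idle_frac = idle / wall_cudnn
print(f"estimated idle time: {idle:.2f} ms ({idle_frac:.0%} of wall time)")
# prints: estimated idle time: 3.03 ms (64% of wall time)
```

Under those assumptions, roughly two thirds of the cuDNN build's wall time is spent outside GPU kernels, which is consistent with the gaps visible in the timeline.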

timeline of ResNet50 compiled with cuDNN

timeline of ResNet50 compiled with cutlass

Does the TVM team plan to debug this issue soon?

No, at least I don’t have a plan to work on cuDNN related issues. We are happy with cutlass.