Hello! I am currently testing the ResNet50 model on an NVIDIA A100 (40 GB) using https://github.com/masahi/tvm-cutlass-eval. With CUTLASS, the compiled ResNet50 runs about 2.9× faster than the same model compiled with cuDNN and cuBLAS. I find such a large speed-up surprising; is there anything I am overlooking? Below are the specifics of my testing process:
To run resnet50/run.py against TVM's main-branch API, I made some modifications to the testing scripts: umiswing/tvm-cutlass-eval@3b4bd37 ("Update to run on main branch of tvm").
Link to reproduce the test (`python run.py`): https://github.com/umiswing/tvm-cutlass-eval/blob/master/resnet50/run.py
Performance using CUTLASS:
Execution time summary:
mean (ms)   median (ms)   max (ms)   min (ms)   std (ms)
1.7029      1.5734        1.8596     1.5677     0.1418
Performance using cuDNN + cuBLAS:
Execution time summary:
mean (ms)   median (ms)   max (ms)   min (ms)   std (ms)
4.7330      4.6858        5.0248     4.6459     0.0859
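For reference, the speed-up quoted above can be reproduced directly from the two execution-time summaries (the mean gives roughly 2.8×, the median roughly 3.0×):

```python
# Speed-up implied by the two runs; numbers copied from the summaries above.
cutlass = {"mean": 1.7029, "median": 1.5734}
cudnn_cublas = {"mean": 4.7330, "median": 4.6858}

for stat in ("mean", "median"):
    speedup = cudnn_cublas[stat] / cutlass[stat]
    print(f"{stat}: {speedup:.2f}x")
# prints:
# mean: 2.78x
# median: 2.98x
```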
CUDA version and hardware:
CUDA 11.7
NVIDIA A100 40 GB
TVM version:
commit 1d145f112115ca20a0cd2e37a726b1d1519cac4b
config.cmake
@@ -46,7 +46,7 @@
# - ON: enable CUDA with cmake's auto search
# - OFF: disable CUDA
# - /path/to/cuda: use specific path to cuda toolkit
-set(USE_CUDA OFF)
+set(USE_CUDA ON)
# Whether enable ROCM runtime
#
@@ -142,7 +142,7 @@ set(USE_MICRO_STANDALONE_RUNTIME OFF)
# - OFF: disable llvm, note this will disable CPU codegen
# which is needed for most cases
# - /path/to/llvm-config: enable specific LLVM when multiple llvm-dev is available.
-set(USE_LLVM OFF)
+set(USE_LLVM /usr/bin/llvm-config-11)
#---------------------------------------------
# Contrib libraries
@@ -217,10 +217,10 @@ set(USE_EDGETPU OFF)
# - ON: enable cuDNN with cmake's auto search in CUDA directory
# - OFF: disable cuDNN
# - /path/to/cudnn: use specific path to cuDNN path
-set(USE_CUDNN OFF)
+set(USE_CUDNN ON)
# Whether use cuBLAS
-set(USE_CUBLAS OFF)
+set(USE_CUBLAS ON)
# Whether use MIOpen
set(USE_MIOPEN OFF)
@@ -416,7 +416,7 @@ set(USE_GTEST AUTO)
# Enable using CUTLASS as a BYOC backend
# Need to have USE_CUDA=ON
-set(USE_CUTLASS OFF)
+set(USE_CUTLASS ON)
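For context, the CUTLASS numbers come from TVM's BYOC path rather than from TOPI schedules. A minimal sketch of that flow, modeled on the run.py linked above and assuming the main-branch API at the commit listed (function names such as `finalize_modules` and the target-list form of `relay.build` may differ across TVM versions):

```python
# Sketch of the CUTLASS BYOC build flow; assumes TVM built with
# USE_CUTLASS=ON and a CUDA GPU. APIs are version-dependent.
import tvm
from tvm import relay
from tvm.relay.op.contrib.cutlass import partition_for_cutlass
from tvm.contrib.cutlass import finalize_modules

def build_with_cutlass(mod, params):
    # Offload supported patterns (conv2d, dense, ...) to CUTLASS partitions.
    mod = partition_for_cutlass(mod)
    host = tvm.target.Target("llvm")
    cuda = tvm.target.Target("cuda", host=host)
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(
            mod, target=[cuda, tvm.target.Target("cutlass")], params=params
        )
    # Compile the generated CUTLASS kernels and link them into the module.
    return finalize_modules(lib, "compile.so", "./tmp")
```

The cuDNN/cuBLAS baseline simply builds with `target="cuda -libs=cudnn,cublas"` instead, with no partitioning step.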