CUDA kernel performance degradation with newly released nvcc compiler

Today I upgraded from CUDA 11.1 to CUDA 11.4, and I found that some of the TVM kernels I wrote about earlier show a significant performance degradation, such as GEMM at different precisions with Tensor Cores/SIMT. For example:

Before I upgraded CUDA, I had written TVM code that produces an efficient dp4a GEMM kernel (a state-of-the-art implementation, matching CUTLASS performance with permutation enabled via TensorIR; it takes about 140 ms at M = N = K = 16384). After the upgrade, the same kernel slowed down to about 800 ms.

I even tested the newest CUDA 11.8, but it shows the same behavior as CUDA 11.4.

One interesting thing is that the TVM kernel is a re-implementation of my hand-written dp4a CUDA kernel, so the two have the same code structure and seem to differ only in how offsets are computed. Yet the performance of my hand-written kernel does not change across CUDA versions. What a strange thing!
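To make "only the offset computation differs" concrete, here is a minimal, hypothetical sketch (not the actual kernels): two dp4a inner loops over the same shared-memory tile, where style A hoists the base offset the way a hand-written kernel would, and style B recomputes a flattened affine index every iteration, which is closer to what codegen tends to emit. If the newer ptxas fails to strength-reduce the second pattern, it could raise register pressure:

// Hypothetical illustration only; compile with -arch=sm_61 or higher for __dp4a.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp4a_offset_styles(const int* __restrict__ a,
                                   const int* __restrict__ b,
                                   int* __restrict__ c) {
  // One 16x16 tile of dp4a operands (each int packs four int8 values).
  __shared__ int sa[256];
  __shared__ int sb[256];
  const int tid = threadIdx.x;  // launched with 256 threads
  sa[tid] = a[tid];
  sb[tid] = b[tid];
  __syncthreads();

  // Style A (hand-written): base offset hoisted out of the loop.
  int acc_a = 0;
  const int* pa = &sa[(tid / 16) * 16];
  for (int k = 0; k < 16; ++k)
    acc_a = __dp4a(pa[k], sb[k * 16 + tid % 16], acc_a);

  // Style B (codegen-like): full affine index recomputed every iteration.
  int acc_b = 0;
  for (int k = 0; k < 16; ++k)
    acc_b = __dp4a(sa[(tid / 16) * 16 + k], sb[k * 16 + tid % 16], acc_b);

  // The two styles must agree; write -1 if they ever diverge.
  c[tid] = (acc_a == acc_b) ? acc_a : -1;
}

int main() {
  int *dA, *dB, *dC;
  cudaMalloc(&dA, 256 * sizeof(int));
  cudaMalloc(&dB, 256 * sizeof(int));
  cudaMalloc(&dC, 256 * sizeof(int));
  dp4a_offset_styles<<<1, 256>>>(dA, dB, dC);  // input contents don't matter here
  cudaDeviceSynchronize();
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}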

The performance log:

CUDA 11.1:

/usr/local/cuda-11.1/bin/nvcc -gencode arch=compute_86,code=sm_86 -O3 ./evaluate_dp4a_int8_int32_nn.cu -o evaluate_dp4a_int8_int32_nn ; ./evaluate_dp4a_int8_int32_nn

CUDA VERSION 11010
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 138.567 ms
tvm codegen cuda kernel time: 137.942 ms

CUDA 11.4:

/usr/local/cuda-11.4/bin/nvcc -gencode arch=compute_86,code=sm_86 -O3 ./evaluate_dp4a_int8_int32_nn.cu -o evaluate_dp4a_int8_int32_nn ; ./evaluate_dp4a_int8_int32_nn

CUDA VERSION 11040
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 140.32 ms
tvm codegen cuda kernel time: 804.435 ms

CUDA 11.8:

CUDA VERSION 11080
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 140.77 ms
tvm codegen cuda kernel time: 804.903 ms

By the way, I got these results under Ubuntu 18.04 with four 24 GB RTX 3090 GPUs.

The code to reproduce the performance log:
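(As a stand-in for the attached source, here is a minimal sketch of the kind of timing harness used to produce logs like the above, built on CUDA events. The kernel name, body, and grid/block shape below are placeholders, not the actual implementation:)

// Minimal harness sketch; gemm_dp4a_nn, the tiling, and the iteration
// count are hypothetical placeholders for the real kernel under test.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gemm_dp4a_nn(const signed char* A, const signed char* B,
                             int* C, int M, int N, int K) {
  // body elided; see the attached kernel source
}

int main() {
  const int M = 16384, N = 16384, K = 16384;
  signed char *dA, *dB;
  int *dC;
  cudaMalloc(&dA, (size_t)M * K);
  cudaMalloc(&dB, (size_t)K * N);
  cudaMalloc(&dC, (size_t)M * N * sizeof(int));

  dim3 block(256);                              // hypothetical tiling
  dim3 grid((N + 127) / 128, (M + 127) / 128);

  gemm_dp4a_nn<<<grid, block>>>(dA, dB, dC, M, N, K);  // warm-up launch
  cudaDeviceSynchronize();

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  const int iters = 10;
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i)
    gemm_dp4a_nn<<<grid, block>>>(dA, dB, dC, M, N, K);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);

  printf("CUDA VERSION %d\n", CUDART_VERSION);
  printf("Problem Size : M %d N %d K %d\n", M, N, K);
  printf("kernel time: %g ms\n", ms / iters);
  return 0;
}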

The code to reproduce the TVM GPU kernel:

I also benchmarked this code on an 80 GB A100 with CUDA 11.6 and got the same performance gap:

CUDA VERSION 11060
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 128.995 ms
tvm codegen cuda kernel time: 500.599 ms

I guess the newer versions of nvcc have some bug or negative optimization, and the kernel that TVM generates just hits the corner case…

It seems this schedule triggers register spilling under the newer nvcc.
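One way to check the spilling hypothesis directly is to pass -Xptxas -v to nvcc, which makes ptxas report per-kernel register usage and spill store/load byte counts; comparing the output of 11.1 and 11.4 for the TVM-generated kernel should show where the registers go. For example:

/usr/local/cuda-11.4/bin/nvcc -gencode arch=compute_86,code=sm_86 -O3 -Xptxas -v ./evaluate_dp4a_int8_int32_nn.cu -o evaluate_dp4a_int8_int32_nn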

This is definitely an interesting observation, and thanks for reporting! Will it still affect performance after MetaSchedule auto-tuning?

Also CC: @Hzfengsy