CUDA kernel performance degradation with newly released nvcc compiler

Today I upgraded from CUDA 11.1 to CUDA 11.4, and I found that some of the TVM kernels I wrote about earlier show a significant performance degradation, such as GEMM at different precisions with Tensor Cores/SIMT. For example:

Before I upgraded CUDA, I had written TVM code that produces an efficient dp4a GEMM kernel (a state-of-the-art implementation, matching CUTLASS performance with permutation enabled via TensorIR; it takes about 140 ms at M = N = K = 16384). After the upgrade, the same kernel slowed down to about 800 ms.

I even tested the newest CUDA 11.8, but it shows the same behavior as CUDA 11.4.

One interesting thing is that the TVM kernel is a re-implementation of my hand-written dp4a CUDA kernel, so the two have the same code structure and seem to differ only in how offsets are computed. Yet the performance of my hand-written kernel does not change across CUDA versions. What a strange thing!
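To make "only the offset computation differs" concrete, here is a minimal, hypothetical sketch (not the actual kernels): two dp4a inner loops over the same shared-memory tile, where style A hoists the base offset the way a hand-written kernel would, and style B recomputes a flattened affine index every iteration, which is closer to what codegen tends to emit. If the newer ptxas fails to strength-reduce the second pattern, it could raise register pressure:

// Hypothetical illustration only; compile with -arch=sm_61 or higher for __dp4a.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp4a_offset_styles(const int* __restrict__ a,
                                   const int* __restrict__ b,
                                   int* __restrict__ c) {
  // One 16x16 tile of dp4a operands (each int packs four int8 values).
  __shared__ int sa[256];
  __shared__ int sb[256];
  const int tid = threadIdx.x;  // launched with 256 threads
  sa[tid] = a[tid];
  sb[tid] = b[tid];
  __syncthreads();

  // Style A (hand-written): base offset hoisted out of the loop.
  int acc_a = 0;
  const int* pa = &sa[(tid / 16) * 16];
  for (int k = 0; k < 16; ++k)
    acc_a = __dp4a(pa[k], sb[k * 16 + tid % 16], acc_a);

  // Style B (codegen-like): full affine index recomputed every iteration.
  int acc_b = 0;
  for (int k = 0; k < 16; ++k)
    acc_b = __dp4a(sa[(tid / 16) * 16 + k], sb[k * 16 + tid % 16], acc_b);

  // The two styles must agree; write -1 if they ever diverge.
  c[tid] = (acc_a == acc_b) ? acc_a : -1;
}

int main() {
  int *dA, *dB, *dC;
  cudaMalloc(&dA, 256 * sizeof(int));
  cudaMalloc(&dB, 256 * sizeof(int));
  cudaMalloc(&dC, 256 * sizeof(int));
  dp4a_offset_styles<<<1, 256>>>(dA, dB, dC);  // input contents don't matter here
  cudaDeviceSynchronize();
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}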

The performance log:

CUDA 11.1:

/usr/local/cuda-11.1/bin/nvcc -gencode arch=compute_86,code=sm_86 -O3 ./evaluate_dp4a_int8_int32_nn.cu -o evaluate_dp4a_int8_int32_nn ; ./evaluate_dp4a_int8_int32_nn

CUDA VERSION 11010
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 138.567 ms
tvm codegen cuda kernel time: 137.942 ms

CUDA 11.4:

/usr/local/cuda-11.4/bin/nvcc -gencode arch=compute_86,code=sm_86 -O3 ./evaluate_dp4a_int8_int32_nn.cu -o evaluate_dp4a_int8_int32_nn ; ./evaluate_dp4a_int8_int32_nn

CUDA VERSION 11040
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 140.32 ms
tvm codegen cuda kernel time: 804.435 ms

CUDA 11.8:

CUDA VERSION 11080
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 140.77 ms
tvm codegen cuda kernel time: 804.903 ms

By the way, I got these results under Ubuntu 18.04 with four 24 GB RTX 3090 GPUs.

The code to reproduce the performance log:
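(As a stand-in for the attached source, here is a minimal sketch of the kind of timing harness used to produce logs like the above, built on CUDA events. The kernel name, body, and grid/block shape below are placeholders, not the actual implementation:)

// Minimal harness sketch; gemm_dp4a_nn, the tiling, and the iteration
// count are hypothetical placeholders for the real kernel under test.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gemm_dp4a_nn(const signed char* A, const signed char* B,
                             int* C, int M, int N, int K) {
  // body elided; see the attached kernel source
}

int main() {
  const int M = 16384, N = 16384, K = 16384;
  signed char *dA, *dB;
  int *dC;
  cudaMalloc(&dA, (size_t)M * K);
  cudaMalloc(&dB, (size_t)K * N);
  cudaMalloc(&dC, (size_t)M * N * sizeof(int));

  dim3 block(256);                              // hypothetical tiling
  dim3 grid((N + 127) / 128, (M + 127) / 128);

  gemm_dp4a_nn<<<grid, block>>>(dA, dB, dC, M, N, K);  // warm-up launch
  cudaDeviceSynchronize();

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  const int iters = 10;
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i)
    gemm_dp4a_nn<<<grid, block>>>(dA, dB, dC, M, N, K);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);

  printf("CUDA VERSION %d\n", CUDART_VERSION);
  printf("Problem Size : M %d N %d K %d\n", M, N, K);
  printf("kernel time: %g ms\n", ms / iters);
  return 0;
}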

The code to reproduce the TVM GPU kernel:

I also benchmarked this code on an 80 GB A100 with CUDA 11.6 and got the same performance gap:

CUDA VERSION 11060
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 128.995 ms
tvm codegen cuda kernel time: 500.599 ms

I guess the newer versions of nvcc have some bug or negative optimization, and the kernel that TVM generates just hits the corner case…

It seems this schedule triggers register spilling under the newer nvcc.
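One way to check the spilling hypothesis directly is to pass -Xptxas -v to nvcc, which makes ptxas report per-kernel register usage and spill store/load byte counts; comparing the output of 11.1 and 11.4 for the TVM-generated kernel should show where the registers go. For example:

/usr/local/cuda-11.4/bin/nvcc -gencode arch=compute_86,code=sm_86 -O3 -Xptxas -v ./evaluate_dp4a_int8_int32_nn.cu -o evaluate_dp4a_int8_int32_nn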

This is definitely an interesting observation, and thanks for reporting! Will it still affect performance after MetaSchedule auto-tuning?

Also CC: @Hzfengsy