Today I upgraded CUDA from 11.1 to 11.4 and found that some of the TVM kernels I wrote about earlier show a significant performance degradation, such as GEMMs at different precisions on Tensor Cores or SIMT cores. For example:
Before the upgrade, I wrote TVM code that produces an efficient dp4a GEMM kernel (a state-of-the-art implementation that matches CUTLASS, with the permutation enabled by TensorIR; it takes about 140 ms at M=N=K=16384). After the upgrade, the same kernel takes about 800 ms.
I also tested the newest release, CUDA 11.8, and it behaves the same as CUDA 11.4.
One interesting thing is that the TVM kernel is a re-implementation of my hand-written dp4a CUDA kernel (labeled "hands-on" in the logs below), so the two have the same code structure and seem to differ only in how offsets are computed. Yet the performance of the hand-written kernel does not change across CUDA versions. What a strange thing!
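For readers unfamiliar with the instruction: dp4a multiplies four packed int8 pairs and accumulates the products into an int32. The sketch below is not either of the kernels discussed in this post. It is a naive, untiled int8 GEMM under assumed names (`naive_dp4a_gemm`, `dp4a_ref`) and an assumed layout (A stored as M x K, B stored as N x K, K a multiple of 4), checked against a host reference.

```cuda
// Build (dp4a needs compute capability >= 6.1), e.g.:
//   nvcc -gencode arch=compute_86,code=sm_86 -O3 dp4a_sketch.cu -o dp4a_sketch
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Host reference for one dp4a step: treat a and b as four packed signed
// 8-bit lanes, multiply lane-wise, and add the four products to c.
static int dp4a_ref(int a, int b, int c) {
    for (int i = 0; i < 4; ++i) {
        int8_t ai = static_cast<int8_t>(static_cast<uint32_t>(a) >> (8 * i));
        int8_t bi = static_cast<int8_t>(static_cast<uint32_t>(b) >> (8 * i));
        c += static_cast<int>(ai) * static_cast<int>(bi);
    }
    return c;
}

// Naive int8 GEMM, one thread per output element (hypothetical kernel,
// no tiling or shared memory). A is M x K row-major; B is stored as N x K
// so both operands are contiguous along K. K4 = K / 4 packed words per row.
__global__ void naive_dp4a_gemm(const int* A, const int* B, int* C,
                                int M, int N, int K4) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    int acc = 0;
    for (int k = 0; k < K4; ++k)
        acc = __dp4a(A[row * K4 + k], B[col * K4 + k], acc);
    C[row * N + col] = acc;
}

int main() {
    const int M = 64, N = 64, K = 64, K4 = K / 4;  // small illustrative size
    std::vector<int> hA(M * K4), hB(N * K4), hC(M * N);
    // rand() words serve as arbitrary packed int8 data.
    for (auto& v : hA) v = rand();
    for (auto& v : hB) v = rand();

    int *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(int));
    cudaMalloc(&dB, hB.size() * sizeof(int));
    cudaMalloc(&dC, hC.size() * sizeof(int));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(int), cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
    naive_dp4a_gemm<<<grid, block>>>(dA, dB, dC, M, N, K4);
    cudaError_t err = cudaMemcpy(hC.data(), dC, hC.size() * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Compare every output element against the host reference.
    int mismatches = 0;
    for (int row = 0; row < M; ++row)
        for (int col = 0; col < N; ++col) {
            int ref = 0;
            for (int k = 0; k < K4; ++k)
                ref = dp4a_ref(hA[row * K4 + k], hB[col * K4 + k], ref);
            mismatches += (ref != hC[row * N + col]);
        }
    printf("mismatches: %d\n", mismatches);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return mismatches != 0;
}
```

The index expressions `row * K4 + k` and `col * K4 + k` are the kind of offset computation where a generated kernel and a hand-written one can differ textually while computing the same addresses.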
The performance log:
CUDA 11.1:
/usr/local/cuda-11.1/bin/nvcc -gencode arch=compute_86,code=sm_86 -O3 ./evaluate_dp4a_int8_int32_nn.cu -o evaluate_dp4a_int8_int32_nn ; ./evaluate_dp4a_int8_int32_nn
CUDA VERSION 11010
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 138.567 ms
tvm codegen cuda kernel time: 137.942 ms
CUDA 11.4:
/usr/local/cuda-11.4/bin/nvcc -gencode arch=compute_86,code=sm_86 -O3 ./evaluate_dp4a_int8_int32_nn.cu -o evaluate_dp4a_int8_int32_nn ; ./evaluate_dp4a_int8_int32_nn
CUDA VERSION 11040
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 140.32 ms
tvm codegen cuda kernel time: 804.435 ms
CUDA 11.8:
CUDA VERSION 11080
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 140.77 ms
tvm codegen cuda kernel time: 804.903 ms
By the way, these results were measured on Ubuntu 18.04 with four 24 GB RTX 3090 GPUs.
The code to reproduce the performance log:
The code to reproduce the tvm gpu kernel:
I also benchmarked this code on an 80 GB A100 with CUDA 11.6 and saw a similar performance gap:
CUDA VERSION 11060
Problem Size : M 16384 N 16384 K 16384
hands-on cuda kernel time: 128.995 ms
tvm codegen cuda kernel time: 500.599 ms
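Numbers in the format of the logs above can be collected with the `CUDART_VERSION` macro and CUDA events. The following is a minimal sketch, not the benchmark used in this post: `placeholder_kernel` is a hypothetical stand-in for the GEMM kernels, and the warm-up and iteration counts are arbitrary. It also prints the per-kernel register and local-memory usage reported by `cudaFuncGetAttributes`, on the assumption (not confirmed here) that comparing those values between toolkit versions may help show whether the compiler treats the two kernels differently.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the GEMM kernels under test.
__global__ void placeholder_kernel(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * i;
}

int main() {
    // Runtime headers define CUDART_VERSION, e.g. 11010 for CUDA 11.1.
    printf("CUDA VERSION %d\n", CUDART_VERSION);

    // Resource usage of the kernel as compiled by this toolkit. A non-zero
    // local memory size can indicate register spilling.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, placeholder_kernel);
    printf("registers/thread: %d, local memory/thread: %zu bytes\n",
           attr.numRegs, attr.localSizeBytes);

    const int n = 1 << 20;
    int* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    const int threads = 256, blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time initialization is not included in the timing.
    placeholder_kernel<<<blocks, threads>>>(d_out, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 10;  // arbitrary; average over several launches
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        placeholder_kernel<<<blocks, threads>>>(d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("placeholder kernel time: %g ms\n", ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Building the same source with each toolkit's `nvcc` and comparing the printed register and local-memory figures for the hand-written and generated kernels is one way to start narrowing down where the two diverge.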