CUDA kernel performance degradation with newly released nvcc compiler

This is definitely an interesting observation, and thanks for reporting! Does it still affect performance after MetaSchedule auto-tuning?

Also CC: @Hzfengsy