FP16 GEMM CUDA kernel with TensorCore support for larger input shapes

iliacher · December 5, 2019, 10:12pm

Hi everyone!

I’ve been experimenting with the autotvm + TensorCore kernel generation, described in e.g. tutorial https://docs.tvm.ai/tutorials/optimize/opt_matmul_auto_tensorcore.html

I could reproduce TensorCore kernel generation for fp16 GEMMs for the input shapes specified in the tutorial, but when I try larger input shapes (e.g. N/M/K 8192/768/3072), the resulting kernel doesn’t seem to use TensorCores and is an order of magnitude slower than the cublas kernel. I’ve also experimented a bit with the schedule search space parameters but haven’t managed to get a faster kernel yet.

Is there something I’m missing for the larger inputs (e.g. schedule search space) or, is it because the existing TensorCore codegen’d implementation is better suited for smaller inputs? Would greatly appreciate suggestions here, especially if someone tried to codegen larger kernels.

Hzfengsy · December 6, 2019, 12:20am

Unfortunately, the auto TensorCore codegen works not good enough on large size because of its fragment number limitation. We suggest you to try tensorization directly to use TensorCore for large size GEMM and Conv2d. Please see details at https://docs.tvm.ai/tutorials/optimize/opt_conv_tensorcore.html