Hi everyone!
I’ve been experimenting with autotvm + TensorCore kernel generation, as described e.g. in the tutorial https://docs.tvm.ai/tutorials/optimize/opt_matmul_auto_tensorcore.html
I could reproduce TensorCore kernel generation for fp16 GEMMs with the input shapes specified in the tutorial, but when I try larger input shapes (e.g. N/M/K = 8192/768/3072), the resulting kernel doesn’t appear to use TensorCores and is an order of magnitude slower than the corresponding cuBLAS kernel. I’ve also experimented a bit with the schedule search-space parameters but haven’t managed to get a faster kernel yet (a sketch of my tuning setup is below).
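For reference, here’s roughly how I’m setting up the tuning for the larger shape, adapted from the tutorial. This assumes `test_gemm` is the `@autotvm.template` defined in the tutorial; the shapes, trial count, and log-file name are just my settings and may not be optimal:

```python
import logging
import sys

import tvm
from tvm import autotvm

# test_gemm is the @autotvm.template from the tutorial linked above.
# Larger shape I'm trying: N/M/K = 8192/768/3072 (L is the reduction dim).
N, L, M = 8192, 3072, 768

task = autotvm.task.create(test_gemm, args=(N, L, M, 'float16', 'NN'),
                           target='cuda')
print(task.config_space)

logging.getLogger('autotvm').setLevel(logging.DEBUG)
logging.getLogger('autotvm').addHandler(logging.StreamHandler(sys.stdout))

measure_option = autotvm.measure_option(
    builder='local',
    runner=autotvm.LocalRunner(number=5))

tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(n_trial=1000,
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file('matmul.log')])
```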
Is there something I’m missing for the larger inputs (e.g. in the schedule search space), or is the existing TensorCore codegen’d implementation simply better suited to smaller inputs? I’d greatly appreciate suggestions here, especially from anyone who has tried to generate larger kernels.
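For completeness, this is how I’m checking whether the tuned kernel actually uses TensorCores: I apply the best config from the log and look for wmma intrinsics in the generated CUDA source (again adapted from the end of the tutorial, same `test_gemm` and shapes as above):

```python
with autotvm.apply_history_best('matmul.log'):
    with tvm.target.create('cuda'):
        s, arg_bufs = test_gemm(N, L, M, 'float16', 'NN')
        func = tvm.build(s, arg_bufs)

# The TensorCore path shows up as wmma/mma.sync instructions in the
# generated CUDA source; for the larger shape I don't see any.
cuda_source = func.imported_modules[0].get_source()
print('uses TensorCores:', 'wmma' in cuda_source)
```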