Glad to see the RFC! TVM's performance on large GEMMs has troubled me for a long time. Looking forward to further benchmarks of cutlass+fusion against cublas+nofusion.
One potential issue: autotvm currently selects the best implementation between autotuned-gemm and cublas-gemm based on the standalone GEMM performance, and only then does the fusion. If cutlass is integrated, we would need to autotune at the sub-graph level (the GEMM together with the ops fused into it) and then select the best candidate.
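To make the concern concrete, here is a toy sketch of the two selection strategies. Everything in it is an assumption for illustration: the latency numbers are made up (not measurements), the names `GEMM_COST`, `CAN_FUSE`, `select_op_level`, and `select_subgraph_level` are hypothetical and not TVM APIs, and the cost model (a fixed epilogue cost, a small fused overhead) is deliberately simplistic.

```python
# Toy model: all numbers are invented for illustration, not benchmark results.
# Hypothetical standalone GEMM latency per backend, in microseconds.
GEMM_COST = {"autotuned": 120.0, "cublas": 100.0, "cutlass": 105.0}

# Assumption: whether the backend can fuse an elementwise epilogue into the GEMM kernel.
CAN_FUSE = {"autotuned": True, "cublas": False, "cutlass": True}

EPILOGUE_COST = 15.0   # hypothetical cost of running the epilogue as a separate kernel
FUSED_OVERHEAD = 2.0   # hypothetical extra cost when the epilogue is fused into the GEMM


def subgraph_cost(backend):
    """Latency of GEMM + epilogue for one backend under the toy cost model."""
    extra = FUSED_OVERHEAD if CAN_FUSE[backend] else EPILOGUE_COST
    return GEMM_COST[backend] + extra


def select_op_level():
    """Pick the fastest standalone GEMM first, then pay whatever the epilogue costs."""
    best = min(GEMM_COST, key=GEMM_COST.get)
    return best, subgraph_cost(best)


def select_subgraph_level():
    """Compare backends on the whole sub-graph (GEMM + epilogue) and pick the best."""
    best = min(GEMM_COST, key=subgraph_cost)
    return best, subgraph_cost(best)


if __name__ == "__main__":
    print("op-level choice:       ", select_op_level())
    print("sub-graph-level choice:", select_subgraph_level())
```

With these made-up numbers, op-level selection picks the backend with the fastest standalone GEMM even though it cannot fuse the epilogue, while sub-graph-level selection picks a backend that is slightly slower on the bare GEMM but faster end to end. Whether this actually happens in practice depends on real measurements.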