How does the integration of CUTLASS reduce tuning time for GEMM/Conv2D?

Hello! I am interested in TVM’s work on CUTLASS. According to [RFC][BYOC] NVIDIA CUTLASS Integration and the talk "Better Tensor Core Support in TVM with CUTLASS" @ TVMCon 2021, the CUTLASS integration is meant not only to improve performance but also to reduce tuning time. I have read the code under tvm/python/tvm/contrib/cutlass, and it seems that TVM picks the best kernel in a simple way: it profiles all candidate kernels and selects the fastest one, with no rule to filter any candidates out beforehand. I am confused about how such an exhaustive search can reduce tuning time. Is the reduction due to a smaller search space, since with CUTLASS we only need to search over combinations of template parameters rather than a full schedule space?
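For context, here is a minimal sketch of the selection strategy described above: profile every candidate and keep the fastest, with no pruning. The names (`pick_best`, `run_kernel`, `simulated_ms`) are hypothetical illustrations, not TVM's actual API.

```python
# Hypothetical sketch of exhaustive kernel selection: profile all
# candidates and return the fastest. In real TVM each candidate is a
# CUTLASS template instantiation that gets compiled and timed on device;
# here the timing is simulated with a stored number.

def run_kernel(kernel, workload):
    # Stand-in for compiling and benchmarking one candidate kernel.
    return kernel["simulated_ms"]

def pick_best(candidate_kernels, workload):
    """Profile every candidate (no filtering) and keep the fastest."""
    best, best_ms = None, float("inf")
    for kernel in candidate_kernels:
        ms = run_kernel(kernel, workload)
        if ms < best_ms:
            best, best_ms = kernel, ms
    return best, best_ms

candidates = [
    {"name": "cutlass_tensorop_128x128", "simulated_ms": 0.42},
    {"name": "cutlass_tensorop_256x64",  "simulated_ms": 0.35},
    {"name": "cutlass_simt_64x64",       "simulated_ms": 0.80},
]
best, ms = pick_best(candidates, workload=None)
```

Even though this loop is exhaustive, the candidate set is just the valid template-parameter combinations (tile shapes, stages, etc.), which is far smaller than the schedule space a general auto-tuner like Ansor would explore.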

By the way, according to [RFC][BYOC] NVIDIA CUTLASS Integration, TVM compiles all candidate kernels into a ~7 GB library ahead of time. Would it be better to compile them just-in-time instead, to reduce both compile time and binary size?