[MetaSchedule] Tricks to Improve Codegen Performance for Conv2D Ops

Are there any tricks to improve the performance of the code generated by MetaSchedule (MS), so that it can outperform CUTLASS conv2d kernels on large shapes?

For example, are there any space generation rules, such as cuda-tensorcore, that we can apply during tuning?