Poor Performance When Tuning Large-Scale Tasks (nn.Linear) in Auto-Scheduler

Hi @qiaoming, for NVIDIA devices with architecture sm_70 or newer, dense should leverage Tensor Cores for better performance, but the auto-scheduler can only tune for CUDA cores. Consider using MetaSchedule with TensorIR to utilize the Tensor Cores on the 4090:

from tvm import meta_schedule as ms

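# mod (a TensorIR IRModule), target, trials, and workdir are defined by the caller.
database = ms.tune_tir(
    mod=mod,
    target=target,
    max_trials_global=trials,
    num_trials_per_iter=16,
    work_dir=workdir,
    # Select the tensor-core schedule rules, postprocessors, and mutators.
    space=ms.space_generator.PostOrderApply(
        sch_rules="cuda-tensorcore",
        postprocs="cuda-tensorcore",
        mutator_probs="cuda-tensorcore",
    ),
)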
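
If the same script has to run on older GPUs as well, one option is to pick the rule set from the architecture string. This is only a rough sketch: `pick_rule_set` is a hypothetical helper, not a TVM API, and it assumes the arch is written like `sm_89` (the 4090) and that `"cuda"` is the non-tensor-core rule set accepted by your TVM version.

```python
import re


def pick_rule_set(arch: str) -> str:
    """Return a MetaSchedule rule-set name for a CUDA arch string such as "sm_89".

    Hypothetical helper: Tensor Cores are available from sm_70 onward, so older
    architectures fall back to the plain "cuda" rule set (assumed name).
    """
    match = re.fullmatch(r"sm_(\d+)[a-z]?", arch)
    if match is None:
        raise ValueError(f"unrecognized CUDA arch string: {arch!r}")
    compute_capability = int(match.group(1))
    return "cuda-tensorcore" if compute_capability >= 70 else "cuda"


if __name__ == "__main__":
    for arch in ("sm_61", "sm_70", "sm_89"):
        print(arch, "->", pick_rule_set(arch))
```

The returned string can then be passed as `sch_rules`, `postprocs`, and `mutator_probs` in the `PostOrderApply` call above.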

Or you can also check out dlight or fastdlight (which I have temporarily moved into a new project, microsoft/BitBLAS) to get a high-performance kernel quickly.