Hi @qiaoming, on NVIDIA devices with architecture sm_70 or newer, dense should leverage tensor cores for better performance, but the auto scheduler can only tune for CUDA cores. Consider using MetaSchedule with TensorIR to utilize the tensor cores on the 4090:
from tvm import meta_schedule as ms

database = ms.tune_tir(
    mod=mod,
    target=target,
    max_trials_global=trails,
    num_trials_per_iter=16,
    work_dir=workdir,
    space=ms.space_generator.PostOrderApply(
        sch_rules="cuda-tensorcore",
        postprocs="cuda-tensorcore",
        mutator_probs="cuda-tensorcore",
    ),
)
Alternatively, you can check out dlight or fastdlight (which I have temporarily moved into a new project, Microsoft/BitBLAS) to get a high-performance kernel quickly.
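If the same tuning script has to run on both older and newer GPUs, the choice between the "cuda-tensorcore" rule set used above and the plain "cuda" rule set could be derived from the arch string. The helper below is only a rough sketch: `pick_rule_set` is a hypothetical name, not part of TVM, and it assumes that tensor cores are available from Volta (sm_70) onward and that the arch is given in the usual `sm_XX` form.

```python
import re


def pick_rule_set(arch: str) -> str:
    """Hypothetical helper (not a TVM API): map a CUDA arch string such as
    "sm_89" to the rule-set name passed to PostOrderApply."""
    match = re.fullmatch(r"sm_(\d+)a?", arch)
    if match is None:
        raise ValueError(f"unrecognized CUDA arch string: {arch!r}")
    # Assumption: tensor cores first appeared with Volta, i.e. sm_70.
    return "cuda-tensorcore" if int(match.group(1)) >= 70 else "cuda"


print(pick_rule_set("sm_89"))  # RTX 4090 (Ada) -> cuda-tensorcore
print(pick_rule_set("sm_61"))  # Pascal -> cuda
```

The returned string could then be passed as `sch_rules`, `postprocs`, and `mutator_probs` in the `PostOrderApply` call, so that pre-Volta cards fall back to the CUDA-core search space.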