I mean the kernels after tuning are slower.
More precisely, the measurements above are for end-to-end execution of models that contain many kernels. Some kernels get faster and some get slower, but the net effect is that each model runs slower after autoscheduling than the same model without tuning.
I can look at the profile data to find which kernels got slower. However, I'm not quite sure what steps are typically taken after that.
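
For the comparison step, here is a minimal sketch of what I have in mind, assuming the per-kernel times from both runs have already been exported into plain dictionaries. The kernel names and timings below are made up for illustration, and `rank_regressions` is a hypothetical helper, not part of any profiler API.

```python
def rank_regressions(untuned_us, tuned_us):
    """Rank kernels found in both profiles by time lost after tuning.

    Both arguments map kernel name -> time in microseconds.
    Returns (name, before, after, delta) tuples, worst regression first.
    """
    rows = []
    for name in untuned_us.keys() & tuned_us.keys():
        before = untuned_us[name]
        after = tuned_us[name]
        rows.append((name, before, after, after - before))
    # A positive delta means the kernel got slower after tuning.
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Hypothetical per-kernel times (microseconds), not real measurements.
untuned = {"conv2d_1": 120.0, "dense_1": 40.0, "softmax": 5.0}
tuned = {"conv2d_1": 90.0, "dense_1": 95.0, "softmax": 6.0}

for name, before, after, delta in rank_regressions(untuned, tuned):
    print(f"{name:10s} {before:8.1f} {after:8.1f} {delta:+8.1f}")
```

Sorting by absolute time lost rather than by ratio keeps the focus on the kernels that actually account for the end-to-end slowdown, since a small kernel can regress badly in relative terms without mattering much overall.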