Difference between an auto-tuning template and a TOPI op

Hi, I'm trying to improve the performance of the batch_matmul operator in TOPI. I wrote a new schedule for the op and tuned it standalone, which yields about 232.59 GFLOPS. However, when I put the same schedule into the TOPI batch_matmul file under the x86 folder, extract tasks from a neural network, and tune them, it performs much worse (93.31 GFLOPS), even though the input tensor shapes are the same (I also checked the tuner's parameters). What could be the problem?
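For reference, this is roughly how I set up the two cases (a minimal sketch, not my exact script; the shapes, log file names, and target string are placeholders for my actual setup):

```python
import tvm
from tvm import autotvm, relay
from tvm.autotvm.tuner import XGBTuner

target = "llvm -mcpu=core-avx2"  # placeholder for my actual target

# Case 1: tune the batch_matmul template standalone.
# "batch_matmul.x86" is the template registered in topi/x86/batch_matmul.py;
# the tensor shapes here are placeholders.
task = autotvm.task.create(
    "batch_matmul.x86",
    args=(("TENSOR", (1, 128, 256), "float32"),
          ("TENSOR", (1, 512, 256), "float32")),
    target=target,
)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1, min_repeat_ms=100),
)

tuner = XGBTuner(task)
tuner.tune(
    n_trial=1000,
    measure_option=measure_option,
    callbacks=[autotvm.callback.log_to_file("standalone.log")],
)

# Case 2: extract the same op as a task from the whole network and tune it.
# `mod` and `params` come from a Relay frontend import (placeholders).
tasks = autotvm.task.extract_from_program(
    mod["main"], target=target, params=params,
    ops=(relay.op.get("nn.batch_matmul"),),
)
for t in tasks:
    tuner = XGBTuner(t)
    tuner.tune(
        n_trial=1000,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("network.log")],
    )
```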

Another strange issue: during tuning, the log shows that my new schedule runs about twice as fast as the old one, yet the end-to-end inference time after compilation is about 5% slower.
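This is roughly how I compile and measure inference time (again a sketch assuming a recent TVM API; `network.log`, `mod`, `params`, and the target are placeholders):

```python
import tvm
from tvm import autotvm, relay
from tvm.contrib import graph_executor

# Apply the tuned records at compile time; without this context
# the default TOPI schedule is used and the log is ignored.
with autotvm.apply_history_best("network.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm -mcpu=core-avx2", params=params)

dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))

# Time the whole graph, not just the tuned op.
ftimer = module.module.time_evaluator("run", dev, number=10, repeat=3)
print(ftimer())
```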