I tune it for 20 trials by default. Even so, the best runtime among the 20 trials (as reported in the log) is much better than the final runtime evaluation, so I'm fairly sure the best config from the tuning history is not being loaded for the evaluation; see the sketch after the log. Log:
No: 20 GFLOPS: 129.36/129.36 result: MeasureResult(costs=(0.00178956675,), error_no=0, all_cost=0.9837315082550049, timestamp=1583771181.213391) [('tile_f', [-1, 2, 16, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 64, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3509127
Finish loading 20 records
Best config:
[('tile_f', [-1, 2, 16, 4]), ('tile_y', [-1, 1, 7, 1]), ('tile_x', [-1, 1, 1, 7]), ('tile_rc', [-1, 64, 1]), ('tile_ry', [-1, 1, 1]), ('tile_rx', [-1, 1, 1]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],None,3509127
Finish loading 20 records
Cannot find config for target=cuda, workload=None. A fallback configuration is used, which may bring great performance regression.
Time cost of this operator: 0.048827
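The last warning line looks like the telltale sign: `workload=None` typically means the schedule was instantiated outside the `apply_history_best` dispatch context, or from a function that is not registered as an AutoTVM template, so none of the 20 loaded records can be matched and the fallback config is used instead of the best one. For reference, here is a minimal sketch of the evaluation pattern from TVM's `tune_conv2d_cuda` tutorial; `conv2d_no_batching`, its shape arguments, and the `conv2d.log` file name are the tutorial's placeholders, not my actual code, so substitute your own template and log file:

```python
import tvm
from tvm import autotvm

# Sketch of the evaluation step from the tune_conv2d_cuda tutorial.
# conv2d_no_batching is the @autotvm.template-decorated function that was
# tuned; "conv2d.log" is the tuning log. Both are placeholder names.
with autotvm.apply_history_best("conv2d.log"):
    with tvm.target.create("cuda"):  # newer TVM: tvm.target.Target("cuda")
        # The template must be re-invoked inside BOTH contexts so the
        # dispatcher can look up the tuned config by workload; building
        # the schedule outside them triggers the "Cannot find config ...
        # workload=None" warning and the slow fallback config.
        s, arg_bufs = conv2d_no_batching(N, H, W, CO, CI, KH, KW, strides, padding)
        func = tvm.build(s, arg_bufs)
```

If the evaluation builds the operator from a plain, undecorated function instead, the dispatcher has no workload key to match against the log records, which would be consistent with the `workload=None` in the warning above.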