AutoTVM vs AutoScheduler tuning metrics

Hello, I’m new to TVM and am exploring the tvmc application.

I’ve tuned a couple models using TVMC and have the following questions about the tuning metrics:

  1. When tuning using AutoTVM (tvmc tune with --enable-autoscheduler disabled), one of the metrics printed in the console is Current/Best GFLOPS per task. I don't understand how this metric is being measured or calculated. In the context of tuning a model, what is it describing?

  2. When using AutoTVM, the console data consists of Task/GFLOPS/Progress/Walltime. When using Ansor, the data includes ID/Latency/Speed/Trials, plus additional fields like GA iter, fail_ct, min/max score, etc. What are the differences and similarities between the data provided by these two tuners, or are these details covered somewhere in the documentation that I'm missing? Without this info, interpreting tuning runs can be pretty challenging, especially from an entry-level perspective.

  3. Finally, this question might stem from my lack of understanding of GFLOPS in the context of tuning a model, but the GFLOPS numbers that result from using Ansor are significantly lower than those from AutoTVM (when tuning the same model with the same tuning parameters). Does a higher GFLOPS value indicate a better- or worse-tuned schedule?

Thanks in advance!


I have the same questions. I'm also confused about the log format generated by the AutoScheduler; it's totally different from AutoTVM's.

  1. When tuning using AutoTVM (tvmc tune with --enable-autoscheduler disabled), one of the metrics printed in the console is Current/Best GFLOPS per task. I don't understand how this metric is being measured or calculated. In the context of tuning a model, what is it describing?

GFLOP/s is measured by actually running the compiled operator on the device. For each schedule candidate, we compile the operator, run it on the device to get the latency, and compute the throughput as the operator's FLOP count divided by that latency. Note that it is per task, and a model may contain several tasks, so AutoTVM needs to tune every unique task sequentially to achieve good end-to-end performance.
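To make the arithmetic concrete, here is a minimal sketch (the FLOP count and latency are made-up numbers, not real measurements, and this is not the actual AutoTVM measurement code):

```python
# Hypothetical numbers for a single conv2d task -- purely illustrative.
flop_count = 7.4e9            # total floating-point operations of the operator (FLOP)
measured_latency_s = 0.0123   # wall-clock time of one run on the device (seconds)

# Throughput reported in the tuning console, in GFLOP/s.
gflops = flop_count / measured_latency_s / 1e9
print(f"{gflops:.2f} GFLOPS")  # higher means this schedule candidate runs faster
```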

  2. When using AutoTVM, the console data consists of Task/GFLOPS/Progress/Walltime. When using Ansor, the data includes ID/Latency/Speed/Trials, plus additional fields like GA iter, fail_ct, min/max score, etc. What are the differences and similarities between the data provided by these two tuners, or are these details covered somewhere in the documentation that I'm missing? Without this info, interpreting tuning runs can be pretty challenging, especially from an entry-level perspective.

They simply use different approaches. Ansor uses random sampling plus evolutionary search to find the best schedule: "GA iter" is the current iteration of the evolutionary (genetic algorithm) search, "fail_ct" counts the invalid schedules explored in that iteration, and the min/max scores are the minimum/maximum schedule quality estimated by the performance cost model.
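As a rough mental model, here is a deliberately simplified, hypothetical sketch of one search round (not Ansor's actual implementation; `mutate`, `is_valid`, and `cost_model` stand in for its internal components):

```python
def evolutionary_round(population, mutate, is_valid, cost_model):
    """One simplified GA iteration: mutate candidates, score them, count failures."""
    fail_ct = 0
    scored = []
    for sched in population:
        candidate = mutate(sched)      # perturb an existing schedule
        if not is_valid(candidate):    # e.g. violates a hardware constraint
            fail_ct += 1
            continue
        scored.append((cost_model(candidate), candidate))  # estimated quality, no real run

    if scored:
        scores = [s for s, _ in scored]
        print(f"fail_ct={fail_ct}  min score={min(scores):.2f}  max score={max(scores):.2f}")

    # Keep the most promising candidates; only the top ones are later
    # compiled and actually measured on the device.
    scored.sort(key=lambda x: x[0], reverse=True)
    return [c for _, c in scored]
```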

  3. Finally, this question might stem from my lack of understanding of GFLOPS in the context of tuning a model, but the GFLOPS numbers that result from using Ansor are significantly lower than those from AutoTVM (when tuning the same model with the same tuning parameters). Does a higher GFLOPS value indicate a better- or worse-tuned schedule?

Higher GFLOP/s does indicate better performance. Did you compare the end-to-end model performance after tuning and find that the model tuned by Ansor is worse than the one tuned by AutoTVM? There are legitimate reasons why you might see lower GFLOPS in Ansor than in AutoTVM when looking at a single task, so you would need to provide more information for people to help dig into the root cause. For example, the tasks extracted by Ansor and AutoTVM are different, so it is not meaningful to simply compare the GFLOP/s of, say, the first task from both frameworks. The number of tuning trials also affects the GFLOP/s per task, because Ansor uses a task scheduler to prioritize important tasks.
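If you want a fair comparison, benchmark the whole compiled model instead of per-task GFLOP/s. A minimal sketch with the Python API, assuming you already have a compiled GraphModule `module` and a device handle `dev` for each tuned build (`time_evaluator` is TVM's standard benchmarking helper):

```python
import numpy as np

# `module` is a tvm.contrib.graph_executor.GraphModule built from the tuned model
# (build one with the AutoTVM log and one with the Ansor log, then compare).
ftimer = module.module.time_evaluator("run", dev, number=10, repeat=3)
latencies_ms = np.array(ftimer().results) * 1000  # per-repeat mean latency in milliseconds

print("Mean inference time: %.2f ms (std %.2f ms)"
      % (np.mean(latencies_ms), np.std(latencies_ms)))
```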


Thanks for this detailed reply; it was very helpful. Just to clarify, when you refer to a "schedule candidate" or the "best schedule", you're referring to the configuration of a specific operator (or task) on the hardware target, correct?

Roughly speaking, yes. Each line in the tuning log represents a schedule configuration for an operator/task, and AutoTVM/Ansor can decode and apply that configuration to the corresponding operator/task during compilation.
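For reference, this is roughly how those logged configurations get applied at compile time with the Python API (a sketch assuming `mod`, `params`, and `target` come from a model you have already imported, and that the record file names are placeholders):

```python
import tvm
from tvm import relay, autotvm, auto_scheduler

# AutoTVM: pick the best record per task from the log and apply it during build.
with autotvm.apply_history_best("autotvm_records.json"):
    with tvm.transform.PassContext(opt_level=3):
        lib_autotvm = relay.build(mod, target=target, params=params)

# Ansor / auto-scheduler: same idea, but the build must also be told
# to use the auto-scheduler generated schedules.
with auto_scheduler.ApplyHistoryBest("ansor_records.json"):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib_ansor = relay.build(mod, target=target, params=params)
```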