Ansor - number of tuning trials

I am tuning and compiling a model with a number of different targets (llvm, metal, opencl) and just came across this useful info about the number of tuning trials in this article:

> You can set it to a small number (e.g., 200) for a fast demonstrative run. In practice, we recommend setting it around 800 * len(tasks), which is typically enough for the search to converge.

My model is quite large - it has ~72 tasks, which means the suggested number of trials would be 57600.

A tuning run with 100 trials took 2.8 hours to complete, so extrapolating linearly, a run with 57,600 trials would take about 67 days!
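For what it's worth, here's that back-of-the-envelope arithmetic as a quick script (the 2.8 hours is just what I measured; assuming tuning time scales linearly with the number of trials):

```python
# Rough estimate of total tuning time, extrapolating linearly
# from one observed run (assumes time scales linearly with trials).
num_tasks = 72
trials_per_task = 800                  # heuristic from the Ansor tutorial
total_trials = trials_per_task * num_tasks

observed_trials = 100                  # my measured run
observed_hours = 2.8

estimated_hours = total_trials / observed_trials * observed_hours
print(total_trials)                    # 57600
print(round(estimated_hours / 24, 1))  # 67.2 days
```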

I know I can parallelize this process but I’m wondering if the suggested number of trials is really correct? If so, is the run-time speed improvement gained by tuning predictable? Can I expect a certain performance gain from any given number of tuning trials?

I’m trying to pick a number that gets me the biggest bang for my buck. Thank you! :slight_smile:

The suggested number makes sense to me. It might even need to be much higher for GPUs. What surprises me is that you need around 3 hours for just 100 trials!? Usually 100 trials take just a few minutes, even on an edge device…

I’m not sure what information you need to help me, but yes, it takes a long time to tune! I’m on an M1 Macbook Pro and the tuning session in question was for the metal target using Ansor. Maybe it’s the input to the model? It has a very large (to me) input of around 1248x832.

What other factors affect the tuning time that I might play around with?

The solutions I could propose:

  1. Use the task scheduler to tune all 72 tasks together with a 20k-trial budget. Although that averages out to ~277 trials per task, the task scheduler will prioritize important tasks and allocate more trials to them. After the 20k trials, evaluate the end-to-end latency; if you aren’t happy with it, you can continue tuning for as many more trials as you want (see Other Tips 3).

  2. If the time for evaluating 64 trials is really the bottleneck, first verify whether it is caused by long compilation times on the Mac CPU. If so, consider using another, more powerful CPU for compilation and using the Mac only as a remote measurement device, just like the ARM CPU in this tutorial: Auto-scheduling a Neural Network for ARM CPU — tvm 0.9.dev0 documentation
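A minimal sketch of option 1 using the standard `auto_scheduler` workflow (here `mod`, `params`, and `target` come from your own Relay model, the log-file name is made up, and the import is guarded so the sketch is harmless without TVM installed):

```python
# Sketch of tuning all tasks with Ansor's task scheduler (option 1 above).
try:
    from tvm import auto_scheduler
except ImportError:
    auto_scheduler = None  # TVM not installed; sketch only

NUM_TRIALS = 20000  # ~277 trials/task for 72 tasks, allocated by importance
LOG_FILE = "metal_tuning.json"  # hypothetical log-file name

def tune(mod, params, target):
    # Extract the tuning tasks and their weights from the model
    tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
    # Passing load_log_file lets a later run resume from the same log
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=LOG_FILE)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=NUM_TRIALS,
        measure_callbacks=[auto_scheduler.RecordToFile(LOG_FILE)],
    )
    tuner.tune(tune_option)
```

Because all records land in one log file, re-running `tune` with a fresh trial budget continues the search rather than starting over.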

100 trials / 2.8 hours is not expected. In addition to Cody’s suggestions, you can also tweak the arguments of RPCRunner/LocalRunner to accelerate measurement.
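For example, something like this (a sketch only; the exact numbers are knobs to experiment with, not recommendations, and the import is guarded so it does nothing without TVM):

```python
# Runner knobs that directly affect measurement time per batch of trials.
try:
    from tvm import auto_scheduler
except ImportError:
    auto_scheduler = None  # sketch only if TVM isn't installed

RUNNER_KWARGS = dict(
    timeout=10,         # seconds before a slow candidate build/run is killed
    number=3,           # runs averaged into one measurement
    repeat=1,           # how many measurements per candidate
    min_repeat_ms=100,  # minimum wall time measured per repeat
)

def make_runner():
    # LocalRunner measures on this machine; swap in auto_scheduler.RPCRunner
    # to measure on a remote device instead
    return auto_scheduler.LocalRunner(**RUNNER_KWARGS)
```

Lowering `timeout`, `repeat`, or `min_repeat_ms` trades some measurement accuracy for a faster tuning loop.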