[auto-tvm] What does a cost model do in auto-tuning?

Hi all, I wonder what the output of the XGBoost cost model is actually used for.

After we generate a set of configurations, we send them to the cost model and get the predicted FLOPS for each config. After that, what do we do? Do we still need to benchmark these configurations on the real machine?

So what exactly does the cost model do, and how does it guide the sampling of the next batch?


I'm not an expert in AutoTVM, but I have used the MetaSchedule quite a bit, and I think the ideas are the same, so let me attempt an answer here. As far as I understand, the trained cost model is a fast way to estimate the FLOPS of each schedule config, while running the task on a real machine is the slow way.

So the cost model is used to score lots of schedule configurations, and the schedules with the top-k predicted FLOPS are then run on the actual machine.

Running every schedule config on the machine is impractical when the search space for a full model can contain hundreds of millions to billions of configs. So the cost model is a way to reduce tuning time and to guide the search in the right direction in the schedule space.
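To make the flow above concrete, here is a minimal sketch of cost-model-guided candidate filtering. All names (`select_top_k`, `fake_cost_model`, the `tile` knob) are illustrative assumptions, not AutoTVM's actual API; a real tuner would use a trained XGBoost model in place of the stand-in scorer.

```python
# Sketch: score every candidate with the cheap cost model, then
# benchmark only the top-k on real hardware.

def select_top_k(configs, score_fn, k):
    """Keep the k configs with the highest predicted FLOPS."""
    return sorted(configs, key=score_fn, reverse=True)[:k]

def fake_cost_model(config):
    # Stand-in for a trained model: here we simply pretend larger
    # tiles are predicted to be faster (an assumption for the demo).
    return config["tile"]

candidates = [{"tile": t} for t in (4, 8, 16, 32)]
to_benchmark = select_top_k(candidates, fake_cost_model, k=2)
# Only these k configs would then be measured on the real machine.
```

The point is that the model is consulted for every candidate, but the expensive hardware measurement happens only for the shortlist.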

For details specific to the AutoTVM cost model, you could check out Section 3 of https://arxiv.org/pdf/1805.08166.pdf


You may imagine the cost model as a scoring function that assigns a score to each schedule candidate in the space of all possible schedules of a TensorIR PrimFunc, to help the search algorithm explore the space more efficiently and effectively.

First, let's consider a case where the scheduling space (or, as we sometimes say interchangeably, the design space of schedules) is extremely small. In this case, the search algorithm could simply enumerate every single point (every schedule) in the space exhaustively, and there is no need for a scoring function :slight_smile:

Things get interesting when the space grows larger. For conv variants, say, the space can exceed 10^18 points, and exhaustive search immediately becomes infeasible. In this case, the search algorithm needs to be "smart enough" to find the optimal point in the space.

Next, how do we make the search algorithm smart? Imagine we hypothetically had an oracle that gives the exact running time of each candidate in the design space; then, by parallelizing the search algorithm, it becomes possible to explore a significant portion of the space and find an approximately optimal candidate.

Realistically, however, such an oracle doesn't exist, and this is where our learned cost model comes in. It predicts a relative running time (or, more accurately, a normalized throughput) based on the data collected throughout the search.
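As a toy illustration of this "learn from data collected during the search" loop, here is a deliberately simplified stand-in cost model. Real systems use XGBoost or a neural model over extracted schedule features; the 1-nearest-neighbor predictor below is purely an assumption for the sketch, chosen to keep the data flow visible.

```python
# Toy "learned cost model": predicts the relative throughput of a new
# candidate from (feature, throughput) pairs measured so far.
measured = []  # list of (feature_vector, normalized_throughput)

def record(features, throughput):
    # Called after each real hardware measurement to grow the dataset.
    measured.append((features, throughput))

def predict(features):
    if not measured:
        return 0.0  # no data yet: the model knows nothing
    # Score = throughput of the closest already-measured candidate.
    def dist(entry):
        f, _ = entry
        return sum((a - b) ** 2 for a, b in zip(f, features))
    _, throughput = min(measured, key=dist)
    return throughput

record([1.0, 2.0], 0.8)
record([5.0, 5.0], 0.2)
score = predict([1.1, 2.1])  # close to the first measured point
```

Note that only the relative ordering of scores matters for guiding the search, which is why a normalized throughput is a fine prediction target.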

You may also refer to the paper on our latest-generation tuning system, MetaSchedule, for details about what the search space realistically looks like: [2205.13603] Tensor Program Optimization with Probabilistic Programs. The evolutionary search algorithm is also described in the paper, but I suppose it's not too different from ordinary evolutionary search.


Thank you! I'm reading the papers and have learned that the cost model acts something like branch pruning in the search process. It doesn't need to predict the real speed of the lowered function; giving a relative order is enough.

Thank you! The explanation is detailed and easy to understand.

Now I'm clear about the cost model: it just helps find the top-k candidates to benchmark on a real machine.

I think the point that confuses me is: how do we find candidates to send to the cost model for prediction in the first place?

I read the TVM paper and found the following statement:

The explorer starts with random configurations, and, at each step, randomly walks to a nearby configuration. This transition is successful if cost decreases as predicted by the cost model. It is likely to fail (reject) if the target configuration has a higher cost.

My understanding is: randomly pick the next batch and send it to the cost model; then keep all the candidates that get a higher score (which means lower time cost), and accept some of the lower-scoring ones with a certain probability. As the search goes on, that acceptance probability is reduced.

Is that right? Hoping for your comments!
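The random walk quoted from the paper can be sketched as a simulated-annealing-style acceptance rule driven by the cost model's predictions. All names below are illustrative, not TVM's actual API; the exponential acceptance probability is the standard annealing formulation, which matches the "likely to fail if the target has a higher cost" behavior described above.

```python
import math
import random

def anneal_step(current, neighbor_fn, predicted_cost, temperature):
    """One random-walk step guided by the cost model (lower cost = better)."""
    candidate = neighbor_fn(current)
    delta = predicted_cost(candidate) - predicted_cost(current)
    if delta < 0:
        return candidate  # predicted improvement: always move
    # Predicted regression: still move occasionally, with probability
    # exp(-delta / T); lowering T over time reduces this probability.
    if random.random() < math.exp(-delta / temperature):
        return candidate
    return current
```

In a full explorer this step would run in a loop while the temperature is gradually decreased, so late in the search almost only predicted improvements are accepted.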

This is a classic heuristic called epsilon-greedy exploration. Say epsilon is a small fraction like 20%; then 80% of the next batch is chosen according to the cost model, i.e. candidates the cost model believes will run fast, while the rest are chosen completely randomly.

The reason for having a non-zero epsilon is that the cost model is not an oracle: its predictions can be very wrong at times and are usually quite biased. With a zero epsilon, the search would only prefer the candidates the cost model initially likes, and would never explore the rest of the space, where good candidates may have been mistakenly categorized by the cost model as bad ones.
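The epsilon-greedy batch selection described here can be sketched in a few lines. The function name and parameters are illustrative assumptions, not MetaSchedule's actual implementation:

```python
import random

def pick_next_batch(candidates, predicted_score, batch_size, epsilon=0.2):
    """Epsilon-greedy selection of the next batch to measure."""
    # (1 - epsilon) of the batch: candidates the cost model likes best.
    n_greedy = int(batch_size * (1 - epsilon))
    ranked = sorted(candidates, key=predicted_score, reverse=True)
    greedy = ranked[:n_greedy]
    # The remaining epsilon fraction: uniformly random candidates, so
    # the search can escape the cost model's biases.
    rest = ranked[n_greedy:]
    explore = random.sample(rest, batch_size - n_greedy)
    return greedy + explore

batch = pick_next_batch(list(range(100)), lambda x: x, batch_size=10)
```

With the defaults above, 8 of the 10 selected candidates are the model's favorites and 2 are drawn at random from the remainder.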

Thank you, I think I understand the importance of keeping some exploration probability alongside the cost model. I still wonder: regarding "80% of the next batch are chosen according to the cost model", is this done by selecting configurations near the current step and sending them to the cost model to decide whether to adopt them or not?

In MetaSchedule there is a parameter eps_greedy that you can tweak.


:rose: :rose: :rose: :rose: Many thanks! I'll learn MetaSchedule later.

How can I tune my kernels by modifying the cost function so as to achieve the expected kernel performance?

Have you found a solution yet? Thanks!