[RFC] FamilySeer: A new search method for Auto-scheduler

Motivation:

Auto-scheduler (Ansor) uses code sketches and optimization rules to generate a large search space. The search space defined by Ansor offers great optimization opportunities, so the search quality and the search efficiency are determined by how we explore that space.

Ansor utilizes an improved cost model and a task scheduler to help explore the search space. The cost model analyzes and finds high-performance code transformations in the search space, and the task scheduler allocates the time budget across different computation graphs. However, we find several drawbacks to this approach:

  1. The accuracy of the cost model determines the search quality, but Ansor uses a single monolithic cost model to predict across different computation graphs (subgraphs), resulting in an accuracy loss during tuning.

  2. The task scheduler allocates most of the time budget to the subgraphs with the most improvement potential (i.e., those with the highest latency). This approach works well at the beginning of autotuning. However, once those subgraphs gradually reach their peak performance given an adequate time budget, the other subgraphs have little time budget left to reach their own peak performance.

In the end, the search process can take dozens of hours. This motivates us to find a better way to explore the search space.

Solution:

We implement FamilySeer, a new search method that improves the search process of Ansor.

An overview of FamilySeer is shown in the figure above. FamilySeer has two key features:

  1. FamilySeer analyzes the similarity between subgraphs, which can be measured by how accurately a cost model trained on one subgraph predicts the performance of another. We then group similar subgraphs into a subgraph family. The subgraphs inside a family share the same cost model, which improves its accuracy (see the sketch after this list).
  2. When a cost model is updated during the search, we can foresee the best candidates for the other subgraphs inside the same family using this now highly accurate cost model. The cost model pinpoints the high-performance code transformations directly, instead of measuring every code transformation on hardware, so less costly tuning has to be done.
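
To make these two features concrete, here is a minimal Python sketch of the idea. It is illustrative only: cross_accuracy (how well a cost model trained on one subgraph predicts the performance of another) and cost_model.predict are assumed helpers, not the actual FamilySeer or TVM APIs.

# Hypothetical sketch of FamilySeer-style family grouping and foresee
# tuning (not the actual TVM implementation).

def group_into_families(subgraphs, cross_accuracy, threshold=0.7):
    """Greedily merge subgraphs whose cost models transfer well."""
    families = []
    for task in subgraphs:
        placed = False
        for family in families:
            # Join a family if its representative's cost model is
            # accurate enough on this subgraph.
            if cross_accuracy(family[0], task) >= threshold:
                family.append(task)
                placed = True
                break
        if not placed:
            families.append([task])  # start a new family
    return families

def foresee_best(candidates, cost_model, top_k=10):
    """Rank candidate code transformations with the shared family cost
    model instead of measuring every candidate on hardware."""
    return sorted(candidates, key=cost_model.predict, reverse=True)[:top_k]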

We evaluate our method against Ansor in terms of both search efficiency and search quality. Search efficiency can be described as how much time is spent before reaching a given latency, and search quality is directly related to the final latency. We choose several representative deep learning models (ResNet, MobileNet, BERT, RoBERTa, GPT-2, and Vision Transformer) and evaluate them on both CPU and GPU. We achieve average speedups of 2.49x and 3.04x in search efficiency on CPU and GPU, respectively. We also achieve an improvement of up to 1.14x in search quality.

More details and experimental results can be found in the paper.

How to use:

The code is based on TVM 0.8 (commit 64c1b79), and we are working on porting it to master. We integrated our method into Ansor, so you can activate our approach by simply changing the search_policy parameter:

tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=900 * (len(tasks) + 1),  # change this to 20000 to achieve the best performance
    runner=ctx_list[0].runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
# Change the search_policy from "sketch.xgb" to "sketch.xgb.family_op"
tuner.tune(tune_option, search_policy="sketch.xgb.family_op")

Comments are appreciated.

Thanks for the proposal! We are very interested in improving search algorithms and cost models. I was very excited to read about FamilySeer a week ago.

In terms of subgraph similarity, AFAIK @comaniac and @zxybazh have been working independently on this topic to improve overall search time; on the cost model side, @merrymercy and I have been working on a project called TenSet to train a more effective cost model, where we do have task embeddings, etc. :slight_smile:

Looking forward to close collaboration on your PRs

Hi @shanjunZ, thank you for the proposal! It is an absolutely important problem, and I have a few questions from reading about FamilySeer.

  • How many cost models does FamilySeer usually construct?
  • Ideally, we may want to transfer knowledge from tuning experience across tuning jobs, as we do with TenSet or Lorien. However, since FamilySeer builds multiple cost models each time, it is not clear to me how we would do this or how easy it would be.
  • I’m wondering if you tried a single cost model for your Foresee tuning and compared it with your multi-cost-model approach. If the problem with a single cost model is biased training data, what happens if we use more balanced data? I think we may be able to balance it by adjusting the search as you did with Foresee tuning. Would the multi-cost-model approach still win in this case?
  • How much budget did you use for Section 6.3?

Thank you, and I look forward to learning more about your work :slight_smile:

Thanks for your reply!

A: In our evaluated deep learning models, FamilySeer usually constructs fewer than eight cost models. For example, FamilySeer constructs five cost models for ResNet50_v1.

A: To transfer tuning knowledge in the form of pre-trained cost models, it is straightforward for FamilySeer to provide such cost models on a per-family basis. When a tuning task arrives, FamilySeer identifies the family each subgraph belongs to and applies the corresponding pre-trained cost model to improve both search quality and efficiency. Beyond that, instead of training the cost models from scratch, FamilySeer can work with pre-trained cost models and improve their accuracy when adapting to a new model. Our experimental results (not yet updated in the paper) show that when using FamilySeer to adapt the TenSet pre-trained cost model to the ViT-Huge model, FamilySeer achieves better tuning accuracy than TenSet.
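
A rough sketch of how this transfer workflow could look (the helpers identify_family and load_pretrained_model are hypothetical, not actual TVM or FamilySeer APIs):

# Hypothetical sketch of reusing pre-trained per-family cost models
# (illustrative only; the helpers below are assumed, not real APIs).

def assign_pretrained_models(tasks, identify_family, load_pretrained_model):
    """Map each tuning task to the pre-trained cost model of its family."""
    models = {}
    for task in tasks:
        family_id = identify_family(task)  # match the subgraph to a known family
        if family_id not in models:
            # Reuse the family's pre-trained model instead of training from
            # scratch; it can then be fine-tuned as new measurements arrive.
            models[family_id] = load_pretrained_model(family_id)
    return models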

A: In Figure 1 of our paper, we conduct an experiment on cost model accuracy with the BERT-Large model. BERT-Large can be partitioned into 11 subgraphs, and we select 256 samples (candidates) for each subgraph to build the single (monolithic) cost model (256 × 11 training samples in total). In this experiment, each subgraph has the same number of training samples, which means each subgraph receives balanced data. Even in this setting, the experimental results show that multiple cost models still achieve better accuracy than a single cost model, which justifies our adoption of multiple cost models for different subgraph families.

A: In Section 6.3, we use the time budget recommended by Ansor, which suggests 900 trials per subgraph on GPU and 800 trials per subgraph on CPU.

Good job! I wonder if your work has been upstreamed? Thanks