Dip in GFLOPS after stopping and continuing experiment

This plot shows GFLOPS for two tasks over the iterations of RPC tuning in Ansor. The red seams mark the points at which the experiment was killed and later continued by passing a non-null load_log_file argument that points to the logged records of the previously killed tuning experiment. Here is the relevant code snippet:

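import shutil
from pathlib import Path

from tvm import auto_scheduler

if load_dir is None:
    load_records_str = None
else:
    # Resuming: locate the records of the previously killed run.
    load_dir = Path(load_dir)
    load_records_str = str((load_dir / 'records.json').resolve())
    # Seed the new log with the old records so it holds the full history.
    if save_records_str:
        shutil.copyfile(load_records_str, save_records_str)

# Log new measurements only when an output path was given.
if save_records_str:
    measure_callbacks = [auto_scheduler.RecordToFile(save_records_str)]
else:
    measure_callbacks = []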

print("Begin tuning...")
tuner = auto_scheduler.TaskScheduler(tasks, task_weights,
                                     load_log_file=load_records_str,
                                     strategy=scheduling_strategy)

The results sometimes make it look like Ansor is starting tuning from scratch when I continue from an existing log file. What could cause this? Does it indicate an unquiesced system, and what variability would correlate with the points where an experiment is stopped and continued?

I found the problem. The docstrings of TaskScheduler.__init__(…) and make_search_policies(…) say that cost models will be restored according to load_log_file, but this only happens if the search_policy passed to TaskScheduler.tune(…) is a string. If it is a list of search policies (List[SearchPolicy]), the cost model must be restored from the log file / model file by hand before it is passed to each SketchPolicy instance.

Adding a call to cost_model.load(…) or cost_model.update_from_file(…) should fix this.
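For anyone hitting the same thing, here is a minimal sketch of that fix, assuming an XGBModel cost model shared across all tasks. The helper name build_restored_policies and its ansor parameter are hypothetical (the parameter only lets the function be exercised without a TVM install). The PreloadMeasuredStates callback is my assumption about what is needed to mirror the string code path, which also restores the measured states of each policy.

```python
def build_restored_policies(tasks, log_file, ansor=None):
    """Build one SketchPolicy per task, sharing a cost model restored from log_file.

    `ansor` defaults to tvm.auto_scheduler; it is a parameter only so the
    sketch can be tried without TVM installed (hypothetical convenience).
    """
    if ansor is None:
        from tvm import auto_scheduler as ansor

    cost_model = ansor.XGBModel()
    init_search_callbacks = None
    if log_file:
        # Retrain the cost model on the records of the killed run.
        cost_model.update_from_file(log_file)
        # Assumption: also preload already-measured states, as the
        # string-policy path does, so they are not measured again.
        init_search_callbacks = [ansor.PreloadMeasuredStates(log_file)]

    return [
        ansor.SketchPolicy(
            task,
            program_cost_model=cost_model,
            init_search_callbacks=init_search_callbacks,
        )
        for task in tasks
    ]
```

With the snippet from the question, this would be used roughly as policies = build_restored_policies(tasks, load_records_str), followed by tuner.tune(tune_option, search_policy=policies), where tune_option stands for whatever TuningOptions object the script already builds.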
