[Ansor] Task scheduler spending too much trial on dead-end tasks

@merrymercy @comaniac

I’m still looking at auto scheduling mask rcnn workloads (now with nvptx targets, since I found that dynamic ops run much faster if compiled with nvptx). There are two tasks that are much more costly than others (task 34 and 35, see below).

The problem is the performance on these tasks don’t improve after certain trials (say 1000), but the task scheduler is still trying too many trials on them, even though there is no improvement happening for more than 1000 trials.

Is there a way to let the task scheduler “give up” on certain tasks after it sees no improvement after certain trials, and on focus more on other tasks?

|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |                                                                                                                            
-------------------------------------------------                                                                                                                            
|    0 |        0.001 |           0.00 |     64 |                                                                                                                            
|    1 |        0.001 |           0.00 |     64 |                                                                                                                            
|    2 |        0.001 |           0.00 |     64 |                                                                                                                            
|    3 |        0.001 |           0.00 |     64 |                                                                                                                            
|    4 |        0.001 |           0.00 |     64 |                                                                                                                            
|    5 |        0.888 |        3413.29 |    384 |                                                                                                                            
|    6 |        0.246 |          93.53 |    128 |                                                                                                                            
|    7 |        0.142 |        2347.76 |    128 |                                                                                                                            
|    8 |        0.606 |        4875.03 |    576 |                                                                                                                            
|    9 |        0.686 |        1924.58 |    256 |                                                                                                                            
|   10 |        0.667 |        2010.75 |    640 |                                                                                                                            
|   11 |        0.490 |        2687.18 |    320 |                                                                                                                            
|   12 |        0.655 |        4017.95 |    320 |                                                                                                                            
|   13 |        0.760 |        3882.47 |    320 |                                                                                                                            
|   14 |        1.480 |        1774.72 |    512 |                                                                                                                            
|   15 |        0.438 |        3024.36 |    576 |                                                                                                                            
|   16 |        0.442 |        2968.30 |    448 |                                                                                                                            
|   17 |        0.501 |        5886.45 |    512 |                                                                                                                            
|   18 |        0.677 |        3881.25 |    256 |                                                                                                                            
|   19 |        0.784 |        3764.09 |    320 |                                                                                                                            
|   20 |        0.893 |        2939.17 |    320 |                                                                                                                            
|   21 |        0.404 |        3259.75 |    832 |                                                                                                                            
|   22 |        0.291 |        4506.00 |    576 |                                                                                                                            
|   23 |        0.563 |        5238.59 |   1088 |                                                                                                                            
|   24 |        0.628 |        4175.38 |    320 |                                                                                                                            
|   25 |        0.922 |        3198.74 |    320 |                                                                                                                            
|   26 |        0.989 |        2651.13 |    384 |                                                                                                                            
|   27 |        0.394 |        3334.79 |    384 |
|   28 |        0.448 |        2924.52 |    320 |
|   29 |        0.597 |        4936.91 |    448 |
|   30 |        0.271 |        2415.55 |    128 |
|   31 |        0.667 |        1966.57 |    256 |
|   32 |        0.980 |        2678.81 |    320 |
|   33 |        1.528 |        3443.95 |    576 |
|   34 |        8.396 |        5621.50 |   2624 |
|   35 |        7.041 |        6704.06 |   2240 |
|   36 |        0.211 |        1166.61 |    128 |
|   37 |        2.262 |        5216.62 |    768 |
|   38 |        2.292 |        5149.95 |    768 |
|   39 |        0.056 |        1108.73 |     64 |
|   40 |        0.646 |        4564.29 |    320 |
|   41 |        0.024 |         654.22 |     64 |
|   42 |        0.196 |        3763.42 |    128 |
|   43 |        0.243 |        3039.72 |    128 |
|   44 |        0.009 |         417.38 |     64 |
|   45 |        0.002 |          20.88 |     64 |
|   46 |        0.174 |        1145.47 |     64 |
|   47 |        0.005 |         200.02 |     64 |
|   48 |        0.199 |         309.35 |     64 |
|   49 |        0.059 |         261.60 |     64 |
|   50 |        0.019 |         205.76 |     64 |
|   51 |        0.005 |         197.48 |     64 |
|   52 |        0.005 |          48.46 |     64 |
|   53 |        0.001 |           0.00 |     64 |
-------------------------------------------------
Estimated total latency: 55.204 ms      Trials: 15553   Used time : 36529 s     Next ID: 34
2 Likes

I can understand why task scheduler spends time on task 34 and 35, because their latencies are far worse than others lol

As a result, its behavior makes sense to me: Task scheduler spends most of the time on the performance bottleneck tasks, but it will also spend some time on other tasks. Besides that, I’m wondering how many improvements you could get if task 34 and 35 are marked as dead. If the performance is dominated by these two tasks, tuning other tasks probably may not help the end-to-end performance.

In terms of the solution to your question, task scheduler actually has a mechanism to give up certain tasks, but it currently only completely get rid of the task that has no more space to be explored:

So we might be making the early stopping criteria work on the task-level, so that the early stopped tasks will be added to the dead_tasks and won’t be touched anymore. As for the quick workarund, you could hack the dead_tasks by adding task 34 and 35 and see if there’s any improvemnt.

1 Like

Yes, I don’t expect to get a big end to end performance improvement by giving up task 34 and 35. Right now I’m comparing auto scheduler performance on CUDA vs NVPTX target, and I want to squeeze every milliseconds I can.

Looking at the how task scheduler behaves, intuitively I think it is much more worthwhile to spend trials and other tasks than endlessly trying 34 and 35. So I think it would make sense to have a “give-up” parameter that allows me to ignore dead-end tasks (triggered after no improvement for certain trials, for example).

Accordingly, we could extend the early_stopping in TuningOptions to be:

[A1] Extend to early_stopping=(scope, trials), where the scope could be “global” (default) or “task”, and trials is an integer applied to the scope.

[A2] Have two configs: early_stopping and early_stopping_task. The latter one adds the early stopped tasks to dead_tasks.

[A3] Change early_stopping to a callback function (default lambda _: False):

for idx, task in enumerate(tasks):
    if idx not in self.dead_task and early_stopping(task_cost):
        self.dead_tasks.add(idx)

@merrymercy do you have any suggestion?

1 Like

I tried to remove the task 34 and 35, but I’ve hit the issue of non deterministic task extraction (this uses VM task extraction). The two tasks now have ids 29, 30.

@comaniac We should definitely make the task extraction deterministic.

|   29 |        8.309 |        5679.90 |   3008 |                                                                                                                            
|   30 |        7.041 |        6704.06 |   2560 |       

Hmm I didn’t notice this problem before. Does it because the VM executing order is non-deterministic? If so then we should not rely on the task index here. We actually use workload key to match records, and I think this is the only place we use task index, because workload key is too long to be shown on the console.

For your workaround, you can use:

if task.workload_key == XXX:
    self.dead_tasks.add(idx)

I don’t know if non-determinism in VM is real, but I will see where it is coming from. This is bad not only for the task extraction.

1 Like

I think any of A1, A2 and A3 can work well.

PR filed.

1 Like