What to do if the auto-scheduling result is not good enough?

Hi, we found TVM usually works well on some popular CNN models (e.g. ResNet) using auto-scheduling. But it may also give us extremely bad results on other models (e.g. BERT / OCR / super-resolution) compared with TensorRT.

In those cases, we would also try to increase n_trials and see if it gives a better result. But if we are not lucky, we can only give up and switch to TensorRT.

I just want to know if there is anything else we can do about the auto-scheduling process, like different search policies / debugging / profiling …

First, you can check the printed log during the tuning process; it should print out the time cost and GFLOPS of each task. GFLOPS is an important metric to show whether a kernel is fast enough: you can compare it with the peak performance of your device.
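For example (with made-up numbers just to illustrate the check), divide the achieved speed from the log by the datasheet peak of your GPU:

```python
# Hypothetical values: the achieved speed of one task as reported in the tuning log,
# and the theoretical FP32 peak of the device taken from its datasheet.
achieved_gflops = 4341.36   # a task reported by the task scheduler
peak_gflops = 14000.0       # replace with your device's peak

efficiency = achieved_gflops / peak_gflops
print(f"kernel efficiency: {efficiency:.1%} of peak")   # ~31% in this made-up case
```

A task that ends up far below the peak is usually the one worth looking into further.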

Then, you can try GraphDebugRuntime (I’m not sure if it’s still called that, it’s just the debug mode of GraphRuntime), which will tell you the time cost of each subgraph of the whole network. Maybe the end-to-end performance is dragged down by some subgraph.
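In recent TVM versions this lives under tvm.contrib.debugger.debug_executor (older releases call it debug_runtime). A minimal sketch, assuming you already have a Relay mod / params and a CUDA target:

```python
import tvm
from tvm import relay
from tvm.contrib.debugger import debug_executor  # older TVM: debug_runtime

target = "cuda"
dev = tvm.cuda(0)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# The debug executor runs the graph node by node and records per-operator time,
# so you can see which subgraph dominates the end-to-end latency.
m = debug_executor.create(lib.get_graph_json(), lib.get_lib(), dev, dump_root="/tmp/tvmdbg")
m.set_input(**lib.get_params())
m.run()   # per-node timings are collected and dumped under dump_root
```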

Emm … There’s a known issue that the auto-scheduler currently does not support TensorCore, so compared with TRT this will be a big weakness.

:cry: :cry:

Thanks for your advice! But I still have some questions here.

First, I understand that we should check the time cost / GFLOPS of each task to see if it is a bottleneck. However, my real problem is what to do next? I guess simply doing another search with a larger n_trials won’t help at all.

For instance, from the log below, we can see that task 1 contributes the largest latency and tasks 0 and 6 got poor GFLOPS. What are the suggested steps if I want to optimize them?

|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
|    0 |        0.510 |        4337.21 |     64 |
|    1 |        4.869 |        9707.68 |    384 |
|    2 |        1.860 |        6354.48 |     64 |
|    3 |        0.449 |        6574.38 |     64 |
|    4 |        1.086 |        8162.22 |    896 |
|    5 |        0.832 |       10648.79 |   1344 |
|    6 |        0.032 |        4341.36 |     64 |
|    7 |        0.195 |        7579.80 |    448 |
|    8 |        0.246 |        9016.77 |    576 |
|    9 |        0.328 |        8993.97 |    832 |
|   10 |        0.469 |        7870.86 |   1344 |

Second, we found TVM works well on the ResNet fp16 version (it usually gives comparable or even better results compared with TensorRT). I’m wondering whether the TensorCore issue of the auto-scheduler that you mentioned affects all models or just some specific kinds of model structures?

BTW, I found some people add extra Relay transform code explicitly before tuning.

Is it a general method that we should try?
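Something like this is what I mean (just one example of the kind of pass I’ve seen, casting the model to fp16 before extracting the tuning tasks; the specific pass of course depends on the model):

```python
import tvm
from tvm import relay

# Hypothetical example of a pre-tuning Relay transform:
# convert the model to fp16 so that fp16 kernels get tuned.
with tvm.transform.PassContext(opt_level=3):
    mod = relay.transform.InferType()(mod)
    mod = relay.transform.ToMixedPrecision("float16")(mod)
```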

The TaskScheduler uses the “gradient” strategy by default, which distributes more tuning trials to some key tasks (for example, tasks 4, 5, and 10 here). If you want to spend more tuning trials on some specific tasks, you can modify the tasks & task_weights in:

tuner = auto_scheduler.TaskScheduler(tasks, task_weights)

directly, and use the “round-robin” strategy in the TaskScheduler.
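For example (a sketch; the task IDs, trial budget and log file name are just placeholders for your own run):

```python
from tvm import auto_scheduler

# Re-tune only the weak tasks (e.g. 0, 1, 6 in the table above) by switching
# the scheduler to the round-robin strategy, so each selected task gets an
# equal share of the trial budget.
weak_ids = [0, 1, 6]                                  # picked from the tuning log
sub_tasks = [tasks[i] for i in weak_ids]
sub_weights = [task_weights[i] for i in weak_ids]

tuner = auto_scheduler.TaskScheduler(sub_tasks, sub_weights, strategy="round-robin")
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=3000,                          # total trials over the selected tasks
    runner=auto_scheduler.LocalRunner(repeat=3, timeout=30),
    measure_callbacks=[auto_scheduler.RecordToFile("resume_tuning.json")],
)
tuner.tune(tune_option)
```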

There are some TensorCore schedules in AutoTVM (the default TOPI schedules), but the auto-scheduler cannot support TensorCore properly at this moment. So maybe just using AutoTVM can get better performance in this case.
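For reference, the AutoTVM flow looks roughly like this (a sketch; mod, params and target are your model as before, and the log file name is arbitrary):

```python
import tvm
from tvm import autotvm, relay

# Extract the tunable tasks from the Relay program and tune each one with XGBTuner.
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3, timeout=10),
)

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=min(1000, len(task.config_space)),
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("autotvm.log")],
    )

# Compile with the best configs found during tuning.
with autotvm.apply_history_best("autotvm.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```

This goes through the default TOPI schedules, so it can pick up the TensorCore ones mentioned above when their configs win.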