Auto-tuned model speed up issue

You are right, but the time-consuming log only contains the shape information. How can I compare their workloads?

Look in your graph/model definition; there you can match the op in the log with the respective op in the original model. Then note the parameters of the identified op (stride, dilation, etc. for convolution) and finally use those parameters and the input shape to identify the right task.


Thank you very much, but in the graph.json file I can only find the name of each op and its input ids. How can I know the parameters of the identified op (stride, dilation, etc. for convolution)?

Now we can match the name from the graph.json file with the name in the log, and we can get the task's parameters (stride, dilation, etc. for convolution) and the shape of the task from the log and from the extracted tasks.
So we can match the shape in the log against the tasks, but we get multiple matched tasks with the same shape. How do we utilize graph.json to determine which task I should choose? Best, Edward.

Match using stride, dilation, etc. as well; those parameters are listed in the workload of the task too.

Yes, it is in the workload, but not in the log or the graph.json.

There are two approaches in general for optimizing a model.

First, like what you did, profile the model using the debug runtime and identify the bottleneck layers. As you have experienced, mapping a built layer back to a tuning task is a bit vague. Unfortunately, there's no clear way, so you can only analyze your original model definition, as @gasgallo suggested.
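For reference, a minimal sketch of that profiling step, assuming graph, lib, and params come from relay.build(...) and that the debug graph runtime of your TVM version is importable at this path (the import path and create signature have changed between releases, so adjust as needed):

import tvm
from tvm.contrib.debugger import debug_runtime

# ctx and dump_root are placeholders; pick the device you actually built for.
ctx = tvm.gpu(0)
m = debug_runtime.create(graph, lib, ctx, dump_root="/tmp/tvmdbg")
m.set_input(**params)
m.run()  # prints a per-layer time breakdown, which is the "debugger log" discussed here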

Second, when you tune tasks, you should see the log like the following:

[1 / ?? Tasks] 110/5000 GFlop/s
... 

It clearly shows the best throughput each task has achieved, and it's easier for you to identify which task performs worse than others. If you didn't keep such a log but only the config JSON, you can first use autotvm.record.pick_best('all_config.json', 'best.json') to get the best configs along with their measurement results in best.json. Each line in best.json indicates the best config of one op. It looks like the following:

{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 448, 10, 10], "float32"], ["TENSOR", [384, 448, 3, 3], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 448, 10, 10, "float32"], [384, 448, 3, 3, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 611690, "t": "winograd", "c": null, "e": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 16, 2]], ["tile_x", "sp", [-1, 1, 16, 1]], ["tile_rc", "sp", [-1, 32]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]]}], "r": [[2.9757927989520185e-05], 0, 6.701982259750366, 1572418412.9522758], "v": 0.1}

By parsing this line, you can get the measurement runtime by np.mean(line['r'][0]). Then you can infer which op consumes the longest time.
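Putting the two steps together, here is a rough sketch (the file names all_config.json and best.json follow the example above, and the workload is taken from index 4 of the "i" field as in the sample line; verify against your own records):

import json
import numpy as np
from tvm import autotvm

# Keep only the best config per task, together with its measured runtime.
autotvm.record.pick_best('all_config.json', 'best.json')

records = []
with open('best.json') as f:
    for line in f:
        rec = json.loads(line)
        cost = np.mean(rec['r'][0])  # mean of the repeated measurements, in seconds
        workload = rec['i'][4]       # e.g. ["conv2d", [1, 448, 10, 10, "float32"], ...]
        records.append((cost, workload))

# The slowest op measured during tuning is the first candidate to re-tune.
for cost, workload in sorted(records, key=lambda x: x[0], reverse=True):
    print("%.3f ms  %s" % (cost * 1e3, workload))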

Thank you very much, comaniac, I will give it a try.

After auto-tuning, there are still performance regressions for some layers, so the speed is still very slow, but we don't know how to continue tuning. Is there any tutorial?


I found a strange thing: the op time cost in the debugger log is not the same as in best.json. Why is that? I think the sum of the time costs of the ops in the debugger log should equal the sum of the time costs of the tasks in best.json. Also, the matched tasks in the debugger log are not the same as in best.json.

What does “r” stand for? And why calculate the mean? Thanks for your patience and enthusiasm, comaniac.

  1. I think that's because the debugger shows a fused layer while the tuning log only shows a single op. However, I don't think op fusion will change the bottleneck op, so the bottleneck op you found in best.json should also be the bottleneck op in the debugger.

  2. "r" stands for "result". The reason we need to calculate the mean is that line['r'][0] is an array of multiple measurement results (when you set repeat > 1 in the tuning options).

I am really confused. In my setting, the max time cost in best.json is 1.14 ms, but in the debugger the max time cost is more than 5 ms. Why is that? It is so strange.

As I said, the entry the debugger reports includes not only conv2d but also bias add, ReLU and other fused ops, so it definitely takes longer than what you saw in best.json, which only measured conv2d.

The log and the JSON file are in log vs best.json.
tvm.log is the debugger log, and task_extract.log contains the tasks extracted by the tuning code.

Thanks, but what confuses me is that the most time-consuming task in best.json is not the one in the debugger when I match by shape. For example, for the most time-consuming op, matching by task.args, I find that the index of this task is 16 after extracting tasks into tasks_extract.log, but when I search for [1, 190, 40, 40] in best.json it appears on line 4 and line 33. Am I missing something? How can I determine the task_idx that I should auto-tune?

So can you tell me the format of the best.json file? What does each element mean? And is the order in best.json the same as the order when extracting tasks?

The line number in best.json doesn't indicate the task index. The order in that file is just a random hash order produced by Python. When you apply the history best, AutoTVM matches 1) shapes and 2) attributes such as strides and padding to decide which config to apply.

Accordingly, the way you determine the task index is still matching their shapes and attributes.
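As a rough illustration of that matching, assuming the tasks were extracted with autotvm.task.extract_from_program (so each task exposes a .workload tuple) and that the workload sits at index 4 of the "i" field as in the sample line above:

import json

def to_lists(x):
    # Recursively convert tuples to lists so an in-memory task workload can be
    # compared against the JSON-decoded workload from best.json.
    if isinstance(x, (list, tuple)):
        return [to_lists(v) for v in x]
    return x

# Workloads of the best records measured during tuning.
best_workloads = []
with open('best.json') as f:
    for line in f:
        rec = json.loads(line)
        best_workloads.append(rec['i'][4])

# `tasks` is the list returned by autotvm.task.extract_from_program(...).
for idx, tsk in enumerate(tasks):
    if to_lists(tsk.workload) in best_workloads:
        print("task %d matches a record in best.json: %s" % (idx, str(tsk.workload)))

Because the full workload includes strides, padding, and dilation, this avoids the ambiguity you hit when matching by input shape alone.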

OK, so I should match best.json with the tasks extracted by the tuning code using shapes and attributes, and then select the top time-consuming tasks for further tuning. Is that right?

And I found that the shapes of the most time-consuming ops in best.json are not the same as those in debugger.log, so choosing the real bottleneck is really difficult. And is it certain that further tuning the bottleneck ops will improve the inference speed?

Thank you very much.