How to further improve the performance of a given schedule?

xutianming · August 25, 2020, 3:01am

Dear developers, I am tuning a Conv1D CNN model on an Intel x86 platform. I wrote my own schedule and AutoTVM tuning template.

How can I further improve the performance of my schedule? I tried the TVM debugger, but it only reports the performance of each single operator.

I want to profile “fused_nn_conv1d_add_nn_relu” in more detail.

tqchen · August 28, 2020, 5:32pm

cc @FrozenGene who might have some past experience in perf investigation

FrozenGene · August 31, 2020, 9:02am

For an Intel x86 target, first read this doc: https://tvm.apache.org/docs/tutorials/optimize/opt_gemm.html, which covers the important TVM schedule primitives and their effects. Second, I recommend reading https://tvm.apache.org/docs/tutorials/autotvm/tune_simple_template.html, which shows how to combine AutoTVM with a schedule to improve performance. Third, you can use Intel VTune to analyze where the bottleneck of the program is (for example, whether LOAD takes too much time, or something else). Fourth, you can look at good libraries, such as Intel oneDNN, to learn what they do to improve performance, and then try to implement the same mechanism in TVM (even with tensorize). These are my experiences and suggestions.
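As a rough illustration of what schedule primitives such as `split` and `reorder` do to a loop nest, here is a minimal sketch in plain Python (not TVM code). The function names, the data layout, and the tile size are assumptions made for this example only. It shows a naive Conv1D loop nest next to a version whose output-length loop is split into tiles and moved innermost, so that each weight is loaded once per tile and the innermost loop walks contiguous data.

```python
def conv1d_naive(x, w):
    """Reference Conv1D. x: [C_in][L], w: [C_out][C_in][K]; stride 1, no padding."""
    c_in, length = len(x), len(x[0])
    c_out, k = len(w), len(w[0][0])
    l_out = length - k + 1
    y = [[0.0] * l_out for _ in range(c_out)]
    for co in range(c_out):
        for lo in range(l_out):
            acc = 0.0
            for ci in range(c_in):
                for kk in range(k):
                    acc += x[ci][lo + kk] * w[co][ci][kk]
            y[co][lo] = acc
    return y


def conv1d_tiled(x, w, tile=4):
    """Same computation with the output-length loop split and reordered.

    `tile` is a hypothetical tuning knob, comparable to a split factor
    that a tuning template would search over.
    """
    c_in, length = len(x), len(x[0])
    c_out, k = len(w), len(w[0][0])
    l_out = length - k + 1
    y = [[0.0] * l_out for _ in range(c_out)]
    for co in range(c_out):
        # Outer loop over tiles of the output length (the "split").
        for lo_outer in range(0, l_out, tile):
            lo_end = min(lo_outer + tile, l_out)  # handle a partial last tile
            for ci in range(c_in):
                for kk in range(k):
                    wv = w[co][ci][kk]  # one weight reused across the tile
                    # Innermost loop over contiguous outputs (the "reorder");
                    # this is the loop a compiler would vectorize.
                    for lo in range(lo_outer, lo_end):
                        y[co][lo] += x[ci][lo + kk] * wv
    return y
```

Both functions return the same values; only the iteration order differs, which is exactly the kind of change a schedule expresses without touching the compute definition.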

xutianming · August 31, 2020, 12:50pm

@FrozenGene @tqchen Thanks for your advice. I have written my own schedule and AutoTVM template. I also tried Intel oneDNN following the BYOC tutorial. My schedule currently outperforms oneDNN by about 0.4 ms on a single CPU core.

I profiled it with VTune and collected a hotspots report.

The schedule applies a lot of unrolling and tiling, and there seems to be no obvious bottleneck.

Do you have any further suggestions for using VTune?
Also, __dlopen appears at the top of the report. Why?

xutianming · August 31, 2020, 1:54pm

FrozenGene · September 1, 2020, 2:35am

The CPI rate is a little high. One possible reason is that we generate too many redundant instructions, so tensorizing the GEMM core may be one solution. Since you already outperform oneDNN, you could compute your CPU efficiency (e.g. 60%, 70%, or …). If you have already reached something like 98% efficiency, there is hardly any room left to improve.
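One way to estimate that efficiency is to divide the achieved FLOP rate by the theoretical peak of the core. The sketch below is a back-of-the-envelope calculation, not a measurement tool: the helper names are hypothetical, and the clock frequency, SIMD width, and number of FMA units are assumptions that must be replaced with the values for the actual CPU.

```python
def conv1d_flops(batch, c_in, c_out, l_out, kernel):
    """FLOPs of a dense Conv1D: one multiply and one add per multiply-accumulate."""
    return 2 * batch * c_out * l_out * c_in * kernel


def peak_flops_per_second(freq_hz, simd_lanes, fma_units, cores=1):
    """Theoretical peak, counting each fused multiply-add as 2 FLOPs per lane.

    All arguments are hardware assumptions, e.g. 8 float32 lanes for AVX2
    or 16 for AVX-512; check the vendor documentation for the real values.
    """
    return cores * freq_hz * simd_lanes * fma_units * 2


def efficiency(flops, seconds, peak):
    """Fraction of the theoretical peak actually achieved (0.0 to 1.0)."""
    return (flops / seconds) / peak


if __name__ == "__main__":
    # Hypothetical layer shape and measured latency, for illustration only.
    flops = conv1d_flops(batch=1, c_in=64, c_out=128, l_out=100, kernel=3)
    # Hypothetical core: 3 GHz, 8 float32 lanes, 2 FMA units, single core.
    peak = peak_flops_per_second(freq_hz=3.0e9, simd_lanes=8, fma_units=2)
    print(f"efficiency: {efficiency(flops, 1.0e-4, peak):.1%}")
```

Note that this number is different from the CPU utilization a profiler reports: a core can be busy nearly 100% of the time while still running well below its peak FLOP rate.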

xutianming · September 1, 2020, 4:52am

@FrozenGene As for the CPU efficiency,

I noticed that VTune reports above 93% CPU utilization (I use only a single thread on a single CPU core). Does that mean there is not much room for improvement?

FrozenGene · September 1, 2020, 5:24am

Yes. If you really want to improve further, you need to analyze more deeply, for example which kinds of instructions hurt performance, and then try to avoid them (e.g. by using tensorize). I think your current performance is already good enough.