For Intel x86 targets, first read the doc at https://tvm.apache.org/docs/tutorials/optimize/opt_gemm.html, which covers the important TVM schedule primitives and their effects. Second, I recommend reading https://tvm.apache.org/docs/tutorials/autotvm/tune_simple_template.html, which shows how to combine AutoTVM with a schedule to improve performance. Third, you could use Intel VTune to analyze where the bottleneck of your program is (for example, whether loads take too much time, or something else). Fourth, look at good libraries such as Intel oneDNN to learn what they do to improve performance, then try to implement the same mechanisms in TVM (possibly even with tensorize). These are my experiences and suggestions.
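To give a rough feel for the blocking idea that the GEMM tutorial expresses with schedule primitives such as `tile`, `split`, and `reorder`, here is a minimal sketch in plain NumPy. It is not TVM code: the function name `blocked_matmul` and the block size of 32 are assumptions chosen only for illustration, and a real schedule would also add vectorization, array packing, and so on.

```python
import numpy as np


def blocked_matmul(a, b, block=32):
    """Multiply a (m x k) by b (k x n) one block at a time.

    Hypothetical helper for illustration only; `block` is an assumed
    tile size, not a tuned value.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    # The three outer loops walk over tiles. Each tile of C is accumulated
    # from small tiles of A and B, which improves cache reuse in a real kernel.
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for k0 in range(0, k, block):
                # Slices clamp at the array edge, so sizes that are not
                # multiples of `block` still work.
                c[i0:i0 + block, j0:j0 + block] += (
                    a[i0:i0 + block, k0:k0 + block]
                    @ b[k0:k0 + block, j0:j0 + block]
                )
    return c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((100, 70))
    b = rng.standard_normal((70, 90))
    print(np.allclose(blocked_matmul(a, b), a @ b))
```

In TVM the same loop structure would come from the schedule rather than hand-written loops, and the block size is the kind of knob an AutoTVM template would expose for tuning.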
@FrozenGene @tqchen Thanks for your advice.
I have written my own schedule and AutoTVM template. I also tried Intel oneDNN, following the BYOC tutorial. I currently outperform oneDNN by about 0.4 ms on a single CPU core.
I profiled it with VTune and collected a hotspots report.
The CPI rate is a little high. One reason may be that we generate too many redundant instructions, so tensorizing the GEMM core may be one solution. Since you already perform better than oneDNN, you could compute the CPU efficiency (like 60%, 70%, or …); if you have reached something like 98% efficiency, it will be hard to improve further.
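As a rough sketch of what "computing the CPU efficiency" could look like: divide the achieved FLOP rate of the kernel by the theoretical peak of the core. All hardware numbers below (3.0 GHz, 256-bit SIMD, 2 FMA units, fp32) and the GEMM size and timing are hypothetical placeholders, not measurements from this thread, and the helper names are made up for illustration; substitute the values for your own CPU and workload.

```python
def peak_gflops(freq_ghz, simd_width_bits, fma_units, dtype_bits=32, cores=1):
    """Theoretical peak GFLOP/s, assuming every cycle issues FMAs on all units."""
    lanes = simd_width_bits // dtype_bits      # elements per SIMD register
    flops_per_cycle = lanes * 2 * fma_units    # one FMA counts as 2 FLOPs
    return freq_ghz * flops_per_cycle * cores


def achieved_gflops(m, n, k, seconds):
    """GFLOP/s of an (m x k) by (k x n) GEMM that took `seconds` to run."""
    return 2.0 * m * n * k / seconds / 1e9


def efficiency(m, n, k, seconds, **hw):
    """Fraction of the theoretical peak actually reached (0.0 to 1.0)."""
    return achieved_gflops(m, n, k, seconds) / peak_gflops(**hw)


if __name__ == "__main__":
    # Hypothetical single core: 3.0 GHz, AVX2 (256-bit), 2 FMA units, fp32.
    hw = dict(freq_ghz=3.0, simd_width_bits=256, fma_units=2)
    # Hypothetical measurement: 1024 x 1024 x 1024 GEMM in 30 ms.
    eff = efficiency(1024, 1024, 1024, 0.030, **hw)
    print(f"peak = {peak_gflops(**hw):.1f} GFLOP/s, efficiency = {eff:.1%}")
```

Note that the real peak also depends on the sustained clock frequency under SIMD load, so the result is only an estimate. For a convolution, the FLOP count would be computed from the layer shape instead of `2 * m * n * k`.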
I noticed that VTune reported above 93% CPU utilization. (I only use a single thread on a single CPU core.)
Does it mean there is not much room for improvement?
Yes. If you really want to improve further, you need to analyze more deeply, for example which kinds of instructions hurt performance, and then try to avoid them (for example by using tensorize). I think your current performance is already good enough.