Dear all, I am new to TVM and am comparing the performance of AutoTVM against DNNL on the x86 backend. For some conv2d workloads, such as input NCHW[128,64,56,56] with kernel OIHW[64,64,3,3], DNNL is about 30% faster on my platform.
I checked the memory layout that DNNL uses against what AutoTVM suggests:
DNNL uses nChw16c for the input and OIhw16i16o for the weights.
AutoTVM suggests nChw64c for the input and OIhw64i64o for the weights.
I would be glad to do something to improve AutoTVM's performance for this problem size, but I need your kind help with two questions:
How can I inspect AutoTVM's search space for this problem size? In other words, does AutoTVM search the nChw16c memory layout at all?
If AutoTVM does search this layout, why doesn't it pick it as the best result?
The selection of the [x]c layout comes from the graph tuner. Specifically, you have gone through the following process:
1. Tune all tasks with AutoTVM. This step tries all NCHW[x]c candidates for each op in isolation, so you should find every candidate together with its single-op latency in the log file.
2. Provide the log file to the graph tuner, which performs the following steps (see the sketch after this list):
2.1. Pick the best config of NCHW[x]c for every x that appears in the log file.
2.2. Benchmark the latency of the layout transforms between each pair of x values.
2.3. Use a dynamic programming algorithm (by default) to determine the best x for each conv2d layer.
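For reference, this is roughly how the two steps are wired together in the tune_relay_x86 tutorial. It is a minimal sketch, assuming a Relay module `mod` with `params`; the target string, input name, shape, and file names are placeholders, and exact API names can differ slightly between TVM releases:

```python
from tvm import relay, autotvm
from tvm.autotvm.tuner import XGBTuner
from tvm.autotvm.graph_tuner import DPTuner

target = "llvm -mcpu=skylake-avx512"   # placeholder, match your CPU
log_file = "conv2d_tuning.log"         # output of step 1
graph_opt_sch_file = "graph_opt.log"   # output of step 2
input_name, dshape = "data", (128, 64, 56, 56)

# Step 1: tune every conv2d task; every NCHW[x]c candidate and its
# single-op latency end up in log_file.
def tune_kernels(mod, params):
    tasks = autotvm.task.extract_from_program(
        mod["main"], target=target, params=params,
        ops=(relay.op.get("nn.conv2d"),))
    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=10, repeat=1, min_repeat_ms=1000))
    for task in tasks:
        tuner = XGBTuner(task, loss_type="rank")
        tuner.tune(n_trial=min(2000, len(task.config_space)),
                   measure_option=measure_option,
                   callbacks=[autotvm.callback.log_to_file(log_file)])

# Step 2: the graph tuner picks the best x per layer, accounting for the
# layout-transform costs described in 2.1-2.3.
def tune_graph(mod):
    executor = DPTuner(mod["main"], {input_name: dshape}, log_file,
                       [relay.op.get("nn.conv2d")], target)
    executor.benchmark_layout_transform(min_exec_num=2000)
    executor.run()
    executor.write_opt_sch2record_file(graph_opt_sch_file)
```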
Given this flow, the graph tuner may select 64c instead of 16c for the following reason. Suppose the layout of the previous layer is NCHW32c and the layout of the next layer is NCHW8c. The graph tuner picks 64c when
Latency(NCHW64c) + Latency(32c to 64c) + Latency(64c to 8c) <
Latency(NCHW16c) + Latency(32c to 16c) + Latency(16c to 8c).
As a result, any of the terms in the above inequality could be the reason why the graph tuner made this choice.
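To make that concrete, here is a toy numeric illustration with made-up latencies (not measurements from your workload):

```python
# Purely hypothetical numbers (ms) to illustrate the DP trade-off.
lat_op  = {"16c": 1.00, "64c": 1.10}   # conv2d alone: 16c wins in isolation
lat_in  = {"16c": 0.30, "64c": 0.05}   # transform from the previous layer's 32c
lat_out = {"16c": 0.25, "64c": 0.05}   # transform to the next layer's 8c

total = {x: lat_op[x] + lat_in[x] + lat_out[x] for x in lat_op}
# total is roughly {"16c": 1.55, "64c": 1.20}: the graph tuner picks 64c even
# though 16c is faster as a standalone op.
```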
I see result.error_no is 6 in the tuning log entries for nChw16c and OIhw16i16o. What is the possible reason behind this error? Meanwhile, nChw64c does have the minimum result.cost across the whole task space, with no error.
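This is how I check the error codes in the log; a minimal sketch, with the log file name as a placeholder:

```python
from collections import Counter
from tvm import autotvm

# "conv2d_tuning.log" is a placeholder for my AutoTVM tuning log.
records = autotvm.record.load_from_file("conv2d_tuning.log")
print(Counter(res.error_no for _inp, res in records))
```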
Thanks @comaniac, that helps. I fixed the error by increasing the compile timeout.
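For anyone hitting the same thing: as far as I can tell error_no 6 corresponds to tvm.autotvm.measure.MeasureErrorNo.BUILD_TIMEOUT, so I simply raised the builder timeout when retuning. A minimal sketch; the numbers are just what I used, not recommendations:

```python
from tvm import autotvm

measure_option = autotvm.measure_option(
    # Raise the per-candidate compile timeout (seconds) so the larger
    # nChw16c/OIhw16i16o kernels have time to finish building.
    builder=autotvm.LocalBuilder(timeout=60),
    runner=autotvm.LocalRunner(number=10, repeat=1, min_repeat_ms=1000),
)
```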
I still have another question: I profiled the TVM run with tvm.contrib.debugger.debug_runtime and found that some ops' actual running time is up to 2x longer than the best results in the AutoTVM log produced by the graph tuner (tuner.write_opt_sch2record_file(opt_sch_file)).
Any thoughts on this kind of problem?
I suspect there is some thread contention. I mean, some ops could be computed in parallel at the inter-op level; would these ops use the same thread pool at the same time? BTW, I built TVM with the CMake OpenMP option enabled.
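To rule the thread pool in or out, I re-profile with the debug runtime while pinning the thread count. A minimal sketch, assuming `mod`, `params`, and `target` from the tuning above and a TVM version where relay.build still unpacks into (graph, lib, params); the thread count, file names, and dump path are placeholders. My understanding is that TVM_NUM_THREADS sizes TVM's own thread pool, while OMP_NUM_THREADS applies when TVM is built against OpenMP.

```python
import os
os.environ["TVM_NUM_THREADS"] = "16"   # placeholder; pin before the first run

import numpy as np
import tvm
from tvm import relay, autotvm
from tvm.contrib.debugger import debug_runtime

# Build with the graph-tuned schedules applied.
with autotvm.apply_graph_best("graph_opt.log"):
    with tvm.transform.PassContext(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, params=params)

ctx = tvm.cpu(0)
m = debug_runtime.create(graph, lib, ctx, dump_root="/tmp/tvmdbg")
m.set_input(**params)
m.set_input("data", np.random.uniform(size=(128, 64, 56, 56)).astype("float32"))
m.run()   # dumps per-op timings to compare against the AutoTVM log
```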