How to improve autoTVM performance

Dear all, I am new to TVM, and I am comparing performance between autoTVM and DNNL on the x86 backend. For some conv2d operations, such as input NCHW[128,64,56,56] with kernel OIHW[64,64,3,3], DNNL has 30% better performance on my platform.

I checked the memory layout that DNNL uses against what autoTVM suggests:

  • DNNL uses nChw16c for the input and OIhw16i16o for the weights
  • autoTVM suggests nChw64c for the input and OIhw64i64o for the weights

I would be glad to do something to improve autoTVM's performance for this problem size, but I need your kind help:

  • How can I see the search space of autoTVM for this problem size? That is, does autoTVM search the nChw16c memory layout or not? (A sketch of one way to inspect this follows below.)
  • If autoTVM does search this memory layout, why does it not pick it as the best result?
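A minimal sketch of one way to inspect that search space, assuming the workload is built in Relay; the padding, target string, and variable names below are my own assumptions, and module paths may differ slightly between TVM versions:

```python
import tvm
from tvm import relay, autotvm

# Build the conv2d workload in question (padding/stride are assumptions).
data = relay.var("data", shape=(128, 64, 56, 56), dtype="float32")   # NCHW[128,64,56,56]
weight = relay.var("weight", shape=(64, 64, 3, 3), dtype="float32")  # OIHW[64,64,3,3]
conv = relay.nn.conv2d(data, weight, channels=64, kernel_size=(3, 3), padding=(1, 1))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

# Extract the AutoTVM tasks and print each task's config space; the channel-tiling
# knobs listed there determine which nChw[x]c / OIhw[x]i[x]o layouts get searched.
tasks = autotvm.task.extract_from_program(
    mod["main"], target="llvm -mcpu=skylake-avx512", params={},
    ops=(relay.op.get("nn.conv2d"),))
for task in tasks:
    print(task.workload)
    print(task.config_space)
```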

Looking forward to your reply.


The selection of 64c came from the graph tuner. Specifically, you have gone through the following process:

  1. Tune all tasks with AutoTVM. This step tries all NCHW[x]c candidates for each op in isolation, so you should find every candidate with its single-op latency in the log file.

  2. Provide the log file to the graph tuner, which does the following (a sketch of this step follows the list):
    2.1. Pick the best config of NCHW[x]c for every x that appears in the log file.
    2.2. Benchmark the latency of the layout transforms between each pair of x values.
    2.3. Use a dynamic programming algorithm (by default) to determine the best x for each conv2d layer.
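For reference, a minimal sketch of step 2 in code, roughly following TVM's x86 tuning tutorial; the file names, input shape, and target below are placeholders, and mod stands for the Relay module whose tasks were tuned in step 1:

```python
from tvm import relay
from tvm.autotvm.graph_tuner import DPTuner

records = "conv2d_ops.log"             # placeholder: per-op AutoTVM log from step 1
opt_sch_file = "conv2d_graph_opt.log"  # placeholder: output file for the graph tuner
target = "llvm -mcpu=skylake-avx512"   # placeholder target

# mod is the Relay module tuned in step 1 (placeholder, not defined here).
executor = DPTuner(mod["main"], {"data": (128, 64, 56, 56)}, records,
                   [relay.op.get("nn.conv2d")], target)
executor.benchmark_layout_transform(min_exec_num=2000)  # step 2.2: measure layout-transform latency
executor.run()                                          # step 2.3: dynamic programming over x
executor.write_opt_sch2record_file(opt_sch_file)        # dump the graph-level best schedules
```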

Given this flow, the graph tuner may select 64c instead of 16c for the following reason. Assume the layout of the previous layer is NCHW32c and the layout of the next layer is NCHW8c. The graph tuner chooses 64c when:

Latency(NCHW64c) + Latency(32c to 64c) + Latency(64c to 8c) <
Latency(NCHW16c) + Latency(32c to 16c) + Latency(16c to 8c)

As a result, any term in the above inequality may be the root cause of why the graph tuner made this choice.

In addition, you may read the graph tuner paper for more technical details: https://www.usenix.org/system/files/atc19-liu-yizhi.pdf

Thanks for the kind reply @comaniac. I have checked the autoTVM op tuning log, and I still have two questions:

cost is the kernel running latency in seconds; all_cost is the total time including other overheads such as compilation and measurement. We usually focus on cost only.
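In case it is useful, a small sketch of reading those fields back from a tuning log (the log file name is a placeholder):

```python
from tvm import autotvm

# Load (MeasureInput, MeasureResult) pairs from an AutoTVM log file.
records = list(autotvm.record.load_from_file("conv2d_ops.log"))
ok = [r for r in records if r[1].error_no == 0]  # keep successful measurements only
best_inp, best_res = min(ok, key=lambda r: sum(r[1].costs) / len(r[1].costs))
print(best_res.costs)     # per-run kernel latencies in seconds ("cost")
print(best_res.all_cost)  # wall-clock time including compile/measure overhead ("all_cost")
```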

Error code 6 indicates a build timeout. See the following for the detailed error codes:
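For reference, a paraphrase of the MeasureErrorNo values from the AutoTVM source (double-check against your TVM version, since the module path and values can change):

```python
from tvm.autotvm.measure import MeasureErrorNo

# MeasureErrorNo values (as of the TVM releases around this thread):
#   0 NO_ERROR             no error
#   1 INSTANTIATION_ERROR  error instantiating the schedule template with a config
#   2 COMPILE_HOST         error compiling code on the host (e.g. tvm.build)
#   3 COMPILE_DEVICE       error compiling/JITing code on the device
#   4 RUNTIME_DEVICE       error while running the program on the device
#   5 WRONG_ANSWER         output does not match the reference result
#   6 BUILD_TIMEOUT        timeout during compilation
#   7 RUN_TIMEOUT          timeout during the run
#   8 UNKNOWN_ERROR        unknown error
print(MeasureErrorNo.BUILD_TIMEOUT)  # 6
```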

Thanks @comaniac, it helps. I fixed the error by increasing the compile timeout.
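For anyone hitting the same error, a sketch of where that timeout lives (the numbers below are placeholders):

```python
from tvm import autotvm

# Error code 6 means the default build timeout was too short for this config.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=60),                    # raise the build timeout (seconds)
    runner=autotvm.LocalRunner(number=10, repeat=1, timeout=10),
)
# Pass measure_option to tuner.tune(...) as usual.
```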

I still have another question: I profiled the TVM run with tvm.contrib.debugger.debug_runtime, and I found that some ops' actual running times are up to 2x longer than in the autoTVM best-result log generated by the graph tuner (tuner.write_opt_sch2record_file(opt_sch_file)).
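For context, a minimal sketch of per-op profiling with the debug runtime (older graph-runtime style API; graph, lib, ctx, and params are placeholders coming from relay.build, and the module was renamed in later TVM releases):

```python
from tvm.contrib.debugger import debug_runtime

# graph, lib, params come from relay.build(...); ctx is e.g. tvm.cpu(0).
m = debug_runtime.create(graph, lib, ctx, dump_root="/tmp/tvmdbg")
m.set_input(**params)
m.run()  # prints a per-op time breakdown and writes trace files under dump_root
```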

Any thoughts on this kind of problem?

I suspect there is some thread contention. I mean, some ops could be computed concurrently at the inter-op level; would these ops use the same thread pool at the same time? BTW, I may have built TVM with the CMake OpenMP option enabled.

You can try turning off the OpenMP option so that TVM uses its own thread pool, and see if that solves your problem.
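If it helps with diagnosing the contention, the size of TVM's own thread pool can also be pinned explicitly through the TVM_NUM_THREADS environment variable (the core count below is a placeholder):

```python
import os

# Placeholder core count; set before the TVM runtime spins up its worker threads.
os.environ["TVM_NUM_THREADS"] = "16"

import tvm  # import after setting the variable so the thread pool picks it up
```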