Ansor extracts different workloads from the same model for CPU and GPU targets

I was trying to use Ansor to optimize a ResNet model. However, I observed that for different targets (e.g., CPU and GPU), although the total number of extracted subgraphs/workloads is the same, the specific subgraphs/workloads are different.
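For reference, I extracted the tasks roughly along these lines (a minimal sketch: the ResNet-50 workload from relay.testing stands in for my actual model, and the target strings are illustrative):

import tvm
from tvm import auto_scheduler
from tvm.relay import testing

# Stand-in model; my actual ResNet model is loaded elsewhere.
mod, params = testing.resnet.get_workload(num_layers=50, batch_size=1)

for target_str in ["llvm", "cuda"]:
    target = tvm.target.Target(target_str)
    tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
    print("%s: %d tasks" % (target_str, len(tasks)))
    for i, task in enumerate(tasks):
        print("  Task %d (workload key: %s)" % (i, task.workload_key))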

For instance, the workload for the GPU:

========== Task 7  (workload key: ["6f0503383aee3dbb94006cc087e0349a"]) ==========
placeholder = PLACEHOLDER [1, 256, 14, 14]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 15)) && (i3 >= 1)) && (i3 < 15)), placeholder[i0, i1, (i2 - 1), (i3 - 1)], 0f)
placeholder = PLACEHOLDER [256, 256, 3, 3]
compute(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*placeholder[ff, rc, ry, rx])
placeholder = PLACEHOLDER [1, 256, 14, 14]
T_add(ax0, ax1, ax2, ax3) = (compute[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
placeholder = PLACEHOLDER [1, 256, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, 0, 0])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

The workload for CPU:

========== Task 7  (workload key: ["629dcd4733ec6363e001d9ddb446bb31"]) ==========
placeholder = PLACEHOLDER [1, 2, 14, 14, 128]
data_pad(i0, i1, i2, i3, i4) = tir.if_then_else(((((i2 >= 1) && (i2 < 15)) && (i3 >= 1)) && (i3 < 15)), placeholder[i0, i1, (i2 - 1), (i3 - 1), i4], 0f)
placeholder = PLACEHOLDER [16, 2, 3, 3, 128, 16]
conv2d_NCHWc(n, oc_chunk, oh, ow, oc_block) += (data_pad[n, floordiv(ic, 128), (oh + kh), (ow + kw), floormod(ic, 128)]*placeholder[oc_chunk, floordiv(ic, 128), kh, kw, floormod(ic, 128), oc_block])
placeholder = PLACEHOLDER [1, 16, 14, 14, 16]
T_add(ax0, ax1, ax2, ax3, ax4) = (conv2d_NCHWc[ax0, ax1, ax2, ax3, ax4] + placeholder[ax0, ax1, ax2, ax3, ax4])
placeholder = PLACEHOLDER [1, 16, 1, 1, 16]
T_add(ax0, ax1, ax2, ax3, ax4) = (T_add[ax0, ax1, ax2, ax3, ax4] + placeholder[ax0, ax1, 0, 0, ax4])
T_relu(ax0, ax1, ax2, ax3, ax4) = max(T_add[ax0, ax1, ax2, ax3, ax4], 0f)

Thus, I have a few questions about this:

  1. What are the reasons behind this phenomenon?
  2. Is there a way to extract the same subgraphs/workloads from a DNN model across different targets (e.g., CPU and GPU)?
  3. I am not very familiar with the representation of the workloads, so I am wondering: do the two different workloads still represent the same thing (e.g., the same sub-graph of the DNN)? Or is there a way to measure the similarity/difference between two workloads?

Any comments or insights?

Thanks!

This is expected behavior, because the CPU and GPU may use different computes when lowering from Relay. Specifically, the CPU one uses the NCHWc compute, while the GPU one uses NCHW.

@Lizhi-Liao

  1. Most computes should be the same. Some computes are different because we may slightly tweak them to make them more suitable for different backends. The dispatch strategy is defined in the Relay op strategy (tvm/cuda.py at main · apache/tvm · GitHub, tvm/x86.py at main · apache/tvm · GitHub), which picks the corresponding TOPI compute definition.
  2. Modify the op strategy files above so that both targets dispatch to the same TOPI compute.
  3. Yes, the computes should represent the same thing: the inputs and outputs are the same, just with different intermediate layouts. One rough way to check this is sketched below.
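Regarding (3), you can extract the tasks for both targets and diff the workload keys; identical keys imply identical compute DAGs. A minimal sketch, assuming mod and params hold your Relay model:

import tvm
from tvm import auto_scheduler

def task_dags(mod, params, target_str):
    tasks, _ = auto_scheduler.extract_tasks(
        mod["main"], params, tvm.target.Target(target_str)
    )
    # The workload key hashes the compute DAG together with the shapes,
    # so identical keys mean identical extracted computes.
    return {t.workload_key: str(t.compute_dag) for t in tasks}

cpu_tasks = task_dags(mod, params, "llvm")
gpu_tasks = task_dags(mod, params, "cuda")
shared = cpu_tasks.keys() & gpu_tasks.keys()
print("shared:", len(shared),
      "cpu-only:", len(cpu_tasks) - len(shared),
      "gpu-only:", len(gpu_tasks) - len(shared))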

For your specific problem, I found that you used the NCHW layout. This is why you get different computes. Ansor favors the NHWC layout, so we only optimized that case. For NCHW, the dispatch mechanism in Relay will extract wrong tasks from AutoTVM templates. If you convert your model to NHWC (TLCBench/utils.py at 57eef4850bca6f1d35d1f1fb2ec41caef660f4a2 · tlc-pack/TLCBench · GitHub), you should get the same tasks. If you look at the tutorials for CPU and GPU (Auto-scheduling a Neural Network for x86 CPU — tvm 0.8.dev0 documentation, Auto-scheduling a Neural Network for NVIDIA GPU — tvm 0.8.dev0 documentation), you can see that their tasks are the same.
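The conversion in that TLCBench helper is essentially Relay's ConvertLayout pass; a minimal sketch, assuming mod is your Relay module:

import tvm
from tvm import relay

# Convert layout-sensitive ops to NHWC; "default" lets the pass pick
# the kernel layout that matches the new data layout.
desired_layouts = {"nn.conv2d": ["NHWC", "default"]}
seq = tvm.transform.Sequential(
    [
        relay.transform.RemoveUnusedFunctions(),
        relay.transform.ConvertLayout(desired_layouts),
    ]
)
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)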

Many thanks to @merrymercy for the informative answer. It helps me a lot! Also, thanks to @comaniac for the help!