Strategy to choose optimal Layout on x86

Hello,

I am studying the data layout conversion inside TVM. I understand that for x86, the NCHW layout is tiled on the C dimension. I am interested in the logic used to choose the tiling factor x for NCHW[x]c. Can anyone point me to the code where I can study this? I found that conv2d_alter_op_layout uses “tile_ic” and “tile_oc” to decide the tiling factor, and I have been tracing the places where these two knobs are modified. In conv2d_avx_1x1.py (in apache/tvm on GitHub), it seems that the tiling factor is decided on the basis of the SIMD vector length, but that doesn’t seem to be the whole story.
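For reference, my reading of the fallback path is roughly the following sketch (my own simplification, not TVM's actual code; the function name and the AVX-512 width of 16 fp32 lanes are my assumptions):

```python
# Hypothetical sketch of the kind of fallback heuristic I see in
# conv2d_avx_1x1.py: pick the largest factor of the channel count that
# does not exceed the SIMD vector width (e.g. 16 fp32 lanes on AVX-512).
def pick_tile_factor(channels: int, simd_width: int = 16) -> int:
    for candidate in range(simd_width, 0, -1):
        if channels % candidate == 0:
            return candidate
    return 1

print(pick_tile_factor(512))  # 16: the full vector width divides 512
print(pick_tile_factor(24))   # 12: 24 is not divisible by 16
```

This would only explain the untuned default; it doesn't explain the per-layer factors I see after tuning.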

Also, I want to know whether the time for the layout conversion is included in the final GFLOPS calculation.

Thanks.


There are 3 cases:

  1. You didn’t tune the model using AutoTVM. In this case, [c] is selected based on the default TOPI schedule. Since most default schedules use the same [c], there is almost no layout transform overhead, but the [c] might not be optimal, of course.
  2. You tuned the model using AutoTVM. In this case, [c] is selected based on the best schedule of each tuning task (e.g., conv2d) from the tuning log. As you can imagine, it’s possible that the first conv2d is NCHW8c and the second conv2d is NCHW4c. In this case, a layout transform is inserted when building the model. Since the process is (tuning conv2ds) -> (insert layout transform), the layout transform latency won’t be included in either conv2d’s latency during the tuning process.
  3. You tuned the model using AutoTVM followed by the graph tuner. The graph tuner is based on 1) at most 20 candidates from each conv2d’s tuning log (each candidate has a different [x]), and 2) the benchmarked layout transform latency. For example, when determining the [x] of conv2, the graph tuner leverages the following dynamic programming equation:
Latency = min(conv1(x=a) + transform(a, b) + conv2(x=b), conv1(x=a) + conv2(x=a))

When Latency(transform(a, b) + conv2(x=b)) is better than Latency(conv2(x=a)), a layout transform will be inserted.
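The decision above can be sketched in a few lines (a toy illustration of the equation, not TVM's actual graph tuner API; the layouts and latency numbers are made up):

```python
# Toy sketch of the graph tuner's choice for conv2 given conv1's layout a:
# either keep layout a, or pay a transform a -> b and run conv2 with b.
conv2_latency = {"NCHW8c": 1.4, "NCHW4c": 1.0}   # benchmarked candidates
transform_latency = {("NCHW8c", "NCHW4c"): 0.3}  # benchmarked transforms

def best_choice(a):
    costs = {}
    for b, lat in conv2_latency.items():
        # No transform cost if conv2 keeps conv1's layout.
        costs[b] = lat + (0 if b == a else transform_latency[(a, b)])
    return min(costs, key=costs.get)

print(best_choice("NCHW8c"))  # "NCHW4c": 1.0 + 0.3 beats 1.4
```

Here the transform pays for itself, so a layout transform node is inserted between the two convs.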

For (3), you could refer to this paper for details: https://www.usenix.org/system/files/atc19-liu-yizhi.pdf


Thank you for the insight. This helps. I am not using AutoTVM, I am using the AutoScheduler, and I could see the layout change well before tuning starts. I ran the script tune_network_x86.py from the tutorial section, and in the extracted-task output I could see that each task uses a different layout, i.e., each input is tiled with a different tile size. Here is a sample of what task extraction prints when the layout is specified as NCHW:

```
===========================================================================
placeholder = PLACEHOLDER [1, 1, 28, 28, 512]
placeholder = PLACEHOLDER [16, 1, 1, 1, 512, 64]
conv2d_NCHWc(n, oc_chunk, oh, ow, oc_block) += (placeholder[n, floordiv(ic, 512), ((oh*2) + kh), ((ow*2) + kw), floormod(ic, 512)]*placeholder[oc_chunk, floordiv(ic, 512), kh, kw, floormod(ic, 512), oc_block])
===========================================================================
```

As you can see, it uses the conv2d_NCHWc operator instead of conv2d. This makes sense, because x86 is expected to use a tiled data layout, but I could not understand how the tiling factor is selected here, and the tiling factor differs from layer to layer. This happens at the Relay level: the complete layout transformation is not performed there, but the tiling factor is already decided in order to rewrite the inputs to the desired layout. Can you please provide some insight for this case?
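For concreteness, my mental model of what the NCHW -> NCHW[x]c relayout does to a tensor is the following (plain NumPy, not TVM code; the shapes are arbitrary examples):

```python
import numpy as np

# The C dimension is split into C//x chunks of x channels, and the
# x-sized block becomes the innermost, contiguous axis.
x = 8
data = np.arange(1 * 32 * 4 * 4, dtype=np.float32).reshape(1, 32, 4, 4)  # NCHW
n, c, h, w = data.shape
tiled = data.reshape(n, c // x, x, h, w).transpose(0, 1, 3, 4, 2)        # NCHW8c
print(tiled.shape)  # (1, 4, 4, 4, 8)
```

So the open question for me is only how x itself is chosen per layer before tuning has produced any measurements.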

Thank you.

This is another story. AutoScheduler doesn’t have a graph tuner, so it won’t consider the layout transform overhead when tuning a conv, which is similar to (2). For now, we haven’t supported graph-level optimization in AutoScheduler, so if you’re using AutoScheduler, it is suggested to convert the layout of your model from NCHW to NHWC if possible to achieve the best performance.


Yes, I understand that NHWC is suggested for performance, but I am interested in the idea used to choose the tile size for NCHW, since I can see data layout tiling here as well. The output of the compute DAG also shows this optimisation. I understand it happens through the AlterOpLayout pass, but I am unable to trace the logic that chooses a specific tile size. Can you please help with this?

When building the model with tuning logs, AutoScheduler performs layout rewriting to replace the NCHW compute with NCHW[x]c, according to the [x] with the best latency in the tuning log.
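The selection itself is just "best record wins" per workload. A rough sketch (my own toy data structures, not AutoScheduler's real log format):

```python
# For each conv workload, keep the tuning record with the lowest measured
# latency and adopt the tile factor it used. Workload names, fields, and
# numbers here are made up for illustration.
records = [
    {"workload": "conv1", "tile_c": 8,  "latency_ms": 1.9},
    {"workload": "conv1", "tile_c": 16, "latency_ms": 1.2},
    {"workload": "conv2", "tile_c": 4,  "latency_ms": 0.8},
]

best = {}
for rec in records:
    key = rec["workload"]
    if key not in best or rec["latency_ms"] < best[key]["latency_ms"]:
        best[key] = rec

chosen = {k: v["tile_c"] for k, v in best.items()}
print(chosen)  # conv1 rewritten to NCHW16c, conv2 to NCHW4c
```

Since each workload's winner is chosen independently, different layers can end up with different [x], which matches the per-layer tile sizes you observed.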
