How does TVM eliminate calls of conv weight layout_transform?

Hi everyone!
I modified this sample (https://tvm.apache.org/docs/tutorials/frontend/from_pytorch.html) to add a desired layout of NHWC to a network saved from PyTorch (which uses NCHW):

 desired_layouts = {'qnn.conv2d': ['NHWC', 'HWIO'],
                     'nn.conv2d': ['NHWC', 'HWIO']
                     }
 # RemoveUnusedFunctions is used to clean up the graph.
 seq = tvm.transform.Sequential([relay.transform.RemoveUnusedFunctions(),
                                 relay.transform.ConvertLayout(desired_layouts)]
                                 )
 with tvm.transform.PassContext(opt_level=3):
     mod = seq(mod)
 print(mod)

The dump of mod is as expected: the input/output and each layer’s weights come with a layout_transform, for example:

  %0 = layout_transform(%input0, src_layout="NCHW", dst_layout="NHWC") /* ty=Tensor[(1, 224, 224, 3), float32] */;
  %1 = layout_transform(%conv1.weight, src_layout="OIHW", dst_layout="HWIO") /* ty=Tensor[(7, 7, 3, 64), float32] */;
  %2 = nn.conv2d(%0, %1, strides=[2, 2], padding=[3, 3, 3, 3], channels=64, kernel_size=[7, 7], data_layout="NHWC", kernel_layout="HWIO") /* ty=Tensor[(1, 112, 112, 64), float32] */;
  %3 = nn.batch_norm(%2, %bn1.weight, %bn1.bias, %bn1.running_mean, %bn1.running_var, axis=3) /* ty=(Tensor[(1, 112, 112, 64), float32], Tensor[(64), float32], Tensor[(64), float32]) */;
  %4 = %3.0;
  %5 = nn.relu(%4) /* ty=Tensor[(1, 112, 112, 64), float32] */;
  %6 = nn.max_pool2d(%5, pool_size=[3, 3], strides=[2, 2], padding=[1, 1, 1, 1], layout="NHWC") /* ty=Tensor[(1, 56, 56, 64), float32] */;
  %7 = layout_transform(%layer1.0.conv1.weight, src_layout="OIHW", dst_layout="HWIO") /* ty=Tensor[(3, 3, 64, 64), float32] */;
  %8 = nn.conv2d(%6, %7, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3], data_layout="NHWC", kernel_layout="HWIO") /* ty=Tensor[(1, 56, 56, 64), float32] */;

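For context, I compiled and ran the converted module on GPU roughly like this (a sketch of my setup based on the tutorial; graph_executor/tvm.cuda may be named graph_runtime/tvm.gpu in older TVM versions):

    # Rough sketch of the compile-and-run step; mod/params/img come from
    # relay.frontend.from_pytorch(...) and the tutorial's preprocessing.
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="cuda", params=params)

    dev = tvm.cuda(0)
    runner = graph_executor.GraphModule(lib["default"](dev))
    runner.set_input("input0", tvm.nd.array(img, dev))
    runner.run()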
However, when I checked the GPU trace, there are only two layout transforms:

[CUDA memcpy HtoD]
**fused_layout_transform_11_kernel0** [513]
fused_nn_conv2d_add_nn_relu_7_kernel0 [517]
fused_nn_max_pool2d_kernel0 [521]
fused_nn_conv2d_add_nn_relu_6_kernel0 [525]
fused_nn_conv2d_add_add_nn_relu_3_kernel0 [529]
fused_nn_conv2d_add_nn_relu_6_kernel0 [532]
fused_nn_conv2d_add_add_nn_relu_3_kernel0 [535]
fused_nn_conv2d_add_nn_relu_5_kernel0 [539]
fused_nn_conv2d_add_kernel0 [543]
fused_nn_conv2d_add_add_nn_relu_2_kernel0 [547]
fused_nn_conv2d_add_nn_relu_4_kernel0 [551]
fused_nn_conv2d_add_add_nn_relu_2_kernel0 [554]
fused_nn_conv2d_add_nn_relu_3_kernel0 [558]
fused_nn_conv2d_add_1_kernel0 [562]
fused_nn_conv2d_add_add_nn_relu_1_kernel0 [566]
fused_nn_conv2d_add_nn_relu_2_kernel0 [570]
fused_nn_conv2d_add_add_nn_relu_1_kernel0 [573]
fused_nn_conv2d_add_nn_relu_1_kernel0 [577]
fused_nn_conv2d_add_2_kernel0 [581]
fused_nn_conv2d_add_add_nn_relu_kernel0 [585]
fused_nn_conv2d_add_nn_relu_kernel0 [589]
fused_nn_conv2d_add_add_nn_relu_kernel0 [592]
fused_nn_adaptive_avg_pool2d_kernel0 [596]
**fused_layout_transform_reshape_squeeze_kernel0** [600]
fused_nn_dense_add_kernel0 [604]
[CUDA memcpy DtoH]

I’m quite confused here: does this mean all of these kernels accept NHWC input while still using OIHW filter parameters? Or does TVM transform the weight parameters in advance, since there is no need to transform the filters more than once?

PS: I’m working on loading a PyTorch model (which is NCHW by default) into TVM and running it entirely in NHWC (including the input/output and every conv layer), so I expect there should be no layout_transform calls at all. Am I right?

After reading these two links:

https://discuss.tvm.apache.org/t/layout-conversion-pass/4009/15

https://tvm.apache.org/docs/dev/convert_layout.html

I’m still confused about my case: if I run an NCHW PyTorch model in TVM with NHWC input/output/conv2d, should the final execution contain no layout_transform calls at all, provided my code is set up correctly? Are any changes needed to TVM itself, such as adding something to frontend/pytorch.py?

Thanks a lot!

If the original model layout is NCHW and you convert to NHWC in TVM, at least two layout transformations are required: one at the beginning and one at the end.

Thanks for the reply, Kevin! Those two layout transforms make sense, but the filter parameters are loaded from the .pth file as OIHW by default (relay/frontend/pytorch.py), and I set the desired layout to HWIO. Will these filter parameters be transformed in advance, or by a CUDA kernel on every inference?

I guess they should be converted only once, since these parameters are effectively constant with respect to inference. Could someone give me a hint as to which part of the code is responsible for this? In my own model I observed as many layout_transform calls as conv calls, so something is wrong with my setup. In comparison, the GPU trace of the TVM ResNet sample shows only two layout transforms, which is expected.

I’m a complete beginner with the TVM code base; where should I start? Thanks a lot.

Yes, the weight layout transformations should be optimized away by constant folding.
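
Roughly speaking, the flow looks like the sketch below (an illustration of the idea, not the exact pass pipeline relay.build runs internally): once the weights are bound into the module as constants, FoldConstant evaluates each layout_transform on a constant weight at compile time, so only the boundary transforms remain at runtime. relay.build effectively does this binding and folding when you pass params, which is why the ResNet sample’s trace shows only the two boundary transforms.

    # Sketch: bind weights as constants, convert layout, then fold the
    # now-constant weight layout_transforms away. Assumes mod/params come
    # from relay.frontend.from_pytorch(...).
    import tvm
    from tvm import relay

    mod["main"] = relay.build_module.bind_params_by_name(mod["main"], params)

    desired_layouts = {'qnn.conv2d': ['NHWC', 'HWIO'],
                       'nn.conv2d': ['NHWC', 'HWIO']}
    seq = tvm.transform.Sequential([
        relay.transform.RemoveUnusedFunctions(),
        relay.transform.ConvertLayout(desired_layouts),
        relay.transform.FoldConstant(),  # pre-computes layout_transform(constant weight)
    ])
    with tvm.transform.PassContext(opt_level=3):
        mod = seq(mod)
    print(mod)  # the per-weight layout_transform calls should now be gone from the dump

If you still see one layout_transform per conv in your own trace, check that the weights actually reach the compiler as constants (i.e. that params is passed to relay.build or bound as above) rather than remaining free variables.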