ConvertLayout for nn.bias_add Operation

Hello, I want to import a model from PyTorch into TVM and port it to my architecture, which only supports the NHWC data layout. I’m using the ConvertLayout transform to do this for qnn.conv2d. But the frontend inserts another layout_transform right after conv2d to change the layout back for the bias_add operation. Is there a way to get rid of this layout_transform? I expect all my data to be in NHWC format, as I want to merge bias_add and conv2d into one operation.

The IR looks like this:

fn (%layer1_input: Tensor[(1, 16, 16, 16), float32] /* ty=Tensor[(1, 16, 16, 16), float32] */, %c1_weight: Tensor[(16, 16, 3, 3), float32] /* ty=Tensor[(16, 16, 3, 3), float32] */, %c1_bias: Tensor[(16), float32] /* ty=Tensor[(16), float32] */) -> Tensor[(1, 16, 14, 14), uint8] {
  %0 = qnn.quantize(%layer1_input, 0.00786161f /* ty=float32 */, 0 /* ty=int32 */, out_dtype="uint8", axis=1) /* ty=Tensor[(1, 16, 16, 16), uint8] */;
  %1 = qnn.quantize(%c1_weight, meta[relay.Constant][0] /* ty=Tensor[(16), float32] */, 0 /* ty=int32 */, out_dtype="int8", axis=0) /* ty=Tensor[(16, 16, 3, 3), int8] */;
  %2 = layout_transform(%0, src_layout="NCHW", dst_layout="NHWC") /* ty=Tensor[(1, 16, 16, 16), uint8] */;
  %3 = layout_transform(%1, src_layout="OIHW", dst_layout="HWIO") /* ty=Tensor[(3, 3, 16, 16), int8] */;
  %4 = qnn.conv2d(%2, %3, 0 /* ty=int32 */, 0 /* ty=int32 */, 0.00786161f /* ty=float32 */, meta[relay.Constant][0] /* ty=Tensor[(16), float32] */, padding=[0, 0, 0, 0], channels=16, kernel_size=[3, 3], data_layout="NHWC", kernel_layout="HWIO", out_dtype="int32") /* ty=Tensor[(1, 14, 14, 16), int32] */;
  %5 = layout_transform(%4, src_layout="NHWC", dst_layout="NCHW") /* ty=Tensor[(1, 16, 14, 14), int32] */;
  %6 = qnn.quantize(%c1_bias, meta[relay.Constant][1] /* ty=Tensor[(16), float32] */, 0 /* ty=int32 */, out_dtype="int32", axis=0) /* ty=Tensor[(16), int32] */;
  %7 = nn.bias_add(%5, %6) /* ty=Tensor[(1, 16, 14, 14), int32] */;
  %8 = qnn.requantize(%7, meta[relay.Constant][2] /* ty=Tensor[(16), float32] */, 0 /* ty=int32 */, 0.0175786f /* ty=float32 */, 64 /* ty=int32 */, axis=1, out_dtype="int32") /* ty=Tensor[(1, 16, 14, 14), int32] */;
  %9 = clip(%8, a_min=0f, a_max=255f) /* ty=Tensor[(1, 16, 14, 14), int32] */;
  cast(%9, dtype="uint8") /* ty=Tensor[(1, 16, 14, 14), uint8] */
} /* ty=fn (Tensor[(1, 16, 16, 16), float32], Tensor[(16, 16, 3, 3), float32], Tensor[(16), float32]) -> Tensor[(1, 16, 14, 14), uint8] */

Given that the NHWC → NCHW transformation happens before qnn.quantize, our qnn.quantize op cannot operate on the NHWC layout at the moment. So the issue is not in bias_add.

To fix that, you need to add a so-called InferCorrectLayout function for this op. This is where, for example, you check the input layout and, if it is NHWC, modify the axis attribute of the op to account for the change in the input layout. You register such a function here: https://github.com/apache/tvm/blob/main/src/relay/qnn/op/quantize.cc#L182-L183. See how it is done for the qnn.requantize op: https://github.com/apache/tvm/blob/main/src/relay/qnn/op/requantize.cc#L563.
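
To make that concrete, here is a rough sketch of what such a hook could look like for qnn.quantize. This is illustrative only, not the actual upstream code: the function name, the guards, and the axis-remapping details are assumptions, and it would sit in src/relay/qnn/op/quantize.cc where the required headers and namespaces are already available.

// Sketch only: adjust the quantization axis when ConvertLayout changes the data layout.
InferCorrectLayoutOutput QnnQuantizeInferCorrectLayout(const Attrs& attrs,
                                                       const Array<Layout>& new_in_layouts,
                                                       const Array<Layout>& old_in_layouts,
                                                       const Array<tvm::relay::Type>& old_in_types) {
  // Copy the attrs so we can rewrite the quantization axis.
  const auto* attrs_ptr = attrs.as<QuantizeAttrs>();
  ICHECK(attrs_ptr);
  ObjectPtr<QuantizeAttrs> param = make_object<QuantizeAttrs>(*attrs_ptr);

  if (new_in_layouts.defined() && old_in_layouts.defined() && new_in_layouts[0].defined() &&
      old_in_layouts[0].defined()) {
    // Find where the old quantization axis (e.g. "C", axis 1 in NCHW) ends up in the
    // new layout (axis 3 in NHWC) and rewrite the attribute. Negative axes and split
    // layouts such as NCHW4c are not handled in this sketch.
    param->axis = new_in_layouts[0].IndexOf(old_in_layouts[0][param->axis]);
    // The data follows the new layout; scale and zero point stay 1-D "C" tensors.
    return InferCorrectLayoutOutput({new_in_layouts[0], Layout("C"), Layout("C")},
                                    {new_in_layouts[0]}, Attrs(param));
  }

  // Layouts are not known yet (this function can run several times): keep the old ones.
  Layout data_layout = old_in_layouts.defined() ? old_in_layouts[0] : Layout::Undef();
  return InferCorrectLayoutOutput({data_layout, Layout("C"), Layout("C")}, {data_layout},
                                  Attrs(param));
}

It then gets registered by adding .set_attr<FInferCorrectLayout>("FInferCorrectLayout", QnnQuantizeInferCorrectLayout) to the existing RELAY_REGISTER_OP("qnn.quantize") block, the same way requantize.cc does it for qnn.requantize.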

Thanks for the information! I’ll look into this and see if I can also spin this off as a pull request.

I’ve spent some time on this implementation. Is there a way to plug into gdb when debugging TVM? I don’t understand where the new_in_layouts and new_out_layouts come from and what they mean.

I think I would also need some information on where these Layouts originate from.

Unfortunately, the ConvertLayout pass and InferCorrectLayout are notorious for the opaqueness of their implementation, and we have had a lot of bugs from them. I don’t recommend trying to understand this code; instead, I’d start by just copy-pasting the InferCorrectLayout definition from another op that needs a similar transformation.

InferCorrectLayout is called from https://github.com/apache/tvm/blob/2cafa87b10c6124f1a08af7ead712f29b9039762/src/relay/transforms/transform_layout.h#L266, and new_in_layouts etc. come from there. new_in_layouts is the new layout the inputs to your op would have after the transformation (in your case NHWC), and old_in_layouts (note: not new_out_layouts) holds the original input layouts.

I don’t use gdb with TVM other than when debugging a segfault. I know some folks have successfully used it to set breakpoints etc., but I just use LOG(INFO) << .... You can try LOG(INFO) << new_in_layouts;.

I’ve tried copy-pasting the InferCorrectLayout from requantize.cc and adapting the parameters. But when I use LOG, both new_in_layouts and old_in_layouts show all elements as nullptr. Is that supposed to happen? It doesn’t seem like it.

I remember InferCorrectLayout runs multiple times, and on the first time you might see something like that. What happens if you do something like https://github.com/apache/tvm/blob/main/src/relay/qnn/op/requantize.cc#L115-L117 on such input?

I copied over the entire function from requantize.cc, so that part is already there. Here is the log output:

[08:36:45] tvm/src/relay/qnn/op/quantize.cc:120: new_in_layouts: (nullptr)
[08:36:45] tvm/src/relay/qnn/op/quantize.cc:121: old_in_layouts: [(nullptr), (nullptr), (nullptr)]
[08:36:45] tvm/src/relay/qnn/op/quantize.cc:122: input_layouts: [(nullptr), Layout(C), Layout(C), Layout(C), Layout(C)]
[08:36:45] tvm/src/relay/qnn/op/quantize.cc:123: output_layouts: [(nullptr)]
[08:36:45] tvm/src/relay/qnn/op/quantize.cc:120: new_in_layouts: (nullptr)
[08:36:45] tvm/src/relay/qnn/op/quantize.cc:121: old_in_layouts: [(nullptr), (nullptr), (nullptr)]
[08:36:45] tvm/src/relay/qnn/op/quantize.cc:122: input_layouts: [(nullptr), Layout(C), Layout(C), Layout(C), Layout(C)]
[08:36:45] tvm/src/relay/qnn/op/quantize.cc:123: output_layouts: [(nullptr)]
[08:36:45] tvm/src/relay/qnn/op/quantize.cc:120: new_in_layouts: (nullptr)
[08:36:45] tvm/src/relay/qnn/op/quantize.cc:121: old_in_layouts: [(nullptr), (nullptr), (nullptr)]
[08:36:45] tvm/src/relay/qnn/op/quantize.cc:122: input_layouts: [(nullptr), Layout(C), Layout(C), Layout(C), Layout(C)]
[08:36:45] tvm/src/relay/qnn/op/quantize.cc:123: output_layouts: [(nullptr)]

Since I copied over the version from requantize, this is the if-branch that is taken: https://github.com/apache/tvm/blob/main/src/relay/qnn/op/requantize.cc#L104-L112

If you get stuck, feel free to open an issue with a repro test. I can take a look.

I’ve opened an issue here: https://github.com/apache/tvm/issues/14835

Thank you for the help!

Sorry, I just realized that qnn.quantize operates on the bias, not the output of qnn.conv2d. So this is indeed an issue of bias_add.

Is there something I can do about this via the InferCorrectLayout function? I have tried writing something up for bias_add, but I don’t understand what I need to return from this function. I would expect that I can just pass back the old layouts like this:

InferCorrectLayoutOutput BiasAddInferCorrectLayout(const Attrs& attrs,
                                                   const Array<Layout>& new_in_layouts,
                                                   const Array<Layout>& old_in_layouts,
                                                   const Array<tvm::relay::Type>& old_in_types) {
  return InferCorrectLayoutOutput(old_in_layouts, old_in_layouts, attrs);
}

But that doesn’t get rid of the layout transformation, so that can’t be it.

OK, after some trial and error, I found that the following definition works: apache/tvm@4ea870b (“infer layout for bias add”).
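
The gist of it, sketched roughly below (the commit has the actual code, and the details may differ), is to remap the bias axis to wherever the channel dimension sits in the new data layout, instead of forcing the data back into the old layout:

// Rough sketch of a bias_add layout hook; see the commit for the real thing.
InferCorrectLayoutOutput BiasAddInferCorrectLayout(const Attrs& attrs,
                                                   const Array<Layout>& new_in_layouts,
                                                   const Array<Layout>& old_in_layouts,
                                                   const Array<tvm::relay::Type>& old_in_types) {
  const auto* attrs_ptr = attrs.as<BiasAddAttrs>();
  ICHECK(attrs_ptr);
  ObjectPtr<BiasAddAttrs> param = make_object<BiasAddAttrs>(*attrs_ptr);

  Layout data_layout = old_in_layouts.defined() ? old_in_layouts[0] : Layout::Undef();
  if (new_in_layouts.defined() && new_in_layouts[0].defined() && data_layout.defined()) {
    // Move the bias axis to wherever the channel dim is in the new layout,
    // e.g. axis 1 in NCHW becomes axis 3 in NHWC (negative axes not handled here).
    param->axis = new_in_layouts[0].IndexOf(data_layout[param->axis]);
    data_layout = new_in_layouts[0];
  }
  // Accept the (possibly new) data layout so ConvertLayout no longer inserts a
  // layout_transform in front of bias_add; the bias itself stays a 1-D "C" tensor.
  return InferCorrectLayoutOutput({data_layout, Layout("C")}, {data_layout}, Attrs(param));
}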

You are welcome to build on this and send a PR. The PR should come with test cases for NCHW → NHWC and the other direction as well.

To create the input mod without torch, you can use our text parser:

mod = tvm.relay.fromtext("""
#[version = "0.0.5"]
def @main(%layer1_input: Tensor[(1, 16, 16, 16), float32] /* span=aten::quantize_per_tensor_0.layer1_input:0:0 */, %c1_weight: Tensor[(16, 16, 3, 3), float32] /* span=quantized::conv2d_0:0:0 */, %c1_bias: Tensor[(16), float32] /* span=quantized::conv2d_0:0:0 */) {
  %0 = qnn.quantize(%layer1_input, 0.00392099f /* span=aten::quantize_per_tensor_0:0:0 */, 0 /* span=aten::quantize_per_tensor_0:0:0 */, out_dtype="uint8", axis=1) /* span=aten::quantize_per_tensor_0:0:0 */;
  %1 = qnn.quantize(%c1_weight, 0.00065264f /* span=quantized::conv2d_0:0:0 */, -1 /* span=quantized::conv2d_0:0:0 */, out_dtype="int8", axis=0) /* span=quantized::conv2d_0:0:0 */;
  %2 = qnn.conv2d(%0, %1, 0 /* span=quantized::conv2d_0:0:0 */, -1 /* span=quantized::conv2d_0:0:0 */, 0.00392099f /* span=quantized::conv2d_0:0:0 */, 0.00065264f /* span=quantized::conv2d_0:0:0 */, padding=[0, 0, 0, 0], channels=16, kernel_size=[3, 3], out_dtype="int32") /* span=quantized::conv2d_0:0:0 */;
  %3 = qnn.quantize(%c1_bias, 2.55899e-06f /* span=quantized::conv2d_0:0:0 */, 0 /* span=quantized::conv2d_0:0:0 */, out_dtype="int32", axis=0) /* span=quantized::conv2d_0:0:0 */;
  %4 = nn.bias_add(%2, %3) /* span=quantized::conv2d_0:0:0 */;
  %5 = qnn.requantize(%4, 2.55899e-06f /* span=quantized::conv2d_0:0:0 */, 0 /* span=quantized::conv2d_0:0:0 */, 0.00731119f /* span=quantized::conv2d_0:0:0 */, 121 /* span=quantized::conv2d_0:0:0 */, axis=1, out_dtype="int32") /* span=quantized::conv2d_0:0:0 */;
  %6 = clip(%5, a_min=0f, a_max=255f) /* span=quantized::conv2d_0:0:0 */;
  cast(%6, dtype="uint8") /* span=quantized::conv2d_0:0:0 */
}
""")