Hi, I’m trying to import pre-quantized ONNX and PyTorch models with the TVM frontends, and I noticed that some op patterns in the imported Relay module can be fused into a single qnn op.
For instance, an imported quantized ONNX ResNet-18 model:
def @main(%input: Tensor[(1, 3, 224, 224), float32]) -> Tensor[(1, 1000), float32] {
%0 = qnn.quantize(%input, 0.0186584f /* ty=float32 */, 114 /* ty=int32 */, out_dtype="uint8", axis=1) /* ty=Tensor[(1, 3, 224, 224), uint8] */;
%1 = qnn.conv2d(%0, meta[relay.Constant][0] /* ty=Tensor[(64, 3, 7, 7), uint8] */, 114 /* ty=int32 */, 128 /* ty=int32 */, 0.0186584f /* ty=float32 */, 0.00308922f /* ty=float32 */, strides=[2, 2], padding=[3, 3, 3, 3], channels=64, kernel_size=[7, 7], out_dtype="int32") /* ty=Tensor[(1, 64, 112, 112), int32] */;
%2 = nn.bias_add(%1, meta[relay.Constant][1] /* ty=Tensor[(64), int32] */) /* ty=Tensor[(1, 64, 112, 112), int32] */;
%3 = qnn.requantize(%2, 5.764e-05f /* ty=float32 */, 0 /* ty=int32 */, 0.0281708f /* ty=float32 */, 0 /* ty=int32 */, axis=0, out_dtype="uint8") /* ty=Tensor[(1, 64, 112, 112), uint8] */;
%4 = nn.max_pool2d(%3, pool_size=[3, 3], strides=[2, 2], padding=[1, 1, 1, 1]) /* ty=Tensor[(1, 64, 56, 56), uint8] */;
%5 = qnn.conv2d(%4, meta[relay.Constant][2] /* ty=Tensor[(64, 64, 3, 3), uint8] */, 0 /* ty=int32 */, 128 /* ty=int32 */, 0.0281708f /* ty=float32 */, 0.00293729f /* ty=float32 */, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3], out_dtype="int32") /* ty=Tensor[(1, 64, 56, 56), int32] */;
%6 = nn.bias_add(%5, meta[relay.Constant][3] /* ty=Tensor[(64), int32] */) /* ty=Tensor[(1, 64, 56, 56), int32] */;
%7 = qnn.requantize(%6, 8.27459e-05f /* ty=float32 */, 0 /* ty=int32 */, 0.0205264f /* ty=float32 */, 0 /* ty=int32 */, axis=0, out_dtype="uint8") /* ty=Tensor[(1, 64, 56, 56), uint8] */;
%8 = qnn.conv2d(%7, meta[relay.Constant][4] /* ty=Tensor[(64, 64, 3, 3), uint8] */, 0 /* ty=int32 */, 128 /* ty=int32 */, 0.0205264f /* ty=float32 */, 0.00604534f /* ty=float32 */, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3], out_dtype="int32") /* ty=Tensor[(1, 64, 56, 56), int32] */;
%9 = nn.bias_add(%8, meta[relay.Constant][5] /* ty=Tensor[(64), int32] */) /* ty=Tensor[(1, 64, 56, 56), int32] */;
%10 = qnn.requantize(%9, 0.000124089f /* ty=float32 */, 0 /* ty=int32 */, 0.0459817f /* ty=float32 */, 151 /* ty=int32 */, axis=0, out_dtype="uint8") /* ty=Tensor[(1, 64, 56, 56), uint8] */;
%11 = qnn.dequantize(%10, 0.0459817f /* ty=float32 */, 151 /* ty=int32 */) /* ty=Tensor[(1, 64, 56, 56), float32] */;
%12 = qnn.dequantize(%4, 0.0281708f /* ty=float32 */, 0 /* ty=int32 */) /* ty=Tensor[(1, 64, 56, 56), float32] */;
%13 = add(%11, %12) /* ty=Tensor[(1, 64, 56, 56), float32] */;
%14 = qnn.quantize(%13, 0.0278099f /* ty=float32 */, 0 /* ty=int32 */, out_dtype="uint8") /* ty=Tensor[(1, 64, 56, 56), uint8] */;
...
where, around %13, the pattern quant(add(dequant(a), dequant(b))) can be fused into a single qnn.add op.
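To make the equivalence around %13 concrete, here is a small pure-Python sketch (not TVM code; it uses exact fractions and the standard affine quantize/dequantize formulas, glossing over the exact rounding mode of the real qnn ops) checking that the three-op pattern computes the same values as qnn.add-style arithmetic on the quantized operands:

```python
from fractions import Fraction

def dequantize(q, scale, zp):
    # affine dequantization: real = scale * (q - zero_point)
    return scale * (q - zp)

def quantize(r, scale, zp, qmin=0, qmax=255):
    # affine quantization: q = round(real / scale) + zero_point, saturated to uint8
    return min(max(round(r / scale) + zp, qmin), qmax)

def unfused_add(a, b, a_s, a_zp, b_s, b_zp, o_s, o_zp):
    # the imported pattern: quantize(add(dequantize(a), dequantize(b)))
    return quantize(dequantize(a, a_s, a_zp) + dequantize(b, b_s, b_zp), o_s, o_zp)

def fused_add(a, b, a_s, a_zp, b_s, b_zp, o_s, o_zp):
    # qnn.add-style arithmetic: rescale both operands into the output
    # quantization parameters and accumulate, with a single final round
    acc = (a_s / o_s) * (a - a_zp) + (b_s / o_s) * (b - b_zp)
    return min(max(round(acc) + o_zp, 0), 255)

# scales/zero-points taken from %10, %4 and %14 above (exact fractions, so
# both paths are bit-identical; float round-off is a separate question)
qp = (Fraction("0.0459817"), 151, Fraction("0.0281708"), 0, Fraction("0.0278099"), 0)
for a in (0, 37, 151, 255):
    for b in (0, 80, 200, 255):
        assert unfused_add(a, b, *qp) == fused_add(a, b, *qp)
```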
Another example: an imported quantized PyTorch ResNet-18 model:
def @main(%input: Tensor[(1, 3, 224, 224), float32]) -> Tensor[(1, 1000), float32] {
%0 = qnn.quantize(%input, 0.018622f, 114, out_dtype="uint8", axis=1);
%1 = nn.pad(%0, 114f, pad_width=[[0, 0], [0, 0], [3, 3], [3, 3]]);
%2 = qnn.conv2d(%1, %conv1_weight, 114, 0, 0.018622f, 0.00308922f, strides=[2, 2], padding=[0, 0, 0, 0], channels=64, kernel_size=[7, 7], out_dtype="int32");
%3 = nn.bias_add(%2, %conv1_bias);
%4 = qnn.requantize(%3, 5.75275e-05f, 0, 0.0146432f, 0, axis=1, out_dtype="int32");
%5 = clip(%4, a_min=0f, a_max=255f);
%6 = cast(%5, dtype="uint8");
...
where the chain requantize: int32 -> clip(0, 255) -> cast: uint8 can be fused into a single qnn.requantize op.
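Similarly for the PyTorch case, a toy pure-Python check (standard requantize formula with exact fractions; not TVM code) that folding the saturation into requantize reproduces the explicit clip + cast:

```python
from fractions import Fraction

def requantize(q, in_scale, in_zp, out_scale, out_zp, qmin=None, qmax=None):
    # q_out = round((q - in_zp) * in_scale / out_scale) + out_zp,
    # saturated to [qmin, qmax] when an output dtype range is given
    v = round((q - in_zp) * in_scale / out_scale) + out_zp
    if qmin is not None:
        v = min(max(v, qmin), qmax)
    return v

# input/output scales taken from %4 above
in_s, out_s = Fraction("0.0000575275"), Fraction("0.0146432")
for acc in (-100000, -5, 0, 1234, 500000, 10**6):
    # the imported chain: requantize to int32, then clip(0, 255), then cast
    unfused = min(max(requantize(acc, in_s, 0, out_s, 0), 0), 255)
    # fused: requantize straight to uint8, saturating to the dtype range
    fused = requantize(acc, in_s, 0, out_s, 0, qmin=0, qmax=255)
    assert unfused == fused
```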
Of course I can capture these patterns during BYOC partitioning, but then I have to handle them in the C++ codegen. Is there an easy way to do such op fusions in Python? I tried DFPatternCallback with tvm.relay.dataflow_pattern.rewrite, but they do not rewrite all matched patterns in a module.