About the output_scale and output_zero_point of qnn.requantize being limited to scalars

The following code is extracted from requantize.cc:

ICHECK(IsScalarType(types[3], DataType::Float(32)));  // output_scale
ICHECK(IsScalarType(types[4], DataType::Int(32)));    // output_zero_point

The output_scale and output_zero_point of qnn.requantize are limited to scalars, but in reality I have met situations where both the output scale and zero point were tensors.
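For context, here is a minimal sketch (not from my original model, just an assumption of the simplest trigger) of the kind of call that trips those checks, using a per-channel output scale/zero point:

import numpy as np
import tvm
from tvm import relay

# Build a requantize whose output_scale / output_zero_point are (32,) tensors.
x = relay.var("x", shape=(1, 32, 56, 56), dtype="int32")
in_scale = relay.const(np.full((32,), 0.5, dtype="float32"))
in_zp = relay.const(np.zeros((32,), dtype="int32"))
out_scale = relay.const(np.full((32,), 0.25, dtype="float32"))  # tensor, not scalar
out_zp = relay.const(np.zeros((32,), dtype="int32"))            # tensor, not scalar
y = relay.qnn.op.requantize(x, in_scale, in_zp, out_scale, out_zp, axis=1, out_dtype="int8")
mod = tvm.IRModule.from_expr(relay.Function([x], y))
mod = relay.transform.InferType()(mod)  # fails on the IsScalarType checks quoted above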

I can’t understand why the output scale must be a scalar.

We don’t support per-channel quantization for activations. Per-channel is only supported for weights.

I think this convention applies to most DL frameworks and this is also what various quantization white papers recommend. What exactly do you mean by “met in reality”?
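To illustrate the convention, a sketch (my own, not from the original reply): activations take a single scalar scale/zero point, while weights can carry one scale per output channel via axis:

import numpy as np
from tvm import relay

act = relay.var("act", shape=(1, 1, 112, 112), dtype="float32")
w = relay.var("w", shape=(32, 1, 3, 3), dtype="float32")

# Per-tensor quantization for the activation: scalar scale and zero point.
act_q = relay.qnn.op.quantize(act, relay.const(0.02), relay.const(0), out_dtype="int8")

# Per-channel quantization for the weight: one scale per output channel, axis=0.
w_q = relay.qnn.op.quantize(
    w,
    relay.const(np.full((32,), 0.01, dtype="float32")),
    relay.const(np.zeros((32,), dtype="int32")),
    axis=0,
    out_dtype="int8",
)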

In this case:

free_var %input: Tensor[(1, 1, 112, 112), float32] /* ty=Tensor[(1, 1, 112, 112), float32] */;
%0 = subtract(%input, meta[relay.Constant][0] /* ty=Tensor[(1, 1, 1, 1), float32] */) /* ty=Tensor[(1, 1, 112, 112), float32] */;
%1 = subtract(%0, meta[relay.Constant][1] /* ty=Tensor[(1), float32] */) /* ty=Tensor[(1, 1, 112, 112), float32] */;
%2 = qnn.quantize(%1, 1f /* ty=float32 */, 0 /* ty=int32 */, out_dtype="int8") /* ty=Tensor[(1, 1, 112, 112), int8] */;
%3 = clip(%2, a_min=-128f, a_max=127f) /* ty=Tensor[(1, 1, 112, 112), int8] */;
%4 = qnn.quantize(meta[relay.Constant][2] /* ty=Tensor[(32, 1, 3, 3), float32] */, 1f /* ty=float32 */, 0 /* ty=int32 */, out_dtype="int8", axis=0) /* ty=Tensor[(32, 1, 3, 3), int8] */;
%5 = multiply(3.8147e-06f /* ty=float32 */, meta[relay.Constant][3] /* ty=Tensor[(32), float32] */) /* ty=Tensor[(32), float32] */;
%6 = qnn.conv2d(%3, %4, 0 /* ty=int32 */, 0 /* ty=int32 */, 1f /* ty=float32 */, %5, strides=[2, 2], padding=[1, 1, 1, 1], channels=32, kernel_size=[3, 3], out_dtype="int32");
%7 = qnn.quantize(meta[relay.Constant][4] /* ty=Tensor[(32), float32] */, 1f /* ty=float32 */, 0 /* ty=int32 */, out_dtype="int32") /* ty=Tensor[(32), int32] */;
nn.bias_add(%6, %7) /* ty=Tensor[(1, 32, 56, 56), float32] */

quantize(1) + dequantize*2 + conv2d + quantize(2) + bias:

The conv2d produces a tensor affine type whose scale is x_t.scale * w_t.scale. When the conv2d is per-channel, that scale is itself a tensor rather than a scalar, so when the node following the conv2d is a quantize node (such as quantize(2)), we need to insert a requantize node, and the output scale of that requantize node is the conv2d's affine-type scale.
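Roughly what the nn.conv2d rewrite in fake_quantization_to_integer.py does with the scales (a simplified sketch using the names from that file; exact details may vary across TVM versions):

# Simplified sketch of how the conv2d's affine type is built
# (based on the nn.conv2d rewrite in fake_quantization_to_integer.py).
conv_scale = fold_constant(x_t.scale * w_t.scale)  # a (32,) tensor when the weight is per-channel
conv_zp = relay.const(0, "int32")
# The conv2d result is then tracked with this per-channel affine type, so any
# requantize that brings another operand (e.g. the bias) onto the conv2d's scale
# gets a tensor as its output_scale.
out_t = TensorAffineType(conv_scale, conv_zp, "int32", 1)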

You can see this in the code:

@register_fake_quantization_to_integer("nn.bias_add")
def bias_add(expr, type_map):
    """Rewrite a bias_add op"""
    x, b = expr.args
    x_t = type_map[x]
    if b in type_map:
        # Ensure bias matches the previous op
        b_t = type_map[b]
        in_scale = fold_constant(x_t.scale)
        in_zero_point = fold_constant(x_t.zero_point)
        if not (
            approx_equal(x_t.scale, b_t.scale)
            and approx_equal(x_t.zero_point, b_t.zero_point)
            and tvm.ir.structural_equal(x_t.dtype, b_t.dtype)
        ):
            b = relay.qnn.op.requantize(
                b,
                b_t.scale,
                b_t.zero_point,
                in_scale,
                in_zero_point,
                out_dtype=x_t.dtype,
                axis=0,
            )
    else:
        # If the bias is a constant, we need to quantize it
        assert isinstance(b, relay.expr.Constant)
        assert b.checked_type.dtype in ["float32", "float64", "float16", "bfloat16"]
        b = relay.qnn.op.quantize(b, x_t.scale, x_t.zero_point, axis=0, out_dtype=x_t.dtype)
    out = relay.op.nn.bias_add(x, b, **expr.attrs)
    return [out, x_t]

################

b = relay.qnn.op.requantize(
    b,
    b_t.scale,
    b_t.zero_point,
    in_scale,
    in_zero_point,
    out_dtype=x_t.dtype,
    axis=0,
)

and the output scale of this requantize is in_scale, which in this graph is the conv2d's per-channel scale (%5), i.e. a (32,) tensor.

I think qnn.add, qnn.mul, etc. should accept tensors as input or output scales to avoid such cases.
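For example, something like the following is the kind of call this would enable (hypothetical; it is rejected today because qnn.add requires scalar scales and zero points):

import numpy as np
from tvm import relay

lhs = relay.var("lhs", shape=(1, 32, 56, 56), dtype="int8")
rhs = relay.var("rhs", shape=(1, 32, 56, 56), dtype="int8")
per_channel_scale = relay.const(np.full((32,), 0.1, dtype="float32"))
per_channel_zp = relay.const(np.zeros((32,), dtype="int32"))

# Hypothetical per-channel qnn.add; currently fails type checking because the
# scales and zero points must be scalars.
out = relay.qnn.op.add(
    lhs, rhs,
    per_channel_scale, per_channel_zp,  # lhs scale / zero point
    per_channel_scale, per_channel_zp,  # rhs scale / zero point
    per_channel_scale, per_channel_zp,  # output scale / zero point
)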