[Quantization] How to expose 'ndom_scale', 'nclip_min' & 'nclip_max' to TOPI or CodeGen

A question for those of you working on data-aware quantization: how would I pass down, or access, the following values during CodeGen for my backend?

From _calibrate.py

            const_params[ndom_scale] = _make_const(scale / valid_range)
            const_params[nclip_min] = _make_const(- (valid_range - 1))
            const_params[nclip_max] = _make_const((valid_range - 1))
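
(With 8-bit quantization, valid_range here is 128, so these evaluate to ndom_scale = scale / 128, nclip_min = -127 and nclip_max = 127, matching the clip(-127, 127) that shows up in realized graphs.)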

My backend’s source code must be aware of these values for each tensor; accessing them within TOPI would be good enough too.

@vinx13 @masahi @ziheng

P.S. Thanks for the quantization tutorial @vinx13!

Accessing these values in codegen is not easy. During the realize pass, arithmetic operations (mul, add, …) involving these values are inserted for quantize/requantize. Relay's FoldConstant pass then folds constant expressions, so some of these values are no longer accessible in TOPI or codegen.
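
For example (a hand-written sketch, not an actual dump): after realize, the weight-quantization chain is explicit,

    %w = cast(clip(round(multiply(%weight, %ndom_scale)), %nclip_min, %nclip_max), dtype="int8")

but after binding the calibration constants and running FoldConstant it collapses into a single pre-quantized constant,

    %w = meta[relay.Constant][0] /* already-quantized weight */

and the scale and clip bounds no longer appear anywhere in the graph.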

@vinx13 Maybe TensorNode should be extended to include these fields? Or maybe we should have a QuantizedTensorNode type where we can store this information per tensor. We would then be able to access this data from TOPI, correct?

It is possible; you would need to modify the realize pass to save the additional info. For example, instead of inserting mul/add in realize, we can insert quantize/requantize operators into the Relay IR directly and then lower those operators to TOPI.
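
A minimal sketch of what such a lowering could look like, assuming a hypothetical QuantizeAttrs node that carries ndom_scale, nclip_min and nclip_max (illustrative only, not code from the quantization pass):

.set_attr<FTVMCompute>("FTVMCompute",
                       [](const Attrs& attrs, const Array<Tensor>& inputs,
                          const Type& out_type, const Target& target) -> Array<Tensor> {
                         // Do the quantize arithmetic in TOPI so the constants stay
                         // visible as op attributes instead of folded-away expressions.
                         const auto* p = attrs.as<QuantizeAttrs>();
                         Tensor scaled = topi::multiply(
                             inputs[0], tvm::make_const(Float(32), 1.0 / p->ndom_scale));
                         Tensor clipped = topi::clip(topi::round(scaled),
                                                     tvm::make_const(Float(32), p->nclip_min),
                                                     tvm::make_const(Float(32), p->nclip_max));
                         return {topi::cast(clipped, Int(8))};
                       });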

I think that approach sounds okay for now; it would require adding those two Relay ops. Could another approach be to insert some kind of annotation that only carries the quantize info from the previous op? Though I’m not sure I’ve seen annotations passed all the way down to TOPI before.

Eventually I think the tensors themselves should record how they were quantized. The reason is that an operator may have many input tensors quantized in different ways, and in the future different regions of a large tensor might be quantized differently.

The other approach should also work. You can add an annotation op in Relay and TOPI; in TOPI, annotation ops are simply the identity op.

Okay, my workaround so far produces the following IR. The annotation.quantize_info call carries ndom_scale, nclip_min & nclip_max:

def @main(%X: Tensor[(1, 8), float32]) -> Tensor[(1, 8), float32] {
  %0 = multiply(%X, 8429.23f /* ty=float32 */) /* ty=Tensor[(1, 8), float32] */;
  %1 = round(%0) /* ty=Tensor[(1, 8), float32] */;
  %2 = clip(%1, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 8), float32] */;
  %3 = cast(%2, dtype="int8") /* ty=Tensor[(1, 8), int8] */;
  %4 = nn.dense(%3, meta[relay.Constant][0] /* ty=Tensor[(8, 8), int8] */, units=None, out_dtype="int8") /* ty=Tensor[(1, 8), int8] */;
  %5 = on_device(%4, meta[relay.attrs.OnDeviceAttrs][0]) /* ty=Tensor[(1, 8), int8] */;
  %6 = annotation.quantize_info(%5, meta[relay.attrs.QuantizeInfoAttrs][0]) /* ty=Tensor[(1, 8), int8] */;
  %7 = on_device(%6, meta[relay.attrs.OnDeviceAttrs][1]) /* ty=Tensor[(1, 8), int8] */;
  %8 = cast(%7, dtype="float32") /* ty=Tensor[(1, 8), float32] */;
  multiply(%8, 4.63417e-07f /* ty=float32 */) /* ty=Tensor[(1, 8), float32] */
}

My op:
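
(The QuantizeInfoAttrs node referenced below isn't shown in this post; a minimal sketch of its definition, following TVM's standard AttrsNode pattern with the field names used in this thread:)

struct QuantizeInfoAttrs : public tvm::AttrsNode<QuantizeInfoAttrs> {
  double ndom_scale;
  int nclip_min;
  int nclip_max;

  TVM_DECLARE_ATTRS(QuantizeInfoAttrs, "relay.attrs.QuantizeInfoAttrs") {
    TVM_ATTR_FIELD(ndom_scale).describe("Scale of the quantized domain.");
    TVM_ATTR_FIELD(nclip_min).describe("Lower clip bound in the quantized domain.");
    TVM_ATTR_FIELD(nclip_max).describe("Upper clip bound in the quantized domain.");
  }
};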

// relay.annotation.quantize_info
TVM_REGISTER_NODE_TYPE(QuantizeInfoAttrs);

RELAY_REGISTER_OP("annotation.quantize_info")
.describe(R"code(Annotate an expression with it's quantization info)code" TVM_ADD_FILELINE)
.set_num_inputs(1)
.add_argument("data", "Tensor", "The input data.")
.add_type_rel("Identity", IdentityRel)
.set_support_level(10)
.set_attr<TOpPattern>("TOpPattern", kElemWise)
.set_attr<TOpIsStateful>("TOpIsStateful", false)
.set_attr<FInferCorrectLayout>("FInferCorrectLayout", ElemwiseArbitraryLayout)
.set_attr<FTVMCompute>("FTVMCompute",
                       [](const Attrs& attrs, const Array<Tensor>& inputs,
                          const Type& out_dtype, const Target& target) -> Array<Tensor> {
                         return {topi::identity(inputs[0])};
                       });

Expr QuantizeInfo(Expr data, double ndom_scale, int nclip_min, int nclip_max) {
  static const Op& op = Op::Get("annotation.quantize_info");
  auto attrs = make_node<QuantizeInfoAttrs>();
  attrs->ndom_scale = ndom_scale;
  attrs->nclip_min = nclip_min;
  attrs->nclip_max = nclip_max;
  return CallNode::make(op, {data}, Attrs(attrs), {});
}

TVM_REGISTER_API("relay.op.annotation._make.quantize_info")
.set_body_typed<Expr(Expr, double, int, int)>([](Expr data, double ndom_scale, int nclip_min, int nclip_max) {
    return QuantizeInfo(data, ndom_scale, nclip_min, nclip_max);
});

It gets fused with nn.dense, but the attributes appear empty.

Edit: I was able to resolve this by attaching the attributes to the topi::identity call; I can now access the values as expected in my backend.
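
In case it helps others, the shape of that fix (a sketch: build the identity compute by hand so the values land on the ComputeOpNode's attrs map, via the compute() overload that accepts an attrs Map; the exact code may differ):

.set_attr<FTVMCompute>("FTVMCompute",
                       [](const Attrs& attrs, const Array<Tensor>& inputs,
                          const Type& out_type, const Target& target) -> Array<Tensor> {
                         const auto* param = attrs.as<QuantizeInfoAttrs>();
                         // Stash the quantization info on the operation itself so a
                         // backend can later read it from the ComputeOpNode's attrs map.
                         Map<std::string, NodeRef> op_attrs;
                         op_attrs.Set("ndom_scale", tvm::make_const(Float(64), param->ndom_scale));
                         op_attrs.Set("nclip_min", tvm::make_const(Int(32), param->nclip_min));
                         op_attrs.Set("nclip_max", tvm::make_const(Int(32), param->nclip_max));
                         Tensor data = inputs[0];
                         return {tvm::compute(
                             data->shape,
                             [&](const Array<Var>& i) { return data(i); },  // identity body
                             "quantize_info", topi::kElementWise, op_attrs)};
                       });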

Hi, has this work made it into the main branch yet?