Quantization Configuration Documentation?

Hi,

I was wondering if someone could clarify the meaning and impact of the different quantization parameters, in particular store_lowbit_output and global_scale. Also, what are the valid values for each parameter? Unfortunately, this does not seem to be documented, or at least I have not been able to find it.

    _node_defaults = {
        "nbit_input": 8,
        "nbit_weight": 8,
        "nbit_activation": 32,
        "dtype_input": "int8",
        "dtype_weight": "int8",
        "dtype_activation": "int32",
        "global_scale": 8.0,
        "skip_conv_layers": [0],
        "round_for_shift": True,
        "store_lowbit_output": True,
        "debug_enabled_ops": None,
    }
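For reference, these defaults can be overridden through the qconfig context manager before quantizing. A minimal sketch, assuming the relay.quantize API at the linked commit (here `graph` and `params` are assumed to come from an earlier frontend import):

    from tvm.relay import quantize as qtz

    # Sketch: override selected defaults through the qconfig context manager.
    # `graph` and `params` are assumed to come from an earlier frontend import.
    with qtz.qconfig(nbit_input=8,
                     nbit_weight=8,
                     global_scale=8.0,
                     skip_conv_layers=[0],       # skip quantizing the first conv2d
                     store_lowbit_output=True):
        qgraph = qtz.quantize(graph, params=params)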

Thanks

@vinx13 Could you please clarify this for me? Or is there any documentation about it?

Global scale is an alternative to the dom_scale obtained via calibration.
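To illustrate what a domain scale does, here is a conceptual sketch (plain numpy, not TVM's actual implementation): values are divided by the scale, rounded, clipped to the n-bit range, and multiplied back, which simulates the quantization error:

    import numpy as np

    def simulated_quantize(x, dom_scale, nbit=8):
        """Conceptual sketch of n-bit simulated quantization with a domain scale."""
        qmin, qmax = -2 ** (nbit - 1), 2 ** (nbit - 1) - 1
        q = np.clip(np.round(x / dom_scale), qmin, qmax)
        return q * dom_scale  # dequantized value, carrying the rounding/clipping error

With global_scale, the scales are derived from one fixed configured value; with calibration, a per-node dom_scale is derived from data.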

Thanks! I missed those comments in that file. Still, the difference between dtype_input and nbit_input is not clear to me, and the same goes for weights and activations. In other words, what is the difference between the number of bits and the data type in each case?

When I set store_lowbit_output to False, the first conv2d in my model is not quantized. Could you please tell me why that is related to store_lowbit_output?

The first conv2d layer is not quantized by default; this is controlled by the option skip_conv_layers (or skip_k_conv in older versions). It has nothing to do with store_lowbit_output.

In the current implementation they are the same, but the separate dtype option offers the possibility to use other dtypes such as uint8.
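For example, a hypothetical configuration keeping 8 bits but switching the input dtype to unsigned (whether a given backend schedule actually supports uint8 is a separate question):

    from tvm.relay import quantize as qtz

    # Same 8 bits, but an unsigned input dtype (assumption: backend support).
    with qtz.qconfig(nbit_input=8, dtype_input="uint8"):
        qgraph = qtz.quantize(graph, params=params)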

Are the nbit_* arguments the maximum number of bits used, or a fixed number of bits used for all weights, inputs, and activations?

Or does the calibration step decide the number of bits in each case? If so, how can I observe the number of bits selected after the realize step?

It is a fixed number of bits. Calibration uses it to determine the maximum representable value of the data type.
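Concretely, for a signed n-bit type the representable range is [-2^(n-1), 2^(n-1) - 1], so calibration only has to pick scales that map the observed values into that fixed range. A quick check:

    def signed_range(nbit):
        """Representable range of a signed nbit integer type."""
        return -2 ** (nbit - 1), 2 ** (nbit - 1) - 1

    print(signed_range(8))   # (-128, 127)  -> int8 inputs/weights
    print(signed_range(32))  # (-2147483648, 2147483647) -> int32 activations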

This is very strange, because only with store_lowbit_output=False do I see the first Conv2d not quantized:

WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (..., 'float32'), (..., 'float32')

With store_lowbit_output=True I see the following:

WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d', (..., 'int8'), (..., 'int8')

Note that in both cases I used skip_k_conv=None

Is there any way to visualize the number of bits finally selected by calibration?

#bits is a fixed number in your config.

Can you print the quantized result and confirm whether the first layer is quantized?

    net = relay.quantize(…)
    print(net)

No, it is not being quantized:

%2 = nn.conv2d(%1, meta[relay.Constant][0] /* ty=Tensor[(....), float32] ......

There might be something wrong with annotation. Can you check the result of the annotate pass?

What is the best way to do that?

Dump the result of Annotate in https://github.com/dmlc/tvm/blob/b3f3ab5593c1949947c9872c8df1479975116a95/python/tvm/relay/quantize/quantize.py#L360 and check whether conv2d_rewrite (https://github.com/dmlc/tvm/blob/b3f3ab5593c1949947c9872c8df1479975116a95/python/tvm/relay/quantize/_annotate.py#L164) works for the first layer.
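One way to inspect that without editing the library is a sketch like the following, assuming annotate() is exposed as a module-level function in quantize.py at that commit; if it is not, a temporary print(graph) inside quantize() right after the annotate step achieves the same:

    from tvm.relay import quantize as qtz

    # Sketch: dump the annotated graph and look for simulated_quantize
    # nodes around the first conv2d. `graph` is the optimized float graph.
    with qtz.qconfig(skip_conv_layers=None):
        annotated = qtz.annotate(graph)
    print(annotated)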

I see in https://github.com/dmlc/tvm/blob/b3f3ab5593c1949947c9872c8df1479975116a95/python/tvm/relay/quantize/quantize.py#L360 that skip_k_conv=[0] although I set it to None, so it is not reading this argument correctly and is using the default instead.

I also tried setting it to skip another Conv2D layer, but that does not work either.

I found the error: I was using skip_k_conv instead of skip_conv_layers. I guess the name changed at some point.

Another question: What is then the purpose of calibration?

It calculates dom_scale and other parameters for each simulated_quantize node.
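As a conceptual sketch of what that means (not TVM's actual calibration code): run representative inputs through the annotated graph, record the value ranges seen at each simulated_quantize node, and derive a per-node scale from them, for example via the maximum absolute value:

    import numpy as np

    def calibrate_scale(samples, nbit=8):
        """Pick a dom_scale so the observed values fit into a signed nbit range.

        `samples` is a list of numpy arrays observed at one simulated_quantize
        node while running a calibration dataset through the annotated graph.
        """
        max_abs = max(np.abs(s).max() for s in samples)
        qmax = 2 ** (nbit - 1) - 1
        return max_abs / qmax  # one scale per node, instead of one global_scale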