For bare-metal devices, it is desirable (for both space and performance reasons) to have a network that consists entirely of integral data types (most often int8). However, the automatic integer quantization mechanism in Relay does not serve this use case for two reasons:
Inputs are assumed to be float32, so they are quantized at the network’s prefix, and outputs are forced into float32, so they are dequantized at the network’s suffix.
The quantization pass is geared towards only the most time-consuming operators (e.g., conv2d and dense), leaving many others in float32.
We propose two improvements to automatic integer quantization that address these problems: quantize/dequantize partitioning and expanded operator coverage.
Quantize/Dequantize Partitioning (WIP PR)
This feature adds a configuration parameter partition_conversions to Relay’s quantize API that specifies whether to partition a quantized module into a module with the following functions:

quantize_inputs: converts inputs into the quantized data space
quantized_main: runs the core network, which contains only quantized operators
dequantize_outputs: converts outputs into the unquantized data space
main: calls quantize_inputs, quantized_main, and dequantize_outputs in succession, resulting in behavior equivalent to a quantized module that has not been partitioned.
If there are unquantized operators in the core network, an exception is raised. The default value of partition_conversions is False.
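The partitioning rule can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not Relay's implementation: a module is modeled as a flat list of (operator name, output dtype) pairs, and the names is_integral, partition_sketch, and example_ops are hypothetical, invented for this illustration.

```python
def is_integral(dtype):
    """Return True for integer dtypes such as int8, int32, or uint8."""
    return dtype.startswith(("int", "uint"))


def partition_sketch(ops):
    """Split a flat list of (op_name, out_dtype) pairs into (prefix, core, suffix).

    Assumption: the prefix ends at the first op that produces an integral
    value, and the suffix starts right after the last op that does so.
    """
    int_positions = [i for i, (_, dtype) in enumerate(ops) if is_integral(dtype)]
    if not int_positions:
        raise ValueError("module contains no quantized operators")
    first, last = int_positions[0], int_positions[-1]
    prefix = ops[: first + 1]          # float32 ops plus the cast into int space
    core = ops[first + 1 : last + 1]   # must be integral throughout
    suffix = ops[last + 1 :]           # cast back to float32 and rescale
    unquantized = [name for name, dtype in core if not is_integral(dtype)]
    if unquantized:
        raise ValueError(f"unquantized operators in core network: {unquantized}")
    return prefix, core, suffix


# Output dtypes of a small quantized conv2d module (illustrative data).
example_ops = [
    ("multiply", "float32"), ("round", "float32"), ("clip", "float32"),
    ("cast", "int8"),
    ("nn.conv2d", "int32"), ("add", "int32"), ("right_shift", "int32"),
    ("clip", "int32"), ("cast", "int8"), ("annotation.stop_fusion", "int8"),
    ("cast", "float32"), ("multiply", "float32"),
]

prefix, core, suffix = partition_sketch(example_ops)
print(len(prefix), len(core), len(suffix))
```

In this model, placing a float32 operator such as nn.pad anywhere inside the core makes partition_sketch raise, mirroring the exception described above.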
As an example of this feature in action, consider the module below:
def @main(%x: Tensor[(1, 4, 16, 16), float32], %w: Tensor[(4, 4, 3, 3), float32]) -> Tensor[(1, 4, 16, 16), float32] {
nn.conv2d(%x, %w, padding=[1, 1, 1, 1], channels=4, kernel_size=[3, 3])
}
After quantization, we see three distinct sections of the function (input quantization, core int8 network, and output dequantization), separated below by blank lines.
def @main(%x: Tensor[(1, 4, 16, 16), float32]) -> Tensor[(1, 4, 16, 16), float32] {
%0 = multiply(%x, 16f) /* ty=Tensor[(1, 4, 16, 16), float32] */;
%1 = round(%0) /* ty=Tensor[(1, 4, 16, 16), float32] */;
%2 = clip(%1, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 4, 16, 16), float32] */;
%3 = cast(%2, dtype="int8") /* ty=Tensor[(1, 4, 16, 16), int8] */;

%4 = nn.conv2d(
%3,
meta[relay.Constant][0],
padding=[1, 1, 1, 1],
channels=4,
kernel_size=[3, 3],
out_dtype="int32") /* ty=Tensor[(1, 4, 16, 16), int32] */;
%5 = add(%4, meta[relay.Constant][1]) /* ty=Tensor[(1, 4, 16, 16), int32] */;
%6 = right_shift(%5, meta[relay.Constant][2]) /* ty=Tensor[(1, 4, 16, 16), int32] */;
%7 = clip(%6, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 4, 16, 16), int32] */;
%8 = cast(%7, dtype="int8") /* ty=Tensor[(1, 4, 16, 16), int8] */;
%9 = annotation.stop_fusion(%8) /* ty=Tensor[(1, 4, 16, 16), int8] */;

%10 = cast(%9, dtype="float32") /* ty=Tensor[(1, 4, 16, 16), float32] */;
multiply(%10, 0.0625f) /* ty=Tensor[(1, 4, 16, 16), float32] */
}
If partition_conversions == True, then the module above is converted to the module below.
def @quantize_inputs(%x: Tensor[(1, 4, 16, 16), float32]) -> (Tensor[(1, 4, 16, 16), int8],) {
%0 = multiply(%x, 16f);
%1 = round(%0);
%2 = clip(%1, a_min=-127f, a_max=127f);
(cast(%2, dtype="int8"),)
}
def @quantized_main(%x: Tensor[(1, 4, 16, 16), int8]) -> Tensor[(1, 4, 16, 16), int8] {
%0 = nn.conv2d(
%x,
meta[relay.Constant][0],
padding=[1, 1, 1, 1],
channels=4,
kernel_size=[3, 3],
out_dtype="int32");
%1 = add(%0, meta[relay.Constant][1]);
%2 = right_shift(%1, meta[relay.Constant][2]);
%3 = clip(%2, a_min=-127f, a_max=127f);
%4 = cast(%3, dtype="int8");
annotation.stop_fusion(%4)
}
def @dequantize_outputs(%x: Tensor[(1, 4, 16, 16), int8]) -> Tensor[(1, 4, 16, 16), float32] {
%0 = cast(%x, dtype="float32");
multiply(%0, 0.0625f)
}
def @main(%x: Tensor[(1, 4, 16, 16), float32]) -> Tensor[(1, 4, 16, 16), float32] {
let %quantized_inputs = @quantize_inputs(%x);
let %quantized_outputs = @quantized_main(%quantized_inputs.0);
@dequantize_outputs(%quantized_outputs)
}
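To make the data flow through the partitioned functions concrete, here is a minimal pure-Python sketch operating on flat lists of numbers. The scales (16 and 0.0625) and the clip range come from the example; everything else is an assumption made for illustration. In particular, quantized_main is a stand-in integer-only computation (multiply, add, right shift, clip) rather than a real conv2d, and Python's round() may break ties differently from Relay's round.

```python
SCALE = 16          # input scale, as in multiply(%x, 16f)
INV_SCALE = 0.0625  # output scale, as in multiply(%0, 0.0625f)


def clip_int8(value):
    """Clip to the symmetric range [-127, 127] used in the example."""
    return max(-127, min(127, value))


def quantize_inputs(xs):
    """float -> int8-range values; returns a 1-tuple like the Relay function."""
    return ([clip_int8(round(x * SCALE)) for x in xs],)


def quantized_main(qs):
    """Stand-in core network that uses integer arithmetic only.

    The constants are hypothetical; they play the role of the weight, bias,
    and shift constants in the example module.
    """
    weight, bias, shift = 3, 1, 1
    return [clip_int8((q * weight + bias) >> shift) for q in qs]


def dequantize_outputs(qs):
    """int8-range values -> float."""
    return [float(q) * INV_SCALE for q in qs]


def main(xs):
    """Call the three partitions in succession."""
    (quantized_inputs,) = quantize_inputs(xs)
    quantized_outputs = quantized_main(quantized_inputs)
    return dequantize_outputs(quantized_outputs)


print(main([0.5, -0.25, 100.0]))
```

On a bare-metal target, only quantized_main would need to run on the device if inputs already arrive in the quantized space.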
Note: This new option won’t be very helpful on its own until we’ve expanded operator coverage, since most networks will include unquantized operators.
Further Considerations
For IoT applications, even once you have a purely integral network along with the quantize/dequantize functions, quantization gives no hint as to how you should convert raw sensor data into the quantized input space. If you know how to convert from sensor data to float32, you can run that conversion and then run @quantize_inputs, but an optimal solution would require no intermediate floating-point values. To serve this use case, we may want an additional configuration option that lets the user specify characteristics of their raw sensor data (e.g., dtype, mean, variance), from which we could generate a @quantize_inputs function tailored to these properties.
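One way such a tailored conversion could work is sketched below in plain Python. It is an illustration under assumptions, not a proposed API: the sensor is assumed to deliver unsigned integer samples, the float32 path is assumed to be (raw - mean) / std followed by the scale of 16 from the earlier example, and make_sensor_quantizer is a hypothetical name. The floating-point work happens once, ahead of time, when folding the constants into a fixed-point multiplier and offset; the per-sample path uses integer arithmetic only.

```python
def make_sensor_quantizer(mean, std, scale=16, shift=16):
    """Build an integer-only raw-sample -> quantized-input converter.

    Ahead of time (floating point allowed): fold scale / std and the mean
    into fixed-point constants with `shift` fractional bits.
    """
    multiplier = round(scale / std * (1 << shift))
    offset = round(mean * scale / std * (1 << shift))
    half = 1 << (shift - 1)  # added before shifting to round to nearest

    def quantize_raw(raw):
        # Per-sample path: integer multiply, subtract, add, shift, clip.
        q = (raw * multiplier - offset + half) >> shift
        return max(-127, min(127, q))

    return quantize_raw


# Hypothetical 8-bit sensor with mean 128 and standard deviation 64.
quantize_raw = make_sensor_quantizer(mean=128, std=64)
print([quantize_raw(raw) for raw in (0, 128, 192, 255)])
```

The choice of 16 fractional bits is arbitrary here; a real implementation would pick it based on the sensor's dtype and the accumulator width available on the device.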
Expanded Operator Coverage
The quantization algorithm works by annotating chains of quantizable ops, and when the chain is broken (i.e., a non-quantizable op is encountered), dequantization code is inserted to convert the output of the chain from int* to float32. Thus, in order to generate a fully quantized core network, all operators in the network must be quantizable.
As our first goal, we will aim for full quantization of the CIFAR-10 CNN featured in the recent µTVM blog post (shown below).
def @main(%data: Tensor[(1, 3, 32, 32), float32], %convolution_W: Tensor[(32, 3, 5, 5), float32], %convolution_B: Tensor[(32), float32], %convolution1_W: Tensor[(32, 32, 5, 5), float32], %convolution1_B: Tensor[(32), float32], %convolution2_W: Tensor[(64, 32, 5, 5), float32], %convolution2_B: Tensor[(64), float32], %innerProduct_B: Tensor[(10, 1024), float32], %innerProduct_C: Tensor[(10), float32]) -> Tensor[(1, 10), float32] {
%0 = nn.conv2d(%data, %convolution_W, padding=[2, 2, 2, 2], kernel_size=[5, 5]);
%1 = nn.bias_add(%0, %convolution_B);
%2 = nn.pad(%1, pad_value=-3.40282e+38f, pad_width=[[0, 0], [0, 0], [0, 1], [0, 1]]);
%3 = nn.max_pool2d(%2, pool_size=[3, 3], strides=[2, 2], padding=[0, 0, 0, 0], ceil_mode=True);
%4 = nn.relu(%3);
%5 = nn.conv2d(%4, %convolution1_W, padding=[2, 2, 2, 2], kernel_size=[5, 5]);
%6 = nn.bias_add(%5, %convolution1_B);
%7 = nn.relu(%6);
%8 = nn.pad(%7, pad_width=[[0, 0], [0, 0], [0, 1], [0, 1]]);
%9 = nn.avg_pool2d(%8, pool_size=[3, 3], strides=[2, 2], padding=[0, 0, 0, 0], ceil_mode=True);
%10 = nn.conv2d(%9, %convolution2_W, padding=[2, 2, 2, 2], kernel_size=[5, 5]);
%11 = nn.bias_add(%10, %convolution2_B);
%12 = nn.relu(%11);
%13 = nn.pad(%12, pad_width=[[0, 0], [0, 0], [0, 1], [0, 1]]);
%14 = nn.avg_pool2d(%13, pool_size=[3, 3], strides=[2, 2], padding=[0, 0, 0, 0], ceil_mode=True);
%15 = nn.batch_flatten(%14);
%16 = nn.batch_flatten(%15);
%17 = nn.dense(%16, %innerProduct_B, units=10);
%18 = multiply(1f, %innerProduct_C);
nn.bias_add(%17, %18)
}
When this network is quantized, the following operators are left in float32 space: nn.bias_add, nn.pad, nn.max_pool2d, nn.relu, nn.avg_pool2d, and nn.batch_flatten.
Of these operators, there are actually only three culprits: nn.bias_add, nn.pad, and nn.batch_flatten. The remaining operators can be quantized, but only if they are in the middle of an ongoing chain of quantized operators; the only operators that initiate quantized chains are conv2d and dense.
So we will start by enabling support for the three culprit operators (nn.bias_add, nn.pad, and nn.batch_flatten) and then gradually expand coverage to support full quantization of other models.
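The chain behavior described in this section can be reproduced with a small pure-Python model. This is a simplified sketch, not the actual Relay pass: the network is flattened to the sequence of operators on its data path, and the sets CHAIN_STARTERS and CHAIN_FOLLOWERS as well as the function float32_ops are hypothetical names that encode the rules stated above (only conv2d and dense start a chain; some operators can only continue one).

```python
CHAIN_STARTERS = {"nn.conv2d", "nn.dense"}
CHAIN_FOLLOWERS = {"nn.max_pool2d", "nn.relu", "nn.avg_pool2d"}

# Data path of the CIFAR-10 CNN above, in execution order.
CIFAR10_OPS = [
    "nn.conv2d", "nn.bias_add", "nn.pad", "nn.max_pool2d", "nn.relu",
    "nn.conv2d", "nn.bias_add", "nn.relu", "nn.pad", "nn.avg_pool2d",
    "nn.conv2d", "nn.bias_add", "nn.relu", "nn.pad", "nn.avg_pool2d",
    "nn.batch_flatten", "nn.batch_flatten", "nn.dense", "nn.bias_add",
]


def float32_ops(op_sequence, starters=CHAIN_STARTERS, followers=CHAIN_FOLLOWERS):
    """Return the operators left in float32, in order of first appearance."""
    in_chain = False
    left_in_float = []
    for op in op_sequence:
        if op in starters:
            in_chain = True      # conv2d/dense begin (or continue) a chain
        elif op in followers and in_chain:
            continue             # quantized as part of the ongoing chain
        else:
            in_chain = False     # chain is broken; this op stays in float32
            if op not in left_in_float:
                left_in_float.append(op)
    return left_in_float


print(float32_ops(CIFAR10_OPS))

# Treating the three culprits as chain followers leaves nothing in float32.
extended = CHAIN_FOLLOWERS | {"nn.bias_add", "nn.pad", "nn.batch_flatten"}
print(float32_ops(CIFAR10_OPS, followers=extended))
```

Under this model, nn.max_pool2d, nn.relu, and nn.avg_pool2d end up in float32 only because an unsupported operator breaks the chain just before them, which is why supporting the three culprits is enough for this network.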