Hi, I’ve been trying to use TVM and BYOC to deploy QNN models on an NPU that supports a full-integer QNN flow. However, when I import a pre-quantized model produced by PyTorch, all `qint8` weights are converted into `fp32` params tensors, and additional `qnn.quantize` ops are inserted before `qnn.conv2d` to convert the weights back into `int8`.

Here are the first few layers of the converted Relay for a quantized ResNet-18 model from `torchvision.models.quantization.resnet`:
```
def @main(%input: Tensor[(1, 3, 224, 224), float32], %conv1_weight: Tensor[(64, 3, 7, 7), float32], %conv1_bias: Tensor[(64), float32], ...) {
  %0 = qnn.quantize(%input, 0.018622f, 114, out_dtype="uint8", axis=1);
  %1 = nn.pad(%0, 114f, pad_width=[[0, 0], [0, 0], [3, 3], [3, 3]]);
  %2 = qnn.quantize(%conv1_weight, 0.00308922f, 0, out_dtype="int8", axis=0);
  %3 = qnn.conv2d(%1, %2, 114, 0, 0.018622f, 0.00308922f, strides=[2, 2], padding=[0, 0, 0, 0], channels=64, kernel_size=[7, 7], out_dtype="int32");
  %4 = qnn.quantize(%conv1_bias, 5.75275e-05f, 0, out_dtype="int32", axis=0);
  %5 = nn.bias_add(%3, %4);
  %6 = qnn.requantize(%5, 5.75275e-05f, 0, 0.0146432f, 0, axis=1, out_dtype="int32");
  ...
```
I don’t understand why the PyTorch frontend maps QNN ops this way: if I want to export a compiled library, all the weights are stored in `fp32`, and all the extra `qnn.quantize` ops incur a runtime performance cost.
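To illustrate the redundancy: the `fp32` weights in the dump are just dequantized copies of the original `qint8` values, so the inserted `qnn.quantize` merely reconstructs the same `int8` tensor at runtime. A small numpy sketch of that round trip (using the weight scale from `%2` above; the random weights are illustrative, not from the real model):

```python
import numpy as np

rng = np.random.default_rng(0)
scale = np.float32(0.00308922)  # weight scale from the qnn.quantize on %conv1_weight

# Original qint8 weights as stored in the pre-quantized PyTorch model
w_int8 = rng.integers(-128, 128, size=(64, 3, 7, 7), dtype=np.int8)

# What the frontend currently stores as params: dequantized fp32 weights (4x larger)
w_fp32 = w_int8.astype(np.float32) * scale

# What the inserted qnn.quantize recomputes at runtime (zero_point = 0)
w_requant = np.clip(np.round(w_fp32 / scale), -128, 127).astype(np.int8)

assert np.array_equal(w_requant, w_int8)  # lossless round trip, pure overhead
```

The round trip is exact, so storing the weights in `int8` from the start would lose nothing.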
My preferred way to map PyTorch QNN ops would be to copy the original `qint8` weights directly into `int8` params tensors. Since PyTorch does not quantize the bias, the frontend could quantize the `fp32` bias into `int32` params tensors using the scales of the weights and the input of the current conv layer (`scale_bias = scale_weights * scale_input`). With this mapping, the resulting Relay would look like:
```
def @main(%input: Tensor[(1, 3, 224, 224), float32], %conv1_weight: Tensor[(64, 3, 7, 7), int8], %conv1_bias: Tensor[(64), int32], ...) {
  %0 = qnn.quantize(%input, 0.018622f, 114, out_dtype="uint8", axis=1);
  %1 = nn.pad(%0, 114f, pad_width=[[0, 0], [0, 0], [3, 3], [3, 3]]);
  %2 = qnn.conv2d(%1, %conv1_weight, 114, 0, 0.018622f, 0.00308922f, strides=[2, 2], padding=[0, 0, 0, 0], channels=64, kernel_size=[7, 7], out_dtype="int32");
  %3 = nn.bias_add(%2, %conv1_bias);
  %4 = qnn.requantize(%3, 5.75275e-05f, 0, 0.0146432f, 0, axis=1, out_dtype="int32");
  ...
```
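The offline bias quantization I’m proposing could be sketched in a few lines. This is a hypothetical helper (`quantize_bias` is not existing frontend code), assuming per-output-channel weight scales and `scale_bias = scale_input * scale_weights`:

```python
import numpy as np

def quantize_bias(bias_fp32, input_scale, weight_scales):
    """Quantize an fp32 conv bias into int32 params offline.

    bias_fp32:     (C,) fp32 bias from the pre-quantized PyTorch model
    input_scale:   scalar scale of the conv input
    weight_scales: (C,) per-output-channel weight scales
    """
    bias_scale = input_scale * weight_scales  # scale_bias = scale_input * scale_weights
    q = np.round(bias_fp32 / bias_scale)
    i32 = np.iinfo(np.int32)
    return np.clip(q, i32.min, i32.max).astype(np.int32)

# Scales from the first conv layer above: 0.018622 * 0.00308922 ~= 5.75275e-05
b = quantize_bias(np.array([0.001, -0.0005], dtype=np.float32),
                  0.018622, np.array([0.00308922, 0.00308922]))
# b is the int32 bias tensor that would be stored directly in params
```

Since all the scales are known at import time, this costs nothing at runtime.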
Will there be updates on the PyTorch frontend to support this scheme?