Hi, I’ve been trying to use TVM and BYOC to deploy QNN models on an NPU that supports a fully integer QNN flow. However, when I import a pre-quantized model produced by PyTorch, all of the qint8 weights are converted into fp32 params tensors, and extra qnn.quantize ops are inserted before each qnn.conv2d to convert the weights back into int8.
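For reference, this is roughly how I produce the Relay module (a minimal sketch; the input name "input" and the input shape are my own assumptions):

import torch
from torchvision.models import quantization
from tvm import relay

# Load a pre-quantized (qint8) ResNet-18 from torchvision
model = quantization.resnet18(pretrained=True, quantize=True).eval()

# Trace the quantized model and import it into Relay
input_shape = (1, 3, 224, 224)
script_module = torch.jit.trace(model, torch.randn(input_shape)).eval()
mod, params = relay.frontend.from_pytorch(script_module, [("input", input_shape)])
print(mod)  # the qint8 weights show up as fp32 params with qnn.quantize in front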
These are the first few layers of the converted Relay for the quantized ResNet-18 model from torchvision.models.quantization.resnet:
def @main(%input: Tensor[(1, 3, 224, 224), float32], %conv1_weight: Tensor[(64, 3, 7, 7), float32], %conv1_bias: Tensor[(64), float32], ...) {
%0 = qnn.quantize(%input, 0.018622f, 114, out_dtype="uint8", axis=1);
%1 = nn.pad(%0, 114f, pad_width=[[0, 0], [0, 0], [3, 3], [3, 3]]);
%2 = qnn.quantize(%conv1_weight, 0.00308922f, 0, out_dtype="int8", axis=0);
%3 = qnn.conv2d(%1, %2, 114, 0, 0.018622f, 0.00308922f, strides=[2, 2], padding=[0, 0, 0, 0], channels=64, kernel_size=[7, 7], out_dtype="int32");
%4 = qnn.quantize(%conv1_bias, 5.75275e-05f, 0, out_dtype="int32", axis=0);
%5 = nn.bias_add(%3, %4);
%6 = qnn.requantize(%5, 5.75275e-05f, 0, 0.0146432f, 0, axis=1, out_dtype="int32");
...
I don’t understand why the PyTorch frontend maps QNN ops this way. If I export a compiled library, all of the weights are stored in fp32, and the extra qnn.quantize ops add runtime cost.
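To make that concrete, this is roughly what I mean by exporting a compiled library (a sketch; the llvm target and the file name are just placeholders):

import tvm
from tvm import relay

# params as returned by from_pytorch maps e.g. "conv1_weight" to an fp32 array
print(params["conv1_weight"].dtype)  # float32

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
lib.export_library("quantized_resnet18.so")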
My preferred way to map PyTorch QNN ops would be to copy the original qint8 weights directly into int8 params tensors. Since PyTorch does not quantize the bias, the frontend could quantize the fp32 bias into an int32 params tensor using the weight and input scales of the current conv layer (scale_bias = scale_weights * scale_input); a small sketch of this follows the Relay snippet below. With this scheme, the mapped Relay would look like:
def @main(%input: Tensor[(1, 3, 224, 224), float32], %conv1_weight: Tensor[(64, 3, 7, 7), int8], %conv1_bias: Tensor[(64), int32], ...) {
%0 = qnn.quantize(%input, 0.018622f, 114, out_dtype="uint8", axis=1);
%1 = nn.pad(%0, 114f, pad_width=[[0, 0], [0, 0], [3, 3], [3, 3]]);
%2 = qnn.conv2d(%1, %conv1_weight, 114, 0, 0.018622f, 0.00308922f, strides=[2, 2], padding=[0, 0, 0, 0], channels=64, kernel_size=[7, 7], out_dtype="int32");
%3 = nn.bias_add(%2, %conv1_bias);
%4 = qnn.requantize(%3, 5.75275e-05f, 0, 0.0146432f, 0, axis=1, out_dtype="int32");
...
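As a rough illustration of the bias handling I have in mind (quantize_bias and the variable names below are mine, not existing frontend code), note that the bias scale already consumed by qnn.requantize above follows this rule: 0.018622 * 0.00308922 ≈ 5.75275e-05.

import numpy as np

def quantize_bias(bias_fp32, input_scale, weight_scales):
    # scale_bias = scale_input * scale_weight, per output channel
    bias_scale = input_scale * weight_scales
    return np.round(bias_fp32 / bias_scale).astype("int32")

# e.g. for conv1 above: quantize_bias(conv1_bias_fp32, 0.018622, conv1_weight_scales)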
Are there plans to update the PyTorch frontend to support this scheme?