[QNN][PyTorch][BYOC] Full integer QNN support?

Hi, I’ve been trying to use TVM and BYOC to deploy QNN models on an NPU which supports full integer QNN flow. However, when I import a pre-quantized model produced by PyTorch, all qint8 weights are converted into fp32 params tensors, and additional qnn.quantize are inserted before qnn.conv2d to convert the weights back into int8.

First few layers of converted Relay of quantized ResNet-18 model from torchvision.models.quantization.resnet:

def @main(%input: Tensor[(1, 3, 224, 224), float32], %conv1_weight: Tensor[(64, 3, 7, 7), float32], %conv1_bias: Tensor[(64), float32], ...) {
  %0 = qnn.quantize(%input, 0.018622f, 114, out_dtype="uint8", axis=1);
  %1 = nn.pad(%0, 114f, pad_width=[[0, 0], [0, 0], [3, 3], [3, 3]]);
  %2 = qnn.quantize(%conv1_weight, 0.00308922f, 0, out_dtype="int8", axis=0);
  %3 = qnn.conv2d(%1, %2, 114, 0, 0.018622f, 0.00308922f, strides=[2, 2], padding=[0, 0, 0, 0], channels=64, kernel_size=[7, 7], out_dtype="int32");
  %4 = qnn.quantize(%conv1_bias, 5.75275e-05f, 0, out_dtype="int32", axis=0);
  %5 = nn.bias_add(%3, %4);
  %6 = qnn.requantize(%5, 5.75275e-05f, 0, 0.0146432f, 0, axis=1, out_dtype="int32");
  ...

I don’t know why the PyTorch frontend is designed to map QNN ops this way. If I export a compiled library, all the weights are stored in fp32, and all the extra qnn.quantize ops carry a performance cost.

My preferred mapping would copy the original qint8 weights directly into int8 params tensors. Since PyTorch does not quantize the bias, the frontend could quantize the fp32 bias into int32 params tensors using the weight and input scales of the current conv layer (scale_bias = scale_weights * scale_input). The mapped Relay would then look like:
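To make the bias scheme concrete, here is a minimal NumPy sketch of what I have in mind. The function name quantize_bias and the sample bias values are made up for illustration and are not part of any TVM or PyTorch API. The two scales are the conv1 values from the Relay dump above, and I assume a zero point of 0 for the bias.

```python
import numpy as np

def quantize_bias(bias_fp32, input_scale, weight_scale):
    """Quantize an fp32 bias to int32 (hypothetical helper, zero point assumed 0).

    The bias scale is the product of the input scale and the weight scale,
    so the bias can be added directly to the int32 conv accumulator.
    """
    bias_scale = float(input_scale) * float(weight_scale)
    q = np.round(np.asarray(bias_fp32, dtype=np.float64) / bias_scale)
    info = np.iinfo(np.int32)
    return np.clip(q, info.min, info.max).astype(np.int32), bias_scale

# Scales taken from the conv1 layer above; the bias values are invented.
bias = np.array([0.1, -0.05, 0.0], dtype=np.float32)
q_bias, bias_scale = quantize_bias(bias, 0.018622, 0.00308922)
print(q_bias, bias_scale)
```

The computed bias_scale comes out at roughly 5.75275e-05, which matches the scale the frontend already passes to qnn.quantize for %conv1_bias.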

def @main(%input: Tensor[(1, 3, 224, 224), float32], %conv1_weight: Tensor[(64, 3, 7, 7), int8], %conv1_bias: Tensor[(64), int32],  ...) {
  %0 = qnn.quantize(%input, 0.018622f, 114, out_dtype="uint8", axis=1);
  %1 = nn.pad(%0, 114f, pad_width=[[0, 0], [0, 0], [3, 3], [3, 3]]);
  %2 = qnn.conv2d(%1, %conv1_weight, 114, 0, 0.018622f, 0.00308922f, strides=[2, 2], padding=[0, 0, 0, 0], channels=64, kernel_size=[7, 7], out_dtype="int32");
  %3 = nn.bias_add(%2, %conv1_bias);
  %4 = qnn.requantize(%3, 5.75275e-05f, 0, 0.0146432f, 0, axis=1, out_dtype="int32");
  ...

Will there be updates on the PyTorch frontend to support this scheme?

Quantized PyTorch models store quantized weights in a custom packed format, so we cannot directly access the 8-bit weights. Instead, we unpack the packed weight into fp32 using a PyTorch function, convert the fp32 tensor to numpy, and apply qnn.quantize to get the quantized weights back. Weight quantization then happens at compile time during relay.build(...), via the constant folding pass.

But you are right that this poses a problem for the BYOC flow, because we cannot apply constant folding on a QNN graph. So right now, a BYOC backend needs to quantize the weights itself during build.

Unfortunately, this is the only way to get int8 weights at compile time for BYOC flow. In hindsight, it would have probably been better to quantize the weight in the frontend, by implementing quantize in numpy.
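As a rough sketch of what "quantize in numpy" could look like, assuming per-tensor affine quantization: the helper names quantize_weight and dequantize are hypothetical, not existing frontend functions, and np.round rounds halves to even, which may differ from the rounding qnn.quantize uses on exact ties.

```python
import numpy as np

def quantize_weight(weight_fp32, scale, zero_point=0, dtype=np.int8):
    """Per-tensor affine quantization: q = clip(round(w / scale) + zp)."""
    info = np.iinfo(dtype)
    q = np.round(np.asarray(weight_fp32, dtype=np.float64) / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize(q, scale, zero_point=0):
    """Inverse mapping, useful for checking the round trip."""
    return (q.astype(np.float64) - zero_point) * scale

# Scale taken from the conv1 weight in the Relay dump; the weights are invented
# so that they sit exactly on the quantization grid.
scale = 0.00308922
w = (np.array([-128, -1, 0, 1, 127]) * scale).astype(np.float32)
q = quantize_weight(w, scale)
print(q, dequantize(q, scale))
```

Values outside the representable range saturate at the int8 limits, so an fp32 weight of 10.0 with this scale would map to 127.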

I believe this is already what PT frontend does.

Thanks for your reply. Right now I’m using backend::contrib::JSONSerializer as the base of my BYOC codegen, which is not capable of modifying constant tensors.

Also I tried the ONNX frontend and it is able to import the quantized weights and bias directly.

So will there be any updates on the PyTorch frontend in the future to support quantizing the weights and bias? I believe it’s important to have a consistent QNN Relay representation across different frontends so the backend developers don’t need to adapt to all the frontends themselves.

You don’t need to modify constant tensors at the Relay level. Presumably you have a codegen toolchain for your NPU; you can send the FP32 tensors to that toolchain and quantize the constant weights there.

But I have to admit that this is a terrible workflow and there is no reason weight quantization cannot be done by the PyTorch frontend. This was a design mistake that I didn’t realize until using PT quantized models with the BYOC flow. I’ll update the frontend to optionally quantize weights at Numpy level and return int8 weights to users. cc @elenkalda-arm

What exactly do you mean by this? As far as I know, PT doesn’t support exporting quantized models to ONNX, and quantization support for our ONNX frontend is still on-going.

Unfortunately, the NPU I use only has a runtime library, so there’s no way to do compile-time quantization.

I guess it’s best to wait for the PT frontend update then. Thanks for your help.

I used onnxruntime to quantize an FP32 ONNX model into a UINT8 model, which can be imported by TVM’s ONNX frontend. (And you’re right, the ONNX frontend still needs to support more quantization ops; right now the mapped Q-ops are quite limited.)

@Nullko A new option to return int8 parameters is added in https://github.com/apache/tvm/pull/9135

Great! Thank you for the update!