[RFC][Quantization] A new quantization framework in TVM: initial RFC (1/4)

@mikeseven Yes, the goal is to create a fully quantized graph, and we do recognize that this transformation will change the output of the graph. For this reason, we’re not going to present the rewrite as a Relay pass. And I definitely agree that we should let there be user-defined handling.

Also, we definitely have been thinking about simulating accumulation in affine space. For int8 input datatypes with int32 accumulation, simulating int32 accumulation is probably not super important since there’s a low likelihood of overflow. Therefore we’re hoping to deal with it in the multi-dtype extension. One option for doing this is creating another simulated QNN op that simulates overflow for a given dtype.
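To make that concrete, a simulated-overflow op could keep the accumulator in fp32/fp64 but fold the values back into the range of the target accumulation dtype. A minimal numpy sketch of the idea (the op name and interface here are hypothetical, not existing QNN ops):

```python
import numpy as np

# Hypothetical sketch: simulate accumulator overflow for a given dtype while
# the values themselves are still carried in floating point.
def simulated_overflow(acc, dtype="int32", mode="wrap"):
    info = np.iinfo(dtype)
    if mode == "clip":
        # Saturating accumulation.
        return np.clip(acc, info.min, info.max)
    # Two's-complement wrap-around.
    span = float(info.max) - float(info.min) + 1.0
    return ((acc - info.min) % span) + info.min

acc = np.array([2.5e9, -3.1e9, 1.0e4], dtype=np.float64)
print(simulated_overflow(acc, "int32", "wrap"))  # [-1.794967296e9, 1.194967296e9, 1.0e4]
print(simulated_overflow(acc, "int32", "clip"))  # [ 2.147483647e9, -2.147483648e9, 1.0e4]
```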

We do want to support propagating error from previous operators while calibrating the current conv2d operator.

Additionally, since qnn.simulated_quantize does actually move the data into affine space, qnn.simulated_quantize -> nn.conv2d -> qnn.simulated_dequantize is actually incorrect, since nn.conv2d doesn’t take non-zero zero points into account. And, since we will eventually extend QNN to support multiple dtypes anyways, it’s not that much effort to add fp32 as a dtype.
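To illustrate why, here is a small numpy check (using a dot product as a stand-in for the convolution's inner reduction, with made-up scales and zero points): feeding the affine-space values straight into the float op ignores the zero points and gives the wrong result, while subtracting them first recovers the real-space answer.

```python
import numpy as np

# Real-space values and affine-space quantization parameters (illustrative numbers).
x, w = np.array([0.5, -0.25, 1.0]), np.array([0.2, 0.4, -0.1])
s1, zp1 = 0.01, 10     # data scale / zero point
s2, zp2 = 0.005, -3    # weight scale / zero point

q_x = np.round(x / s1) + zp1   # affine-space data
q_w = np.round(w / s2) + zp2   # affine-space weights

real = np.dot(x, w)                             # ground truth
wrong = s1 * s2 * np.dot(q_x, q_w)              # zero points ignored
right = s1 * s2 * np.dot(q_x - zp1, q_w - zp2)  # zero points handled

print(real, wrong, right)  # `right` matches `real` up to rounding error; `wrong` does not
```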

I’m not sure I understand what you’re saying here. Like I said above, if we do simulated quantization instead of fake quantization, then we need to take zero points into account for every op that’s in affine space. Were you thinking we’d do something like this:

qnn.simulated_quantize -> qnn.simulated_dequantize -> nn.conv2d -> qnn.simulated_quantize -> qnn.simulated_dequantize.

(i.e., we’d use the simulated quantize ops to do fake quantization?)
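For comparison, the fake-quantization reading of that pattern would just round-trip the tensor through the affine representation and hand the float operator real-space values again. A minimal sketch of that round trip (not the actual qnn.simulated_quantize/qnn.simulated_dequantize implementation):

```python
import numpy as np

def fake_quant(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize-then-dequantize: the output stays in real space,
    but carries the rounding/clipping error of the affine representation."""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

x = np.linspace(-1.0, 1.0, 5)
print(fake_quant(x, scale=0.03, zero_point=0))  # [-0.99, -0.51, 0., 0.51, 0.99]
# The float nn.conv2d would then consume this real-space (but error-carrying) tensor.
```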

I think that yes, that graph could be used for BYOC if the BYOC people want. However, that graph will still have some ops in real space that the BYOC people would need to transform into affine space, whereas the output of our final rewrite will be completely in affine space.

It’s not clear to me whether it’s easier to transform real Relay ops into affine-space BYOC or affine-space Relay ops into BYOC.

Thanks Lily. Agree :wink:


Also, as part of the standardization of QNN, we could ensure that all QNN “compute” ops go from int8 -> int8. I believe that qnn.conv2d is the only QNN op that outputs an accumulation dtype, so we could change qnn.conv2d to take in a bias in addition to the data and weight.
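A rough sketch of what such an int8 -> int8 compute op with a fused bias could look like, using a dense/matmul stand-in and plain numpy (this only illustrates the idea; it is not the proposed qnn.conv2d signature, and the weights are assumed symmetric for brevity):

```python
import numpy as np

def qnn_dense_int8(q_x, q_w, q_bias, s_in, zp_in, s_w, s_out, zp_out):
    """int8 in -> int8 out: accumulate in int32, add bias, requantize."""
    # Weights assumed symmetric (zero point 0); bias pre-quantized with scale s_in * s_w.
    acc = np.matmul(q_x.astype(np.int32) - zp_in, q_w.astype(np.int32).T)
    acc += q_bias
    acc_scale = s_in * s_w  # effective scale of the int32 accumulator
    q_out = np.round(acc * acc_scale / s_out) + zp_out
    return np.clip(q_out, -128, 127).astype(np.int8)
```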

Thanks, that makes sense. I was thinking that during calibration, you could use different attributes for the simulated_quantize and simulated_dequantize ops. In the callback for calibrating an operator, one can simulate the affine space and reason about scales and zero points. But for capturing real values, you could use the passthrough feature of the simulated ops to prevent any error. In this case, qnn.simulated_quantize (passthrough) -> nn.conv2d -> qnn.simulated_dequantize (passthrough) would work. But I read your earlier RFC, and I think you are also maintaining the original graph to find the real tensor values without any error if needed. So it makes sense to me.


Yes, conv (1d/2d/3d) + bias is a typical quantized op with an accumulator, and so is conv + bias + relu. Fully connected/matmul too.

It might be best to actually relax the invariant on QNN ops from affine space -> affine space to real space -> real space.

It is also more in line with pushing implementation details into QNN.

Take add, for example. The way it’s implemented in the current QNN is that we take in our input tensors and (re)quantize them to the output quantization parameters. If the way we quantize the inputs sometimes depends on the implementation of the operator, then it makes sense to let QNN control quantization.
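To spell that out, the current qnn.add behaviour is conceptually something like the following numpy sketch (an illustration of the idea, not the actual TVM lowering):

```python
import numpy as np

def requantize(q, in_scale, in_zp, out_scale, out_zp):
    """Re-express a tensor in a different affine space."""
    return np.round((q - in_zp) * in_scale / out_scale) + out_zp

def qnn_add(q_a, s_a, zp_a, q_b, s_b, zp_b, s_out, zp_out):
    # Both inputs are first requantized to the *output* quantization params,
    # then added directly in that shared affine space.
    a = requantize(q_a, s_a, zp_a, s_out, zp_out)
    b = requantize(q_b, s_b, zp_b, s_out, zp_out)
    # Subtract one zero point so the sum is still expressed with zp_out.
    return np.clip(a + b - zp_out, -128, 127)
```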

@AndrewZhaoLuo excellent point. In quantization frameworks, there are a few points of control: for example, at the operator itself (the QNN ops) and at the framework that orchestrates the operators.

At the operator level, while the equations are the same, the calculation of the parameters may differ from one framework (e.g. TF vs. PyTorch) to another, or some hardware may have a more optimized ISA for some functions (e.g. uint8 vs. int8, full range vs. restricted range, vectorization, …).
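As one concrete example of those choices, the full-range/restricted-range and uint8/int8 decisions already change how scales and zero points are computed. A small sketch under common conventions (illustrative only):

```python
import numpy as np

x = np.random.randn(1024).astype("float32")

# Symmetric, restricted-range int8 ([-127, 127]): zero point fixed at 0.
scale_sym = np.abs(x).max() / 127.0
zp_sym = 0

# Asymmetric uint8 ([0, 255]): the zero point shifts the range to cover min..max.
lo, hi = min(x.min(), 0.0), max(x.max(), 0.0)
scale_asym = (hi - lo) / 255.0
zp_asym = int(np.round(-lo / scale_asym))
```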

The choice at the QNN op level has an impact on the framework controlling these ops, which is trying to minimize quantization error and maximize performance across the model.

These controls are useful for preserving the accuracy of pre-quantized models, fine-tuning for specific devices, and full-blown quantization.

I think the key here is to provide statistics that let a user-defined quantization strategy decide how strictly the “invariant” must be preserved.

@mikeseven @AndrewZhaoLuo I do think doing real_space -> real_space would be a better invariant for QNN. @mbrookhart and I were discussing the fact that when you use qnn.conv2d, the requantize that follows it needs to have an input scale that is dependent on the input scales to the qnn.conv2d.

Concretely, if you have qnn.conv2d(data, s1, zp1, s2, zp2), the requantize that follows it must be requantize(qnn.conv2d, s1 * s2, 0, new_scale, new_zp).

This makes it really easy to introduce errors during the graph rewrite and moreover is a headache because there are more things to keep track of during the rewrite…
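A quick numerical check of that coupling, with a dot product standing in for the convolution and made-up quantization parameters:

```python
import numpy as np

x, w = np.array([0.3, -0.6, 0.9]), np.array([0.5, 0.25, -0.75])
s1, zp1 = 0.01, 5
s2, zp2 = 0.02, 0   # weights quantized symmetrically here

q_x = np.round(x / s1) + zp1
q_w = np.round(w / s2) + zp2

# The int32 accumulator produced by qnn.conv2d lives in affine space with
# scale s1 * s2 and zero point 0, which is why the following requantize
# must be fed exactly those parameters.
acc = np.dot(q_x - zp1, q_w - zp2)
print(acc * (s1 * s2))   # ~= np.dot(x, w), up to rounding error
print(np.dot(x, w))
```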

I definitely need to think more about how this would change the structure of our rewrite and existing code. At a certain point, we will have to link QNN ops together so that we are only operating in affine space and there is no transition back to real space in between affine-space regions. It’s not clear to me how to do this without violating the invariant that QNN always goes from real space to real space.

Please correct me if I’m wrong. The way I understand the real-to-real invariant is that quantization operations are carried out in fp32. If the scale and zero point are also in fp32, then, since all quantization operations are linear, there is no error and the invariant holds fully. Once you start using a lower-bit representation, the invariant no longer holds.

@mikeseven I think the real to real invariant strictly refers to the inputs and outputs of the QNN ops, not what is happening inside. Specifically, we’re talking about whether we have scaled and shifted (regardless of dtype). So in the “real to real” invariant, unscaled data would come in to QNN ops with the quantization parameters, be quantized within the QNN op, the operator would do stuff in affine space, then the output would be scaled and shifted back into real space. Essentially we’d be pushing qnn.quantize and qnn.requantize into the qnn.conv2d, qnn.dense, etc.
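Under that invariant, a QNN compute op would conceptually look like the following numpy sketch (an illustration under the assumptions stated in the comments, not actual Relay/QNN code):

```python
import numpy as np

def real_to_real_qnn_dense(x_fp32, w_fp32, s_in, zp_in, s_w, s_out, zp_out):
    """'Real space in, real space out': quantization happens inside the op."""
    # Quantize the real-space input on entry (weights assumed symmetric: zp = 0).
    q_x = np.clip(np.round(x_fp32 / s_in) + zp_in, -128, 127)
    q_w = np.clip(np.round(w_fp32 / s_w), -128, 127)
    # The actual compute happens in affine space with an int32-style accumulator.
    acc = np.matmul(q_x - zp_in, q_w.T)
    # Requantize to the output params, then hand real-space values back out.
    q_out = np.clip(np.round(acc * (s_in * s_w) / s_out) + zp_out, -128, 127)
    return (q_out - zp_out) * s_out
```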

However, after some offline discussion, @AndrewZhaoLuo and I are not actually sure if that is the best approach because it makes quantizing weights ahead of time difficult and also introduces complexity into how we link up the QNN ops.

Given the depth of the discussion here and the large search space, maybe we should have an RFC to discuss the changes to QNN and what makes the most sense.


Thanks for your great work! I have a question: has anyone converted an ONNX model with Q/DQ ops directly to Relay? Any suggestions would be appreciated.

Hi @electriclilies,

Thanks for the detailed post. Are the remaining 3 parts of the RFC posted somewhere as well?

Thanks in advance!

No, there is no update on this initiative.

@masahi thank you for your reply. Are there any alternatives to this initiative that I can refer to and use for quantizing fp32 models within TVM?

Unfortunately, there is no good solution for quantization in TVM now. See “Status on quantization in TVM” and “An enabling framework for int8 quantization”.

@masahi thank you for sharing the links. I’ll go through them.