Yes, exactly! I feel your pain too.
Some frameworks use the full int8 range, e.g. [-128, 127], while others use a restricted symmetric range, [-127, 127].
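For concreteness, here is a minimal sketch of symmetric int8 quantization under the two range conventions (the function name and signature are mine, not from any particular framework):

```python
import numpy as np

def quantize_symmetric(x, restricted=False):
    """Symmetric int8 quantization.

    restricted=True clamps to [-127, 127]; otherwise the full
    [-128, 127] int8 range is used.
    """
    qmax = 127
    qmin = -127 if restricted else -128
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), qmin, qmax).astype(np.int8)
    return q, scale
```

The restricted range gives up one code point but keeps the range symmetric around zero, which some kernels rely on to sidestep int8 overflow corner cases.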
Even at the operator level, per-channel or per-layer quantization is sometimes not sufficient. A pointwise convolution should not treat channels the same way a generic conv does; likewise for a grouped conv, and it gets even more fun with fused or split operators.
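To illustrate the per-layer vs per-channel distinction, a rough sketch (function names are mine):

```python
import numpy as np

def per_tensor_scale(w):
    """One scale for the whole tensor (per-layer)."""
    return np.abs(w).max() / 127.0

def per_channel_scales(w, channel_axis=0):
    """One scale per channel along `channel_axis`.

    For a pointwise (1x1) or grouped conv, the right choice of axis
    or grouping differs from a generic conv, which is exactly why a
    one-size-fits-all rule breaks down.
    """
    reduce_axes = tuple(i for i in range(w.ndim) if i != channel_axis)
    return np.abs(w).max(axis=reduce_axes) / 127.0
```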
So the framework should be flexible enough to plug in such dedicated quantizers when needed, maybe through some kind of pattern matching.
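A toy sketch of what such pattern-based dispatch could look like; everything here (the registry, the op representation, the scheme names) is hypothetical and only meant to convey the idea:

```python
# Hypothetical pattern -> quantizer registry; not any existing API.
QUANTIZERS = []

def register(predicate):
    def wrap(fn):
        QUANTIZERS.append((predicate, fn))
        return fn
    return wrap

@register(lambda op: op["kind"] == "conv" and op.get("kernel") == (1, 1))
def quantize_pointwise(op):
    return "pointwise-specific scheme"

@register(lambda op: op["kind"] == "conv" and op.get("groups", 1) > 1)
def quantize_grouped(op):
    return "per-group scheme"

@register(lambda op: True)  # fallback for anything else
def quantize_default(op):
    return "per-layer scheme"

def pick_quantizer(op):
    # First matching pattern wins; registration order encodes specificity.
    for predicate, fn in QUANTIZERS:
        if predicate(op):
            return fn
```

The point is not this particular mechanism, just that dedicated quantizers can be matched against operator patterns instead of being hardcoded per operator type.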
I like the video codec analogy: there are many ways to encode a video, but there must be one well-defined way for any player to play it back.
In that sense, I see the qnn ops as the decoder part, and this framework should be flexible enough to allow various encoding schemes over time. For validation, we could start by reproducing, say, the TFLite scheme, but we should not be limited by it (TFLite is very, very limited).