We have accelerators that directly support quantized int8 → int8 sigmoid with any requantization handled, so we can lift this full pattern and map it directly to a hardware operation.
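To make that concrete, here's a minimal sketch (using TVM's `relay.dataflow_pattern` API, with made-up shapes and quantization parameters) of matching the fake-quantized sigmoid region so the whole int8 → int8 span can be offloaded as one hardware operation:

```python
from tvm import relay
from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard

# Fake-quantized sigmoid as it appears in the graph:
#   qnn.dequantize -> sigmoid -> qnn.quantize
dequant = is_op("qnn.dequantize")(wildcard(), is_constant(), is_constant())
activation = is_op("sigmoid")(dequant)
pattern = is_op("qnn.quantize")(activation, is_constant(), is_constant())

# A small example expression covering the whole int8 -> int8 region,
# which the pattern matches so it can be lifted as a single operation.
x = relay.var("x", shape=(1, 16), dtype="int8")
deq = relay.qnn.op.dequantize(x, relay.const(0.05), relay.const(0))
act = relay.sigmoid(deq)
out = relay.qnn.op.quantize(act, relay.const(1.0 / 256.0), relay.const(-128),
                            out_dtype="int8")

assert pattern.match(out)
```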
I can perhaps elaborate on what I mean by consistency. qnn.conv2d behaves a bit differently to, say, qnn.add. For qnn.conv2d, the type of the function is int8 → int32 and the result lives in an intermediate quantization space, whereas qnn.add is straightforwardly int8 → int8. Additionally, when qnn.conv2d was first added to TVM it didn't include the input and weight scales separately, because due to a mathematical quirk the only quantization parameter the lowering actually needs is the product input_scale * weight_scale. We can see this in the documentation, which explains that the input/kernel scales were added just to help support accelerators:
input_scale: tvm.relay.Expr
The scale for the input tensor. The scale for the input tensor is
stored purely for convenience here. See more commentary below.
kernel_scale: tvm.relay.Expr
The scale for the weight tensor. The scale for the weight tensor is
stored for access to this during relay. This information is not
needed in the pass pipeline after qnn.conv2d is lowered to the
sequence of steps as in nn.conv2d. See also input_scale in Requantize.
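To show the two conventions side by side, here's a rough sketch (shapes and quantization parameters are invented) of how the ops compose in Relay: qnn.conv2d still needs a trailing qnn.requantize using the combined input*kernel scale, while qnn.add is self-contained:

```python
from tvm import relay

x = relay.var("x", shape=(1, 8, 32, 32), dtype="int8")
w = relay.var("w", shape=(16, 8, 3, 3), dtype="int8")

# qnn.conv2d accumulates in int32; the result sits in the
# input_scale * kernel_scale quantization space and needs an
# explicit qnn.requantize to get back to int8.
conv = relay.qnn.op.conv2d(
    x, w,
    input_zero_point=relay.const(0),
    kernel_zero_point=relay.const(0),
    input_scale=relay.const(0.05),
    kernel_scale=relay.const(0.02),
    kernel_size=(3, 3),
    channels=16,
    padding=(1, 1),
)
conv_int8 = relay.qnn.op.requantize(
    conv,
    input_scale=relay.const(0.05 * 0.02),  # the only parameter the math needs
    input_zero_point=relay.const(0),
    output_scale=relay.const(0.1),
    output_zero_point=relay.const(0),
    out_dtype="int8",
)

# qnn.add, by contrast, carries all its quantization parameters
# and goes straight from int8 inputs to an int8 output.
y = relay.var("y", shape=(1, 16, 32, 32), dtype="int8")
add_int8 = relay.qnn.op.add(
    conv_int8, y,
    lhs_scale=relay.const(0.1), lhs_zero_point=relay.const(0),
    rhs_scale=relay.const(0.1), rhs_zero_point=relay.const(0),
    output_scale=relay.const(0.1), output_zero_point=relay.const(0),
)
```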
This would have been simpler to reason about with 'fake quantization', where to determine the quantization parameters of any of the inputs/outputs we can just visit the adjacent quantize/dequantize op and read off the QNN params in a 'unified' way (i.e. we don't need a different way of extracting QNN information for every operator).
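For example, a single hypothetical helper like the one below would cover every operator, since under fake quantization the quantization parameters always live on the surrounding quantize/dequantize calls:

```python
from tvm import relay


def qnn_params_of(expr):
    """Read (scale, zero_point) off a qnn.quantize / qnn.dequantize call.

    Hypothetical helper: with fake quantization every tensor boundary is
    one of these two ops, so one accessor works for every operator.
    """
    if isinstance(expr, relay.Call) and expr.op.name in ("qnn.quantize",
                                                         "qnn.dequantize"):
        scale, zero_point = expr.args[1], expr.args[2]
        return scale, zero_point
    raise ValueError("expected a quantize/dequantize boundary")
```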
On the second point, we've recently been going through an exercise of trying to come up with patterns to match all the various quantized operators, and it's been pretty painful. Aside from the three conventions already discussed (int8 → int8 QNN ops like qnn.add, int8 → int32 QNN ops like qnn.conv2d, and fake-quantized ops like sigmoid), there are also other interesting patterns (a sketch of matching a couple of these follows the list):
- avg_pool2d gets a cast before and after
- mean becomes cast → mean → qnn.requantize
- some ops do nothing at all (pad/max/min)
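For a flavour of how different these shapes are to match, here's roughly what patterns for the first two bullets could look like with `relay.dataflow_pattern` (the attribute checks a real backend would need are omitted):

```python
from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard

# avg_pool2d: the int8 input is cast up, pooled, then cast back down.
avg_pool_pattern = is_op("cast")(
    is_op("nn.avg_pool2d")(is_op("cast")(wildcard()))
)

# mean: cast -> mean -> qnn.requantize
mean_pattern = is_op("qnn.requantize")(
    is_op("mean")(is_op("cast")(wildcard())),
    is_constant(), is_constant(), is_constant(), is_constant(),
)
```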
So to a degree here we are at the mercy of the authors of the framework frontend as to how they choose to express quantization. On the other hand, if the frontend simply inserted dequantize/quantize wherever it saw a quantized tensor in TFLite, we'd have a very consistent and hopefully more stable representation to match and offload against. Clearly, though, there's a downside to this in the increased complexity of any subsequent QNN lowering pass.
Apologies for the wall-of-text! Having said all this, I think if we can 'standardize' the QNN ops of Relay and ensure broad coverage, that would probably provide a similar benefit. The most valuable thing for pattern matching is just that there exists a canonical representation of QNN, so switching to either quantize/dequantize or QNN ops would be an improvement for us.