We have accelerators that directly support quantized int8 → int8 sigmoid with any requantization handled, so we can lift this full pattern and map it directly to a hardware operation.
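To make that concrete, here's a minimal sketch (using TVM's `relay.dataflow_pattern` API, with made-up shapes and quantization parameters) of matching the fake-quantized sigmoid region so the whole int8 → int8 span can be offloaded as one hardware operation:

```python
from tvm import relay
from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard

# Fake-quantized sigmoid as it appears in the graph:
#   qnn.dequantize -> sigmoid -> qnn.quantize
dequant = is_op("qnn.dequantize")(wildcard(), is_constant(), is_constant())
activation = is_op("sigmoid")(dequant)
pattern = is_op("qnn.quantize")(activation, is_constant(), is_constant())

# A small example expression covering the whole int8 -> int8 region,
# which the pattern matches so it can be lifted as a single operation.
x = relay.var("x", shape=(1, 16), dtype="int8")
deq = relay.qnn.op.dequantize(x, relay.const(0.05), relay.const(0))
act = relay.sigmoid(deq)
out = relay.qnn.op.quantize(act, relay.const(1.0 / 256.0), relay.const(-128),
                            out_dtype="int8")

assert pattern.match(out)
```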
I can perhaps elaborate on what I mean by consistency. qnn.conv2d behaves a bit differently to, say, qnn.add. For qnn.conv2d, the type of the function is int8 → int32 and the result lives in an intermediate quantization space, whereas qnn.add is straightforwardly int8 → int8. Additionally, when qnn.conv2d was first added to TVM it didn't include the input and weight scales separately, because due to a mathematical quirk the only quantization parameter the lowering actually needs is the product input_scale * weight_scale. We can see this in the documentation, which explains that the input/kernel scales were added just to help support accelerators:
input_scale: tvm.relay.Expr
The scale for the input tensor. The scale for the input tensor is
stored purely for convenience here. See more commentary below.
kernel_scale: tvm.relay.Expr
The scale for the weight tensor. The scale for the weight tensor is
stored for access to this during relay. This information is not
needed in the pass pipeline after qnn.conv2d is lowered to the
sequence of steps as in nn.conv2d. See also input_scale in Requantize.
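To show the two conventions side by side, here's a rough sketch (shapes and quantization parameters are invented) of how the ops compose in Relay: qnn.conv2d still needs a trailing qnn.requantize using the combined input*kernel scale, while qnn.add is self-contained:

```python
from tvm import relay

x = relay.var("x", shape=(1, 8, 32, 32), dtype="int8")
w = relay.var("w", shape=(16, 8, 3, 3), dtype="int8")

# qnn.conv2d accumulates in int32; the result sits in the
# input_scale * kernel_scale quantization space and needs an
# explicit qnn.requantize to get back to int8.
conv = relay.qnn.op.conv2d(
    x, w,
    input_zero_point=relay.const(0),
    kernel_zero_point=relay.const(0),
    input_scale=relay.const(0.05),
    kernel_scale=relay.const(0.02),
    kernel_size=(3, 3),
    channels=16,
    padding=(1, 1),
)
conv_int8 = relay.qnn.op.requantize(
    conv,
    input_scale=relay.const(0.05 * 0.02),  # the only parameter the math needs
    input_zero_point=relay.const(0),
    output_scale=relay.const(0.1),
    output_zero_point=relay.const(0),
    out_dtype="int8",
)

# qnn.add, by contrast, carries all its quantization parameters
# and goes straight from int8 inputs to an int8 output.
y = relay.var("y", shape=(1, 16, 32, 32), dtype="int8")
add_int8 = relay.qnn.op.add(
    conv_int8, y,
    lhs_scale=relay.const(0.1), lhs_zero_point=relay.const(0),
    rhs_scale=relay.const(0.1), rhs_zero_point=relay.const(0),
    output_scale=relay.const(0.1), output_zero_point=relay.const(0),
)
```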
This would have been simpler to reason about with 'fake quantization', where to determine the quantization parameters of any of the inputs/outputs we can just visit the adjacent quantize/dequantize op and read off the QNN params in a 'unified' way (i.e. we don't need a different way of extracting QNN information for every operator).
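For example, a single hypothetical helper like the one below would cover every operator, since under fake quantization the quantization parameters always live on the surrounding quantize/dequantize calls:

```python
from tvm import relay


def qnn_params_of(expr):
    """Read (scale, zero_point) off a qnn.quantize / qnn.dequantize call.

    Hypothetical helper: with fake quantization every tensor boundary is
    one of these two ops, so one accessor works for every operator.
    """
    if isinstance(expr, relay.Call) and expr.op.name in ("qnn.quantize",
                                                         "qnn.dequantize"):
        scale, zero_point = expr.args[1], expr.args[2]
        return scale, zero_point
    raise ValueError("expected a quantize/dequantize boundary")
```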
On the second point, we've recently been going through an exercise of trying to come up with patterns to match all the various quantized operators, and it's been pretty painful. Aside from the three conventions already discussed (int8 → int8 QNN ops like qnn.add, int8 → int32 QNN ops like qnn.conv2d, and fake-quantized ops like sigmoid), there are also other interesting patterns (a sketch of matching a couple of these follows the list):
- avg_pool2d gets a cast before and after
- mean becomes cast → mean → qnn.requantize
- some ops do nothing at all (pad/max/min)
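For a flavour of how different these shapes are to match, here's roughly what patterns for the first two bullets could look like with `relay.dataflow_pattern` (the attribute checks a real backend would need are omitted):

```python
from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard

# avg_pool2d: the int8 input is cast up, pooled, then cast back down.
avg_pool_pattern = is_op("cast")(
    is_op("nn.avg_pool2d")(is_op("cast")(wildcard()))
)

# mean: cast -> mean -> qnn.requantize
mean_pattern = is_op("qnn.requantize")(
    is_op("mean")(is_op("cast")(wildcard())),
    is_constant(), is_constant(), is_constant(), is_constant(),
)
```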
So to a degree here we are at the mercy of the authors of the framework frontend as to how they choose to express quantization. On the other hand, if the frontend simply inserted dequantize/quantize wherever it saw a quantized tensor in TFLite, we'd have a very consistent and hopefully more stable representation to match and offload against. Clearly, though, there's a downside to this in the increased complexity of any subsequent QNN lowering pass.
Apologies for the wall-of-text! Having said all this, I think if we can 'standardize' the QNN ops of Relay and ensure broad coverage, that would probably provide a similar benefit. The most valuable thing for pattern matching is just that there exists a canonical representation of QNN, so switching to either quantize/dequantize or QNN ops would be an improvement for us.