[RFC][Quantization] A new quantization framework in TVM: initial RFC (1/4)

electriclilies · April 26, 2021, 7:39pm

We do want to support propagating error from previous operators while calibrating the current conv2d operator.

Additionally, since qnn.simulated_quantize actually moves the data into affine space, the graph qnn.simulated_quantize -> nn.conv2d -> qnn.simulated_dequantize is incorrect, because nn.conv2d doesn’t take non-zero zero points into account. And since we will eventually extend QNN to support multiple dtypes anyway, it’s not much extra effort to add fp32 as a dtype.
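
To make the zero-point problem concrete, here is a minimal pure-Python sketch. It uses a plain dot product as a stand-in for nn.conv2d, and `simulated_quantize` and `dot` are hypothetical helpers written for this example, not the TVM ops. The scales and zero points are made-up values.

```python
def simulated_quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Hypothetical helper: map real values into affine space but keep them as floats."""
    return [float(min(max(round(v / scale) + zero_point, qmin), qmax)) for v in values]


def dot(a, b):
    """Stand-in for nn.conv2d: a multiply-accumulate that knows nothing about zero points."""
    return sum(p * q for p, q in zip(a, b))


# Made-up data and quantization parameters (assumptions for illustration only).
x = [0.5, -1.0, 2.0]
w = [1.0, 0.5, -0.25]
scale_x, zp_x = 0.5, 3
scale_w, zp_w = 0.25, -2

qx = simulated_quantize(x, scale_x, zp_x)  # [4.0, 1.0, 7.0]
qw = simulated_quantize(w, scale_w, zp_w)  # [2.0, 0.0, -3.0]

real_result = dot(x, w)  # -0.5

# Running the real-space op directly on affine-space data ignores the zero points.
naive_result = scale_x * scale_w * dot(qx, qw)  # -1.625

# An op that is aware of affine space subtracts the zero points before accumulating.
aware_result = scale_x * scale_w * dot(
    [q - zp_x for q in qx],
    [q - zp_w for q in qw],
)  # -0.5

print(real_result, naive_result, aware_result)
```

With both zero points set to 0 the naive and zero-point-aware results coincide, which is why the problem only shows up for non-zero zero points.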

I’m not sure I understand what you’re saying here. Like I said above, if we do simulated quantization instead of fake quantization, then we need to take zero points into account for every op that’s in affine space. Were you thinking we’d do something like this:

qnn.simulated_quantize -> qnn.simulated_dequantize -> nn.conv2d -> qnn.simulated_quantize -> qnn.simulated_dequantize.

(i.e., we’d use the simulated quantize ops to do fake quantization?)
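
For comparison, a rough sketch of what that fake-quantization graph would compute, again in pure Python with a dot product standing in for nn.conv2d. `fake_quantize` is a hypothetical helper that fuses a quantize/dequantize pair; it is not a TVM op, and the numbers are made up.

```python
def fake_quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Hypothetical helper: quantize then immediately dequantize.

    The output stays in real space but carries the rounding and clipping
    error that true quantization would introduce.
    """
    out = []
    for v in values:
        q = min(max(round(v / scale) + zero_point, qmin), qmax)  # into affine space
        out.append((q - zero_point) * scale)                     # back to real space
    return out


def dot(a, b):
    """Stand-in for nn.conv2d operating on real-space data."""
    return sum(p * q for p, q in zip(a, b))


# Made-up data and quantization parameters (assumptions for illustration only).
x = [0.6, -1.1, 2.0]
w = [1.0, 0.5, -0.25]

fq_x = fake_quantize(x, scale=0.5, zero_point=3)    # [0.5, -1.0, 2.0]
fq_w = fake_quantize(w, scale=0.25, zero_point=-2)  # [1.0, 0.5, -0.25]

# The real-space op is valid here because its inputs are back in real space.
fake_quant_result = dot(fq_x, fq_w)  # -0.5
real_result = dot(x, w)

print(fake_quant_result, real_result)
```

Here the zero points cancel inside each quantize/dequantize pair, so the real-space op never sees affine-space values; only the rounding error remains.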

I think that yes, that graph could be used for BYOC if the BYOC people want. However, that graph will still have some ops in real space that the BYOC people would need to transform into affine space, whereas the output of our final rewrite will be completely in affine space.

It’s not clear to me whether it’s easier to transform real Relay ops into affine-space BYOC or affine-space Relay ops into BYOC.