[RFC][Quantization] Quantization in TVM

electriclilies · February 18, 2021, 9:28pm

Quantization in TVM

(This RFC corresponds to a PR #7474)

The goal of this work is to create a flexible and extensible framework for quantizing and calibrating models. Specifically, I want to

Allow arbitrary patterns to be rewritten to a corresponding quantized pattern
Support different, data-aware calibration methods, and allow new ones to be implemented easily
Easily be able to accommodate quantization to new datatypes in the future

I have broken the workflow down into three steps, quantization, calibration and requantization.

In quantization, I identify patterns in the original model that we want to quantize, and replace them with a quantized version of that pattern. I set the scale and zero points in qnn ops to relay variables, which will be set in calibration.

In calibration, I provide a callback through which users can set the scale and zero point variables to values, and run intermediate parts of the graph with real inputs to support data-aware calibration.

In requantization, I remove extraneous qnn.quantize and qnn.dequantize ops, and replace them with qnn.requantize. In quantization, I don’t insert any qnn.requantize ops because qnn.requantize requires scales and zero points to be constant scalars, not expressions, and postponing the insertion of qnn.requantize ops allows quantization to be more modular. More on this in the Requantization section.

Quantization

In quantization, I use existing qnn ops to construct a quantized version of the graph.

There are two main classes involved in quantization: QuantizerPattern, a subclass of DFPatternCallback, and Quantizer. (DFPatternCallback finds specific patterns in a relay function, and transforms them using the pattern matcher).

The QuantizerPattern contains the pattern that we want to quantize, and also implements the callback method from the DFPatternCallback class to rewrite that pattern. For example, the Conv2DPattern class rewrites

E0

fn (data, weight) {
	%0 = nn.conv2d(data, weight)
}

as

E1

fn (data, weight, scale_var_0, zp_var_0, scale_var_1, zp_var_1) {
	%0 = qnn.quantize(data, scale_var_0, zp_var_0)
	
	%1 = qnn.quantize(weight, scale_var_1, zp_var_1)
	
	%2 = qnn.conv2d(%0, %1, zp_var_0, zp_var_1, scale_var_0, scale_var_1)
	
	%3 = qnn.dequantize(%2, scale_var_0 * scale_var_1, relay.const(0, dtype='int32'))
}
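
To see why the dequantize in E1 uses scale_var_0 * scale_var_1 with a zero point of 0, here is a minimal NumPy sketch. It assumes the usual affine scheme real ≈ scale * (q - zero_point) and uses a matrix multiply as a stand-in for the convolution. The helpers quantize, dequantize and int_matmul are hypothetical illustrations, not TVM APIs, and the scale and zero point values are made up.

```python
import numpy as np


def quantize(x, scale, zp, dtype=np.int8):
    """Affine quantization: q = clip(round(x / scale) + zp)."""
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zp
    return np.clip(q, info.min, info.max).astype(dtype)


def dequantize(q, scale, zp):
    """Inverse mapping back to floating point: x ~= scale * (q - zp)."""
    return scale * (q.astype(np.float64) - zp)


def int_matmul(q_data, q_weight, zp_data, zp_weight):
    """Integer matmul that removes the input zero points and accumulates in int32."""
    d = q_data.astype(np.int32) - zp_data
    w = q_weight.astype(np.int32) - zp_weight
    return d @ w.T


rng = np.random.default_rng(0)
data = rng.uniform(-1, 1, size=(4, 16)).astype(np.float32)
weight = rng.uniform(-1, 1, size=(8, 16)).astype(np.float32)

# Made-up quantization parameters, chosen so nothing clips in int8.
scale_d, zp_d = 1.0 / 100, 5
scale_w, zp_w = 1.0 / 120, -4

acc = int_matmul(quantize(data, scale_d, zp_d), quantize(weight, scale_w, zp_w), zp_d, zp_w)

# The zero points were already removed inside int_matmul, so only the
# product of the two input scales is needed to get back to real values.
approx = dequantize(acc, scale_d * scale_w, 0)
exact = data.astype(np.float64) @ weight.astype(np.float64).T
max_err = float(np.max(np.abs(approx - exact)))
```

Because the accumulator holds sum((q_d - zp_d) * (q_w - zp_w)), each term is approximately (d / scale_d) * (w / scale_w), which is why a single multiplication by scale_d * scale_w recovers the float result up to rounding error.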

Here is a shorter version of Conv2DPattern, the QuantizerPattern that does this transformation:

E2

class Conv2DPattern(QuantizerPattern):
    def __init__(self, calibration_callback):
        self.calibration_callback = calibration_callback
        super().__init__(calibration_callback)
        self.input = wildcard()
        self.conv_weight = wildcard()
        self.inputs = [self.input, self.conv_weight]
        self.conv2d = is_op("nn.conv2d")(self.input, self.conv_weight)
        self.pattern = self.conv2d
        self.attrs = None
        self.weight_channel_axis = None
        self.data_channel_axis = None
        self.channels = None

    def callback(self, pre, post, node_map):
        self.args = [node_map[i][0] for i in self.inputs]
        conv2d = node_map[self.conv2d][0]

        self.out_dtype = conv2d.checked_type.dtype

        self.get_attrs(conv2d.attrs, infer_type(self.args[1]).checked_type.shape)

        self.create_scale_zps("conv2d_data", "conv2d_weight")
        self.quantize_args()

        conv_scale = self.scale_zps[0] * self.scale_zps[2]  # data_scale * weight_scale

        # Conv zp is zero since QNN deals with input zps for us
        conv_zp = relay.const(0, dtype="int32")
        # args = [quantized_data, quantized_weight, data_zp, weight_zp, data_scale, weight_scale]
        args = self.quantized_args[0:2] + [self.scale_zps[i] for i in [1, 3, 0, 2]]

        if self.padding is not None:

            top, left, bottom, right = [p.value for p in get_pad_tuple2d(self.padding)]
            if self.kernel_layout == "OIHW":
                pad_width = ((0, 0), (0, 0), (top, bottom), (left, right))
            elif self.kernel_layout == "HWIO":
                pad_width = (
                    (top, bottom),
                    (left, right),
                    (0, 0),
                    (0, 0),
                )
            pad_val = 0
            args[0] = relay.op.nn.pad(args[0], pad_width, pad_val)

        # Construct quantized qnn.conv2d and dequantize
        qnn_call = self.create_conv(args)
        dequantized_call = relay.qnn.op.dequantize(
            qnn_call, conv_scale, conv_zp, out_dtype=self.out_dtype, axis=self.data_channel_axis
        )

        return dequantized_call

    def quantize_args(self):
        """Helper to quantize the arguments to the qnn.conv2d."""
        quantized_data = relay.qnn.op.quantize(
            self.args[0], self.scale_zps[0], self.scale_zps[1], axis=self.data_channel_axis
        )
        quantized_weight = relay.qnn.op.quantize(
            self.args[1], self.scale_zps[2], self.scale_zps[3], axis=self.weight_channel_axis
        )
        self.quantized_args = [quantized_data, quantized_weight]

    def create_conv(self, args):
        """Creates the qnn.conv2d.

        Parameters
        ----------
        args : List[relay.Expr]
            Quantized arguments for the qnn.conv2d.

        Returns
        -------
        q_conv2d : relay.Expr
            Quantized version of the pattern.
        """
        return relay.qnn.op.conv2d(*args, **self.attrs)

    def get_kernel_size(self, kernel_shape, kernel_layout):
        """Body omitted for brevity, gets the kernel size"""
        pass

    def get_attrs(self, attrs, kernel_shape):
        """Body omitted for brevity, constructs attrs for qnn.conv2d"""
        pass

There is a QuantizerPattern for every pattern we want to quantize. The patterns we currently support are Conv2DPattern, Conv2DBiasAddPattern, DensePattern, AddPattern, and MultiplyPattern, but it is easy to add your own if you wish to support a different pattern.

The Quantizer takes in the function to quantize, the parameters of the function, and a list of QuantizerPatterns. Let’s say we only want to quantize conv2d ops and dense ops. Then, we could create a Quantizer like this:

E3

quantizer = Quantizer(func, params, [Conv2DPattern(), DensePattern()])

Internally, the Quantizer first partitions the graph into functions containing each pattern, then rewrites the patterns to be quantized. It also constructs two functions which return tuples containing many intermediate subgraphs, and stores indices mapping specific scale and zero point variables to these subgraphs, so they can be run and used in data-aware calibration.

For example, to pick values for scale_var_0, zp_var_0, scale_var_1 and zp_var_1 in E1, we might want to look at the values of %data, %weight and %0 (the output of the nn.conv2d) in E0, as well as the values of %0 (the quantized data), %1 (the quantized weight) and %3 (the result of qnn.conv2d after converting back to float32) in E1. Relay doesn’t give us a good way to access intermediate values in functions, so I put all these values into tuples and return the tuple as the output of the function. For the original function in E0, we create a function whose output is (data, weight, %0), and for the quantized function in E1, we create a function whose output is (%0, %1, %3). For longer functions, the tuple would be a lot longer. For each pattern matched in the graph, we also store indices into the tuple so that we can extract the useful values during calibration.
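
The tuple-plus-indices idea can be sketched in plain Python. This is only an illustration under assumed names: original_with_intermediates, pattern_indices and get_layer_values are hypothetical, a matrix multiply stands in for nn.conv2d, and the real implementation works on Relay functions rather than NumPy arrays.

```python
import numpy as np


def original_with_intermediates(data, weight):
    """Stand-in for E0 rewritten to return (data, weight, %0) as a tuple."""
    out = data @ weight.T  # stand-in for nn.conv2d
    return (data, weight, out)


# For each matched pattern, remember where its values live in the tuple.
# The key name and layout here are hypothetical.
pattern_indices = {
    "conv2d_0": {"inputs": [0, 1], "output": 2},
}


def get_layer_values(tuple_output, indices):
    """Extract one pattern's inputs and output from the tuple."""
    inputs = [tuple_output[i] for i in indices["inputs"]]
    output = tuple_output[indices["output"]]
    return inputs, output


values = original_with_intermediates(
    np.ones((2, 3), dtype="float32"), np.ones((4, 3), dtype="float32")
)
inputs, output = get_layer_values(values, pattern_indices["conv2d_0"])
```

With more patterns in the graph, the tuple simply grows and each pattern gets its own entry in the index map, so calibration code never has to know the tuple layout.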

These functions never have to be built or indexed into by users. Utility functions in the Calibrator do this automatically (more on this in the next section).

Calibration

Calibration involves four classes: QuantizerPattern, CalibrationInfo, CalibrationCallback, and Calibrator.

Each QuantizerPattern has a method, calibrate_pattern, which is used during calibration to pick scale and zero point values.

calibrate_pattern returns a map of the names of scales and zero point variables to the value we are setting them as.

It takes CalibrationInfo as an argument. CalibrationInfo contains the names of the scale and zero point variables for every qnn.quantize in the pattern, in pairs: [(scale_var_1, zp_var_1), (scale_var_2, zp_var_2)]. CalibrationInfo also exposes the intermediate values in the graph through the methods get_unquantized_layer_inputs, get_unquantized_layer_outputs, get_quantized_layer_inputs, and get_quantized_layer_outputs. Each of these functions takes an input to the original function, runs the quantized or unquantized function, and returns the values corresponding to AST nodes in the pattern.

For example, for the CalibrationInfo object corresponding to the pattern in E0 and E1, get_unquantized_layer_inputs returns values corresponding to [data, weight], and get_unquantized_layer_outputs returns the value corresponding to %0 in E0. get_quantized_layer_inputs returns values corresponding to [%0, %1] in E1, and get_quantized_layer_outputs returns the value corresponding to %3 in E1.

The CalibrationInfo object also optionally contains a DatasetManager. The DatasetManager is a simple wrapper class for exposing datasets from other ML frameworks to the Calibrator in a unified way. For example, there is a TFDatasetManager, which wraps a TensorFlow dataset:

E4

class TFDatasetManager(DatasetManager):
    """DatasetManager wrapping a tensorflow dataset."""

    def __init__(self, tf_dataset, batch_size, total_batches):
        self.idx = 0
        self.total_batches = total_batches
        # Stored with a leading underscore so it does not shadow the batch_size() method
        self._batch_size = batch_size
        self.tf_dataset = tf_dataset
        self.tf_iter = iter(self.tf_dataset)

    def get_next_batch(self):
        if self.is_empty():
            raise IndexError
        self.idx += 1

        data, label = next(self.tf_iter)

        return [data.numpy()], label.numpy()

    def num_batches(self):
        return self.total_batches

    def batch_size(self):
        return self._batch_size

    def is_empty(self):
        return self.idx >= self.total_batches

    def reset(self):
        self.tf_iter = iter(self.tf_dataset)
        self.idx = 0
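
As a rough sketch of how another data source could be wrapped behind the same interface, here is a NumPy-backed manager. NumpyDatasetManager is a hypothetical name that is not part of the PR, and the DatasetManager base class below is only a stand-in whose method list is assumed from E4.

```python
import numpy as np


class DatasetManager:
    """Minimal stand-in for the RFC's base class (interface assumed from E4)."""

    def get_next_batch(self):
        raise NotImplementedError

    def num_batches(self):
        raise NotImplementedError

    def batch_size(self):
        raise NotImplementedError

    def is_empty(self):
        raise NotImplementedError

    def reset(self):
        raise NotImplementedError


class NumpyDatasetManager(DatasetManager):
    """Hypothetical DatasetManager serving batches from in-memory arrays."""

    def __init__(self, data, labels, batch_size):
        self._data = data
        self._labels = labels
        self._batch_size = batch_size
        # Drop the trailing partial batch so every batch has the same shape.
        self._total_batches = len(data) // batch_size
        self._idx = 0

    def get_next_batch(self):
        if self.is_empty():
            raise IndexError("no batches left; call reset()")
        start = self._idx * self._batch_size
        stop = start + self._batch_size
        self._idx += 1
        # Same return shape as E4: a list of graph inputs, then the labels.
        return [self._data[start:stop]], self._labels[start:stop]

    def num_batches(self):
        return self._total_batches

    def batch_size(self):
        return self._batch_size

    def is_empty(self):
        return self._idx >= self._total_batches

    def reset(self):
        self._idx = 0


manager = NumpyDatasetManager(
    np.zeros((10, 28, 28, 1), dtype="float32"), np.arange(10), batch_size=5
)
```

Returning the inputs as a list keeps the interface usable for graphs with more than one input.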

Inputs from the DatasetManager can be passed to get_quantized_layer_inputs, get_quantized_layer_outputs, get_unquantized_layer_inputs, and get_unquantized_layer_outputs.

Let’s look at writing a data-aware calibrate_pattern for MyConv2dPattern. We’ll use the DatasetManager to get inputs for the original function, and pass them to get_unquantized_layer_inputs to get the data and weight for the Conv2D op.

E5

import numpy as np

class MyConv2dPattern(Conv2DPattern):

    def calibrate_pattern(self, calibration_info):
        scale_zp_values = {}

        # Get an input to the original graph (get_next_batch returns inputs and labels)
        inputs, _ = calibration_info.dataset_manager.get_next_batch()

        # Run the original function with the inputs and get values for data and weight in this pattern
        data_value, weight_value = calibration_info.get_unquantized_layer_inputs(inputs)

        # calibration_info.input_scale_zps = [[data_scale, data_zp], [weight_scale, weight_zp]]
        data_scale_name = calibration_info.input_scale_zps[0][0].name_hint
        data_scale = np.max(data_value) / 128

        scale_zp_values[data_scale_name] = data_scale

        # ...
        # Set all the other scales and zero points
        # ...
        # scale_zp_values would look something like {'data_scale': 0.02, 'data_zp': 0, 'weight_scale': 0.05, 'weight_zp': 0.1}
        return scale_zp_values

Note: In E5 (and E8) I only use one input from the DatasetManager for the sake of simplicity. In most data-aware algorithms, however, we will use many different inputs to the graph to calculate a lot of intermediate values, which will then be used to calculate scales and zero points.
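
As a small illustration of using several inputs, the sketch below averages the per-batch maximum absolute value before turning it into a scale. The function average_max_scale is a hypothetical helper, not the PR's AverageMaxCalibrationCallback, and it assumes symmetric signed quantization, dividing by 127 rather than the 128 used in E5.

```python
import numpy as np


def average_max_scale(batches, num_bits=8):
    """Average the per-batch max |x| and map it onto the signed integer range."""
    maxes = [float(np.max(np.abs(b))) for b in batches]
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    return float(np.mean(maxes)) / qmax


# Three made-up batches of intermediate values for one layer input.
batches = [
    np.array([0.5, -1.0]),
    np.array([2.0, 0.1]),
    np.array([-3.0, 1.0]),
]
scale = average_max_scale(batches)
```

Averaging over batches makes the scale less sensitive to a single outlier batch than taking the maximum over the whole dataset.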

Being able to write pattern-specific calibrate_pattern methods gives us more flexibility in constructing scales and zero points. For example, to create per-channel scales, we need to know the number of channels a Conv2D op has, and the number of units a Dense op has.
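
A minimal sketch of per-channel scales, assuming an OIHW weight layout and a symmetric max-abs rule. The helper per_channel_scales is hypothetical; the point is only that the result has one entry per output channel, which is why the pattern needs to know the channel count and axis.

```python
import numpy as np


def per_channel_scales(weight, channel_axis=0, qmax=127):
    """Return one symmetric scale per channel along channel_axis."""
    reduce_axes = tuple(i for i in range(weight.ndim) if i != channel_axis)
    max_abs = np.max(np.abs(weight), axis=reduce_axes)
    # Guard against all-zero channels, which would otherwise give a scale of 0.
    max_abs = np.maximum(max_abs, 1e-8)
    return max_abs / qmax


# Made-up OIHW weight with 3 output channels.
weight = np.zeros((3, 2, 1, 1), dtype="float32")
weight[0] = 1.27
weight[1] = -2.54
scales = per_channel_scales(weight, channel_axis=0)
```

The resulting vector would be passed together with the matching axis argument, as E2 does with weight_channel_axis.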

In E5, however, we’re not actually using any pattern-specific information. We’ve written calibrate_pattern assuming that there are two values being quantized, data and weight, and two corresponding qnn.quantize ops. This is true for the Conv2DPattern (see E1), the DensePattern, and any other binop we want to quantize. If we wanted to implement the same method on the DensePattern, we would have to copy Conv2DPattern’s calibrate_pattern into DensePattern.

To reduce code duplication, we define a class called CalibrationCallback, which also has a method called calibrate_pattern. Each QuantizerPattern optionally takes in a CalibrationCallback as an argument, and its calibrate_pattern calls the calibrate_pattern of the CalibrationCallback:

E6

class QuantizerPattern(DFPatternCallback):
    def __init__(self, calibration_callback):
        super().__init__()
        self.calibration_callback = calibration_callback

    def calibrate_pattern(self, calibration_info):
        return self.calibration_callback.calibrate_pattern(calibration_info)

So, if we don’t override the QuantizerPattern’s calibrate_pattern method, we’ll call the calibrate_pattern method of the CalibrationCallback that is passed in. Let’s take a look at what using CalibrationCallbacks looks like:

E7

cc = MyCalibrationCallback()
conv2d_pattern = Conv2DPattern(cc)
dense_pattern = DensePattern(cc)
quantizer = Quantizer(func, params, [conv2d_pattern, dense_pattern])

Now let’s implement MyCalibrationCallback. This time, we’ll write the calibrate_pattern method to be generic so that it supports any number of qnn.quantize ops in the pattern, and can be passed to any QuantizerPattern:

E8

class MyCalibrationCallback(CalibrationCallback):
    def calibrate_pattern(self, calibration_info):
        scale_zp_values = {}

        # get_next_batch returns (inputs, labels); only the inputs are needed here
        inputs, _ = calibration_info.dataset_manager.get_next_batch()

        # unquantized_values = [value_1, value_2, ..., value_n], one per qnn.quantize in the pattern
        unquantized_values = calibration_info.get_unquantized_layer_inputs(inputs)

        # calibration_info.input_scale_zps = [[scale_var_1, zp_var_1], [scale_var_2, zp_var_2], .., [scale_var_n, zp_var_n]]
        for i in range(len(calibration_info.input_scale_zps)):
            scale_name = calibration_info.input_scale_zps[i][0].name_hint
            zp_name = calibration_info.input_scale_zps[i][1].name_hint

            # Calculate simple scale and zero point values
            scale_zp_values[scale_name] = np.max(unquantized_values[i]) / 128
            scale_zp_values[zp_name] = np.mean(unquantized_values[i]) / 128

        return scale_zp_values
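
To show the shape of the data flowing through a generic callback without needing TVM, here is a small stand-alone harness. StubCalibrationInfo is a hypothetical stand-in that exposes only the two members a callback like the one above reads, SimpleNamespace objects mimic Relay variables with a name_hint, and the rule used (symmetric max-abs scale, zero point 0) is a simplification chosen for the example rather than the rule from E8.

```python
from types import SimpleNamespace

import numpy as np


class StubCalibrationInfo:
    """Hypothetical stand-in exposing only what the callback reads."""

    def __init__(self, layer_inputs, var_names):
        self._layer_inputs = layer_inputs
        # Mimic relay variables, which expose a name_hint attribute.
        self.input_scale_zps = [
            [SimpleNamespace(name_hint=s), SimpleNamespace(name_hint=z)]
            for s, z in var_names
        ]

    def get_unquantized_layer_inputs(self, inputs):
        # The real method would run the unquantized graph on `inputs`.
        return self._layer_inputs


def calibrate_pattern(calibration_info, inputs):
    """Generic rule: works for any number of quantized inputs in the pattern."""
    scale_zp_values = {}
    values = calibration_info.get_unquantized_layer_inputs(inputs)
    for (scale_var, zp_var), value in zip(calibration_info.input_scale_zps, values):
        scale_zp_values[scale_var.name_hint] = np.float32(np.max(np.abs(value)) / 127)
        scale_zp_values[zp_var.name_hint] = np.int32(0)
    return scale_zp_values


info = StubCalibrationInfo(
    [np.array([-1.27, 0.5]), np.array([2.54, -0.1])],
    [("data_scale", "data_zp"), ("weight_scale", "weight_zp")],
)
result = calibrate_pattern(info, inputs=None)
```

The returned dictionary has one scale entry and one zero point entry per qnn.quantize in the pattern, keyed by variable name, which is the format calibrate_pattern is described as returning.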

The Calibrator class manages calibration at a high level, and calls calibrate_pattern. It maintains a list of all the scale and zero point values returned from calibrate_pattern, and also updates the CalibrationInfo that is passed to calibrate_pattern. It takes in a Quantizer as an argument, since it needs to access information from the Quantizer. It also optionally takes a DatasetManager, which is passed to the calibrate_pattern function through the CalibrationInfo object.

So, to calibrate a function completely, all you need to do is construct a Calibrator, and call the method calibrate:

E9

cc = MyCalibrationCallback()
conv2d_pattern = Conv2DPattern(cc)
dense_pattern = DensePattern(cc)
quantizer = Quantizer(func, params, [conv2d_pattern, dense_pattern])
calibrator = QuantizationCalibrator(quantizer)
calibrated_func = calibrator.calibrate()

Requantization

qnn.requantize takes an int8 value, some scale and zero points, and transforms it into another int8 value with different scale and zero points, without going back to float32. To get a fast quantized workload, we want to stay in int8 for as long as possible. In the quantization step, we don’t use any qnn.requantize ops. We only use qnn.quantize and qnn.dequantize in that step for three reasons:

qnn.requantize requires scales and zero points to be constants, not relay expressions. We can’t allow the scales and zero points to be expressions without sacrificing performance.
If we were to directly introduce qnn.requantize during the quantization step, we would not be able to quantize each pattern individually because qnn.requantize requires scale and zero point values from the next pattern.
For quantization methods like KL-divergence, it is useful to have access to the output value of the quantized layer, so we can compare it directly to the original output value. For example, we want to be able to compare %3 from E1 to %0 from E0, and adjust our scales and zero points so that the values are as close as possible. If %3 in E1 were a qnn.requantize instead of a qnn.dequantize, we could only look at the output of the qnn.requantize, which is quantized, so the comparison would not be useful.
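
The relationship between the three ops can be illustrated with a NumPy sketch: requantizing is numerically the same as dequantizing and quantizing again, but it never materializes the float32 tensor. The helpers below are hypothetical and use a floating-point multiplier for clarity; this sketch makes no claim about how qnn.requantize is lowered in TVM.

```python
import numpy as np


def quantize(x, scale, zp):
    """Affine quantization to int8."""
    return np.clip(np.round(x / scale) + zp, -128, 127).astype(np.int8)


def dequantize(q, scale, zp):
    """Affine dequantization to floating point."""
    return scale * (q.astype(np.float64) - zp)


def requantize(q, in_scale, in_zp, out_scale, out_zp):
    """Rescale integers directly from (in_scale, in_zp) to (out_scale, out_zp)."""
    shifted = q.astype(np.int32) - in_zp
    scaled = np.round(shifted * (in_scale / out_scale)) + out_zp
    return np.clip(scaled, -128, 127).astype(np.int8)


# Made-up int32 accumulator values and quantization parameters.
acc = np.arange(-300, 300, 7, dtype=np.int32)
in_scale, in_zp = 0.02, 0
out_scale, out_zp = 0.05, 3

direct = requantize(acc, in_scale, in_zp, out_scale, out_zp)
via_float = quantize(dequantize(acc, in_scale, in_zp), out_scale, out_zp)
max_diff = int(np.max(np.abs(direct.astype(np.int32) - via_float.astype(np.int32))))
```

Note that requantize needs out_scale and out_zp, which belong to whatever consumes the value. That is the dependency on the next pattern described in the second reason above.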

Here’s how we requantize:

E10

cc = MyCalibrationCallback()
conv2d_pattern = Conv2DPattern(cc)
dense_pattern = DensePattern(cc)
quantizer = Quantizer(func, params, [conv2d_pattern, dense_pattern])
calibrator = QuantizationCalibrator(quantizer)
calibrated_func = calibrator.calibrate()
requantized_func = Requantizer().requantize(calibrated_func)

End use

Since this framework is designed to be flexible and modular, there are a lot of different parts that an end user probably does not want to deal with. We provide a Relay function transformation pass that wraps quantization, calibration and requantization together. The user only has to specify the QuantizerPatterns they want to use.

However, advanced users can call the workflow directly, or combine different parts of the workflow to create new relay function passes.

Output on a simple MNIST graph

Let’s look at calibrating a small MNIST graph:

E11

cc = AverageMaxCalibrationCallback()
quantizer = Quantizer(mnist_func, params, [Conv2DBiasAddPattern(cc), Conv2DPattern(cc), DensePattern(cc), AddPattern(cc), MultiplyPattern(cc)], skip_first=False)
calibrator = QuantizationCalibrator(quantizer, target='llvm', ctx=tvm.cpu(), dataset_manager=mnist_train_manager)
calibrated_func = calibrator.calibrate()
calibrated_mod = tvm.ir.IRModule.from_expr(calibrated_func)
requantized_func = Requantizer().requantize(calibrated_func)

E12 mnist_func

fn (%flatten_input: Tensor[(5, 28, 28, 1), float32], %dense_1/kernel:0: Tensor[(128, 10), float32], %dense_1/bias:0: Tensor[(10), float32], %dense/kernel:0: Tensor[(784, 128), float32], %dense/bias:0: Tensor[(128), float32]) {
  %0 = nn.batch_flatten(%flatten_input);
  %1 = transpose(%dense/kernel:0, axes=[1, 0]);
  %2 = nn.dense(%0, %1, units=None);
  %3 = add(%2, %dense/bias:0);
  %4 = nn.relu(%3);
  %5 = transpose(%dense_1/kernel:0, axes=[1, 0]);
  %6 = nn.dense(%4, %5, units=None);
  %7 = add(%6, %dense_1/bias:0);
  nn.softmax(%7)
}

E13 MNIST model after quantization, calibration and requantization:

fn (%flatten_input: Tensor[(5, 28, 28, 1), float32]) -> Tensor[(5, 10), float32] {
  %0 = nn.batch_flatten(%flatten_input) /* ty=Tensor[(5, 784), float32] */;
  %1 = qnn.quantize(%0, 0.00390625f /* ty=float32 */, 0 /* ty=int32 */, out_dtype="int8") /* ty=Tensor[(5, 784), int8] */;
  %2 = qnn.quantize(meta[relay.Constant][0] /* ty=Tensor[(128, 784), float32] */, 0.00453253f /* ty=float32 */, 0 /* ty=int32 */, out_dtype="int8", axis=0) /* ty=Tensor[(128, 784), int8] */;
  %3 = qnn.dense(%1, %2, 0 /* ty=int32 */, 0 /* ty=int32 */, 0.00390625f /* ty=float32 */, 0.00453253f /* ty=float32 */, units=128, out_dtype="int32") /* ty=Tensor[(5, 128), int32] */;
  %4 = qnn.quantize(meta[relay.Constant][1] /* ty=Tensor[(128), float32] */, 0.00390625f /* ty=float32 */, 0 /* ty=int32 */, out_dtype="int32", axis=0) /* ty=Tensor[(128), int32] */;
  %5 = nn.bias_add(%3, %4) /* ty=Tensor[(5, 128), int32] */;
  %6 = qnn.requantize(%5, 1.77052e-05f /* ty=float32 */, 0 /* ty=int32 */, 0.0267685f /* ty=float32 */, 0 /* ty=int32 */, axis=1, out_dtype="int8");
  %7 = nn.relu(%6);
  %8 = qnn.quantize(meta[relay.Constant][2] /* ty=Tensor[(10, 128), float32] */, 0.00579835f /* ty=float32 */, 0 /* ty=int32 */, out_dtype="int8", axis=0) /* ty=Tensor[(10, 128), int8] */;
  %9 = qnn.dense(%7, %8, 0 /* ty=int32 */, 0 /* ty=int32 */, 0.0267685f /* ty=float32 */, 0.00579835f /* ty=float32 */, units=10, out_dtype="int32");
  %10 = qnn.quantize(meta[relay.Constant][3] /* ty=Tensor[(10), float32] */, 0.0267685f /* ty=float32 */, 0 /* ty=int32 */, out_dtype="int32", axis=0) /* ty=Tensor[(10), int32] */;
  %11 = nn.bias_add(%9, %10);
  %12 = qnn.dequantize(%11, 0.000155213f /* ty=float32 */, 0 /* ty=int32 */, axis=1);
  nn.softmax(%12)
}

(Note that I didn’t skip the first or last pattern when quantizing this model, but you can if you want to).

Johnson9009 · February 19, 2021, 1:50am

What’s the difference between the current quantization framework and this one? Why do we need to implement a new one? Will the current quantization framework be replaced by this one? Thanks.

masahi · February 19, 2021, 8:51am

Thank you very much for taking on the daunting task of developing the new auto-quantization system! I believe this is a very important feature to have and get right, for high performance deployment in practice.

Can you summarize what the scope of your current PR is and what is left for future work? For example, I can see we want to support more calibration methods and data types. Other things I’m interested in are:

Asymmetric quantization support. Looking at the code, since there are zero point stuff everywhere, I assume you have asymmetric support in mind. But the code is written in a way that only supports symmetric. For example, you are always quantizing to int8. The last dequantize op will always remain after requantize rewrite, but it seems you are always setting the output zero point of dequantize to 0. So the framework can only support symmetric quantization.
Which quantized ops are supported by default. I see only qconv and qdense supported in the PR, but there are more qnn ops.
Per channel quantization (it seems it is in the PR but it is not clear how well it is tested compared to per tensor). Per channel is very important, to maintain high accuracy for light-weight model like mobilenet v2/v3.
Dynamic quantization (important for transformer models)
A tutorial

masahi · February 19, 2021, 11:51am

I’ve done the first pass through the code and left some comments there. I generally liked pattern matching based QNN op rewrite and calibration, but I have a big concern around how you are approaching requantize.

First of all, regardless of the reason for introducing requantize in a later pass, I think the current implementation is too ad hoc and brittle when dequantize/quantize are not done back to back, see [WIP] [Quantization] Quantization in TVM by electriclilies · Pull Request #7474 · apache/tvm · GitHub. I’m pretty sure we will end up with more dequantize/quantize than necessary. For standard imagenet models, there should be only one quantize/dequantize pair without counting weight/bias quantize, and possibly one more between the last convolution and dense layer. Anything more than that is not acceptable for integer-only quantization, because that’s the norm in PyTorch/TFLite. In addition to accuracy and performance metrics on imagenet models, I’d like to see the number of quantize/dequantize ops remaining after the requantize rewrite, for each imagenet model.

Second, if I understand your explanation on why requantize is done this way, I think the root issue boils down to the fact that you are doing calibration on a QNN graph. Sure, if you need to instantiate a QNN graph before scales and zps are determined, you cannot create requantize op. But reading your code I realized that calibration can be done either on the fp32 or QNN graph, so if I decide to calibrate only on the fp32 graph, your argument for introducing requantize later wouldn’t apply anymore. Moreover, even if I decided to calibrate on the QNN graph, I don’t see why we need to go through the complicated and error-prone rewriting process to introduce requantize ops. After you determine all quantization parameters, you should be able to create a new QNN graph from the fp32 graph again, this time using requantize ops. The way I convert a quantized PyTorch model to QNN is actually similar: first I do one pass to make all qparams explicit in the graph, and then I do another pass to instantiate each QNN node, including requantize.

Finally, I’m not sure if doing calibration on a QNN graph is a good idea. I believe the standard approach is to calibrate on a fp32 graph, and then construct a quantized graph with requantize using the calculated qparams. You mentioned something about KL, but if I remember correctly KL only needs the histogram of activations, so absolute values don’t matter and dequantize is not necessary (I could be wrong, though). Instantiating a QNN graph before qparams are chosen also introduces a nasty problem of how to decide the initial parameters. I think the final parameters depend on the initial ones, so the choice of initial values shouldn’t be arbitrary. Your code initializes all scales to be 1, but I think that is incorrect [WIP] [Quantization] Quantization in TVM by electriclilies · Pull Request #7474 · apache/tvm · GitHub. Anyway, I think the rationale for doing calibration on a QNN graph is questionable and it should only be for niche use at best; I’m not convinced that it justifies the introduction of all the rewriting complexity for requantize.

Overall, if we want to deviate from the standard approach that is proven to work well, there should be a very good reason to do so. And ideally the justification should come with a working demonstration, rather than hand-waving explanations alone.

We can add dequantize after requantize, so this is not a problem.

electriclilies · February 19, 2021, 10:56pm

The scope of the current PR is adding the initial framework. Right now, I support quantizing nn.conv2d, nn.conv2d → nn.bias_add, nn.conv2d → add, nn.dense, nn.dense → nn.bias_add, nn.dense → add, as well as normal add and multiply. qnn.add and qnn.multiply were doing some weird things, so I decided to quantize add and multiply just using normal relay ops. The qnn ops I do not currently support are concatenate and convolution transpose; however, it is easy to add more QuantizerPatterns, so this can be done in the future or in this PR if it is necessary. I’m also introducing just two quantization algorithms in the initial PR, global calibration and average max calibration.

There is support for asymmetric and per-channel quantization. I have tested per-channel calibration quite a lot. I have successfully run the per-channel calibration algorithm on a cifar10 graph, resnet18 and a small mnist graph.

With regard to dynamic quantization: I was not aware of the existence of dynamic quantization until a month ago. It’s possible that we could add it into the current framework by simply adding relay expressions in the place of scale and zero point variables. However, I have not looked into this in depth and it will probably require some more work (I’m not sure how much though). Right now, this is possible future work.

I do intend a tutorial to be part of this PR; I just wanted to get it up to get some initial feedback about the design and post the PR.

With regard to asymmetric quantization, I’m not sure where you got that I don’t support asymmetric quantization. The AverageMaxCalibrationCallback does do this, but one could write another calibration method that returns any value for the zero point. In fact, the GlobalCalibrationCallback returns an arbitrary zero point. I do set the zero point of qnn.dequantize to zero in the Conv2D and Dense patterns, however, qnn.conv2d and qnn.dense shift the data and weight by the zero point before calculating the output. Essentially, qnn.conv2d and qnn.dense have already “dequantized” with respect to the zero points, so when we dequantize the output of conv2d and dense, we only need to dequantize with regard to the input scale, not the input zero point.

With regard to requantize, I do agree that the current implementation is a little brittle in that you have to explicitly list the ops that are allowed between qnn.dequantize and qnn.quantize. I wasn’t able to come up with a better method than this— the pattern matcher doesn’t allow us to do any op except a specific subset of ops, and even then I think we’d run into the same problem. However, it is easy to add more of these ops, and the current implementation does the job: on resnet-18, I am able to remove all dequantizes except the one at the end and all quantizes except ones quantizing an input to the original graph.

The reason why requantize is a separate step is partly that requantize doesn’t support expressions for scale and zero points, but also that to insert a requantize op, we need to know the scale and zero points of the next pattern. Your suggestion is doing something like dequantize(requantize(data, input_scale, input_zp, output_scale, output_zp), output_scale, output_zp). But what are the output_scale and output_zps?

The requantize and dequantize makes the output_scale and output_zp do nothing, essentially, so we could put an arbitrary constant in there, but then we end up having to do the same requantize step that I am doing now, except we have to remove the dequantize as well as the requantize.

The other option is to try to set output_scale and output_zp to the scale and zero point of whatever is consuming the requantize, like so: quantize(dequantize(requantize(data, input_scale, input_zp, output_scale, output_zp), output_scale, output_zp), output_scale, output_zp).

However, this would require us to look ahead and quantize the consumer of the current pattern before quantizing the current pattern itself. Additionally, if there are other ops that we are not matching using a QuantizerPattern in-between the dequantize and the quantize, we also have to copy all of those. We would still have to look ahead during quantization were we to calibrate with only the FP32 graph and insert only requantizes. As a result, it would be tricky to make the QuantizerPatterns completely modular and would limit the extensibility of the framework. Additionally, if we have multiple consumers of the pattern we are trying to quantize (i.e., branches), it’s not clear how to proceed. Making two copies of quantize(dequantize(requantize(data, input_scale, input_zp, output_scale, output_zp), output_scale, output_zp), output_scale, output_zp) with different variables for output_scale and output_zp doesn’t make a ton of sense, but we’d have to do something like that to be consistent with cases where we do not branch.

To address your concern about the use of dominator patterns to do requantize: I added allow_overlapping_patterns to the pattern matcher to address this exact issue. With that option on, if we have multiple quantizes consuming one dequantize, as in a resnet or other graph with branches, we will match two patterns, dequantize → quantize_a and dequantize → quantize_b, and turn them into two requantizes. This is actually the behavior that we want, because in most cases, we will actually have: quantize(dequantize(data, scale_out, zp_out), scale_a, zp_a) and quantize(dequantize(data, scale_out, zp_out), scale_b, zp_b), so we should turn them into separate requantize ops: requantize(data, scale_out, zp_out, scale_a, zp_a) and requantize(data, scale_out, zp_out, scale_b, zp_b).

There may be a few cases where scale_a == scale_b and zp_a == zp_b and we could technically combine the two requantizes into one, but this is an edge case, and will not occur in most data-aware quantization methods. I can add a step to remove duplicate requantizes if you think it is necessary, though.

In my opinion, I think the robustness we gain here with regard to dealing with branches is worth adding the extra pass. Like Ziheng said, the current quantization code does have to do a lot of tricky analysis because of branching. Keeping the QuantizerPatterns completely modular and dealing with any branching that occurs in the requantization step completely eliminates the need to deal with it during quantization or calibration. And even in requantization, I don’t actually ever do any special analysis to figure out what to do in the case of a branch — it’s all taken care of by the pattern matcher.

To address your point about initializing scales to 1: I just initialize all the scales and zero points so that the tuple graph will run. Users will never be able to access values that were calculated using the initial scales and zero points. They are required to pass in their own initial guesses, we are calibrating in topological order, and my utility functions control indexing into the output of the tuples. I picked one and zero arbitrarily; the values don’t really matter because they will never be used in a calculation that is exposed to the user.

masahi · February 19, 2021, 11:40pm

See [WIP] [Quantization] Quantization in TVM by electriclilies · Pull Request #7474 · apache/tvm · GitHub. You don’t pass the out_dtype param to qnn.quantize, so you are always using the default int8. To support asymmetric you need to pass out_dtype=uint8.

This is again not correct for the asymmetric case. A nonzero zero point is rare, but when there is one, you need to pass it to qnn.dequantize. This is handled by the requantize rewrite when the pattern matching succeeds, but as I said, the last dequantize will always remain because there is no quantize op to pair it up with.

masahi · February 20, 2021, 12:05am

Please read my second and third point in the post [RFC][Quantization] Quantization in TVM - #4 by masahi more carefully. I’m not asking you to think more or work harder on improving requantize rewrite.

My main points are:

I believe the whole requantize rewrite complexity is unnecessary. Why do you need to reuse the half-baked QNN graph used for calibration? Why not create a new QNN graph from scratch, using requantize ops with calculated qparams?
I don’t think doing calibration on a QNN graph is a good idea. This is the very reason you’ve hit the problem of requantize not accepting non constant qparams. QNN is designed to be translated from a graph where all qparams have been already determined.

In particular, I’d like to hear your thoughts on the following suggestion:

I believe the current design deviates too much from the established standard without reasonable justification. I prefer doing calibration on a fp32 graph, and creating a QNN graph with requantize ops in one pass. Calibration on a QNN graph can be optionally supported, but this should not bring the complexity of requantize pat matching.

masahi · February 20, 2021, 12:46am

No, this is not optimal. There should be only one requantize and the same node should always be quantized using the same qparams. Also see https://github.com/apache/tvm/pull/7474#discussion_r579072951; to me it looks like you are quantizing the same arg more than once.

No framework I am aware of requires users to pass in the initial guesses. Why do you think this is a good idea? The values chosen by users affect the accuracy of models, so they need to have a very educated guess. You are putting unnecessary burden on users, due to your framework design.

anijain2305 · February 22, 2021, 11:25pm

Thanks Lily (@electriclilies ) for the hard work. Automatic quantization is a must for TVM. I appreciate that a lot of this design is driven by how a TVM user is going to perform quantization, and this is evident in how you use PatternMatchers to facilitate quantization. That part is pretty cool.

The main concern I have is calibration on the QNN graph, where I share the same concern as Masahi.

It is unclear if we should calibrate on QNN graph. The standard way of quantizing is to collect the FP32 intermediate outputs, find suitable scale and zero points and then prepare a quantized graph. Your approach turns this around, which adds complexity and uses QNN ops in a non-standard way.

Now, you can argue that the errors accumulate and propagate as we go deeper into the network, so the deeper layers’ original FP32 values might deviate significantly from the actual quantized values. Therefore, a TVM user should be able to see the actual quantized values as well during calibration. But the current design does not solve this issue either, because (a) quantize-dequantize gives different values than a requantize op, since requantize approximates float division with fixed point division, and (b) running conv2d in FP32 vs int8 leads to different results, so the two are not the same when you add quantize-dequantize layers. On top of this, the deviation quickly grows when we start using fewer than 8 bits (which I expect TVM users to do).
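Point (a) can be illustrated with a small sketch in plain Python. It compares a float rescale against an integer-only rescale that represents the ratio of scales as a 31-bit significand plus a shift. This is an assumed, simplified model of fixed-point requantization, not TVM's actual lowering; the function names, the rounding rule (round half up in the fixed-point path) and the example scales are all hypothetical.

```python
import math

def clamp(q, qmin=-128, qmax=127):
    return max(qmin, min(qmax, q))

def fixed_point_multiplier(multiplier):
    """Split a positive float into (significand, exponent) such that
    multiplier ~= significand * 2**(exponent - 31)."""
    mantissa, exponent = math.frexp(multiplier)  # 0.5 <= mantissa < 1
    significand = round(mantissa * (1 << 31))
    if significand == (1 << 31):  # rounding pushed the mantissa up to 1.0
        significand >>= 1
        exponent += 1
    return significand, exponent

def requantize_float(acc, in_scale, out_scale, out_zp):
    """Reference path: rescale in floating point, then round."""
    return clamp(round(acc * (in_scale / out_scale)) + out_zp)

def requantize_fixed(acc, in_scale, out_scale, out_zp):
    """Integer-only path: multiply by the significand, then shift with rounding."""
    significand, exponent = fixed_point_multiplier(in_scale / out_scale)
    shift = 31 - exponent  # assumes the multiplier is below 2**30, so shift >= 1
    rounded = (acc * significand + (1 << (shift - 1))) >> shift
    return clamp(rounded + out_zp)

# Hypothetical scales: an int32 accumulator rescaled to int8.
in_scale, out_scale, out_zp = 0.0007, 0.05, 0

accumulators = range(-5000, 5000)
max_diff = max(
    abs(requantize_float(a, in_scale, out_scale, out_zp)
        - requantize_fixed(a, in_scale, out_scale, out_zp))
    for a in accumulators
)
```

In this model the two paths can disagree by at most one quantization level per op, at rounding boundaries. Whether such differences matter for calibration is exactly the open question raised here.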

Therefore, I am not entirely convinced that calibration on the QNN graph is necessary. Please do provide some data points or some TVM user anecdotes if you believe otherwise. Since calibration on the QNN graph is central to your design, I would request you to spend more time thinking about whether it’s necessary before jumping into the future items that you listed above.

anijain2305 · February 22, 2021, 11:34pm

@masahi There are some points where Lily is correct here.

For qnn.conv2d and qnn.dense, the output_scale = data_scale * kernel_scale and output_zero_point = 0, with standard out_dtype as int32. If the inputs to conv2d or dense were asymmetric, the qnn.conv2d/dense will internally adjust the computations accordingly, but the output scale and zero point still remain same as above. This output is then later passed through a requantize operator which takes this int32 output and converts to a specified output scale and dtype.
For asymmetric, we can do asymmetric on both int8 and uint8 datatypes. Maybe Lily is using only int8 datatype for now.
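The first point can be checked numerically. The following is a minimal sketch in plain Python of a single dot product (one output element of a dense layer), assuming an asymmetric int8 input and symmetric int8 weights. It mimics the arithmetic only and is not how qnn.dense is lowered; the values and helper names are hypothetical.

```python
def quantize(x, scale, zp, qmin=-128, qmax=127):
    """Affine quantization of a float to int8."""
    return max(qmin, min(qmax, round(x / scale) + zp))

# Hypothetical FP32 data and weights.
data = [0.1, 0.75, 1.5, 2.9]
kernel = [-0.5, 0.25, 0.8, -0.1]

# Asymmetric int8 input covering [0, 3]; symmetric int8 weights covering [-0.8, 0.8].
data_scale, data_zp = 3.0 / 255, -128
kernel_scale, kernel_zp = 0.8 / 127, 0

q_data = [quantize(x, data_scale, data_zp) for x in data]
q_kernel = [quantize(w, kernel_scale, kernel_zp) for w in kernel]

# int32 accumulator: the input zero points are accounted for inside the op.
acc = sum((qd - data_zp) * (qk - kernel_zp) for qd, qk in zip(q_data, q_kernel))

# The accumulator is interpreted with these output qparams.
output_scale = data_scale * kernel_scale
output_zero_point = 0

approx = output_scale * (acc - output_zero_point)
exact = sum(x * w for x, w in zip(data, kernel))
```

Here approx tracks exact up to the quantization error of the inputs, even though the data zero point is nonzero, which matches the statement that the output scale and zero point do not depend on whether the inputs were asymmetric.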

masahi · February 22, 2021, 11:38pm

Yes, but what about the final dense, for example? Currently it does qnn.dense → dequantize with zp always 0. The last dequantize will not be replaced by requantize, and so the final zp is always 0 and the outputs are centered around 0.

And doing asymmetric with int8 only gives us the range [0, 128), no?

masahi · February 23, 2021, 1:24am

Per offline discussion with @anijain2305, my comment on this framework not supporting asymmetric Q seems to be incorrect. I was assuming that asymmetric == “uint8”, but according to @anijain2305 this is a common misconception and it is totally reasonable to have int8 Q with nonzero zero point, which is the correct definition of asymmetric Q. And this framework does seem to support “asymmetric” Q in the sense above.
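To make the int8-with-nonzero-zero-point case concrete, here is a small sketch in plain Python. It picks qparams for a one-sided range such as the output of a ReLU6, and shows that all 256 int8 levels are used once the zero point is allowed to be nonzero. The choose_qparams helper and the [0, 6] range are hypothetical and only illustrate the definition.

```python
def choose_qparams(rmin, rmax, qmin=-128, qmax=127):
    """Pick a scale and zero point so that [rmin, rmax] maps onto [qmin, qmax]."""
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = round(qmin - rmin / scale)
    return scale, zero_point

def quantize(x, scale, zp, qmin=-128, qmax=127):
    return max(qmin, min(qmax, round(x / scale) + zp))

def dequantize(q, scale, zp):
    return scale * (q - zp)

# One-sided real range, e.g. activations in [0, 6].
scale, zero_point = choose_qparams(0.0, 6.0)

lowest = quantize(0.0, scale, zero_point)   # real 0.0 maps to the zero point
highest = quantize(6.0, scale, zero_point)  # top of the range maps to qmax

# Every int8 level corresponds to a distinct real value in [0, 6].
levels = {quantize(i * scale, scale, zero_point) for i in range(256)}
```

With a zero point of -128, the signed int8 container covers the same 256 steps over [0, 6] that uint8 with a zero point of 0 would, so asymmetry is a property of the zero point rather than of the dtype.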

And dequantize always having 0 as zero point also seems to be fine in this framework. I realized my view is heavily biased towards how PyTorch does quantization, where quantized dense is uint8 -> uint8 and requantize is not explicit. So the correct translation to QNN for the final dense must be qnn.dense → requantize → dequantize, and that dequantize can take on non zero input zero point (which is the output zp of requantize). Since in this framework dequantize input is always the output of qnn.dense (or qnn.conv), whose output zp is always 0, it is fine.

I’m sorry for making a false statement! @electriclilies

electriclilies · February 23, 2021, 4:01pm

@Johnson9009 The new quantization framework is designed to be more modular, flexible and extensible. People will be able to easily implement new methods for calibration without changing any other part of the framework. Additionally, because we use the pattern matcher to quantize individual patterns, it will be easy to pick and choose what parts of a graph you want to quantize, and easy to add new patterns.

The current plan is to introduce this code in a new namespace in TVM called “experimental”. The purpose of the experimental namespace is to house features that are new and may have large changes or additions in the future. Eventually, experimental features will be migrated to mainstream TVM as they stabilize and gain users.

electriclilies · February 23, 2021, 4:30pm

@animesh and @masahi Thanks for the feedback about calibrating with QNN. I will definitely do some investigation about how well calibrating with QNN works before proceeding with any other features. I’ll try to provide some data points to you soon.

@animesh, I hadn’t thought of point a), I’ll definitely take that into consideration. I’m not sure I understand point b). Could you explain your logic a bit more?

Also, even if calibrating with QNN input values doesn’t work very well, I think that it is still useful to expose the parts of the QNN graph in the callback. Specifically, I also provide access to the output of each layer (quantized and unquantized), so you will be able to do a comparison between these and see how much accuracy you lose on a per pattern basis, as you calibrate the graph. I think this could be very useful for figuring out what parts of graphs benefit from being quantized and what parts don’t, or if a specific calibration method is performing badly on a certain pattern or certain part of the graph. Correct me if I’m wrong, but my understanding is that right now, there isn’t a way to look at accuracy loss throughout the graph, only holistically.

Additionally, if (or when) in the future we support quantization for datatypes like int4 or int2, etc, it might be useful to try calibration with different datatypes, compare the accuracy loss at the current layer, and based on that decide which datatype to use. Supporting this feature for additional datatypes probably would require some changes to the framework (as well as adding qnn support for other datatypes), but it is another scenario where exposing the QNN graph is useful.

sergey · January 9, 2024, 6:17am

Hi Lily (@electriclilies), I’m exploring a HW technology that requires dynamic / data-dependent graph transformations and your approach seems to be very relevant to this kind of use case. Did you get a chance to create a tutorial / make further progress on this approach? I haven’t been following quantization discussions closely so any pointers on where quantization in TVM is headed are very much appreciated.