[PyTorch][QNN] Cannot import Quantized MobileBERT produced by FX graph mode quantization

Hi, I am trying to import a scripted, quantized MobileBERT torch model via the relay.frontend.from_pytorch API, which works fine for quantized vision models like resnet50. However, with MobileBERT I ran into the following problem, maybe due to the complexity of the BERT model.
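For context, the import is attempted roughly like this (quantized_mobilebert, example_inputs, and the input names/shapes/dtypes below are placeholders rather than my exact setup):

import torch
from tvm import relay

# Script (or trace) the FX-quantized model, then hand it to the TVM frontend.
script_module = torch.jit.trace(quantized_mobilebert.eval(), example_inputs)
input_infos = [
    ("input_ids", ((1, 384), "int64")),
    ("attention_mask", ((1, 384), "int64")),
    ("token_type_ids", ((1, 384), "int64")),
]
mod, params = relay.frontend.from_pytorch(script_module, input_infos)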

When it comes to

4171: input_scales_for_bias = qnn_torch.add_input_quant_params_to_op_inputs(graph)

inside the qnn_torch.add_input_quant_params_to_op_inputs() function,

for

490: if "quantized::conv" in operator or "quantized::linear" in operator:
        # This is required for quantizing the bias
        assert len(input_scales) == 1, "One quantized parameter expected for qconv or qlinear."
        input_scales_for_bias[node.inputsAt(1).debugName()] = input_scales[0].node().f("value")

it expects input_scales[0].node() to be something like:

%493 : float = prim::Constant[value=0.01865844801068306]()

(The case in quantized resnet50)

Here the converter can directly pull out the value of the scale since it is a constant.

However in the MobileBERT case input_scales[0].node() is actually something like:

%cat_output_scale_0.1 : Tensor = prim::GetAttr[name="cat_output_scale_0.1"](%self.1)

It is not a constant straight out of the box; the converter would need to fetch the scale attribute from %self.1 (i.e. the script module). Thus an error is thrown, since there is no “value” attribute for input_scales[0].node().f to read here.

Actually, I tried to hack around this by passing in the script_module and getting the scale constant by hand via something like float(getattr(script_module, func_name)), and it does eliminate the error here temporarily. However, this is not a complete solution, and I still get more errors in the following steps since some of the nodes cannot be recognized/parsed correctly.
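For concreteness, the hack is roughly along these lines (a simplified sketch rather than a proper patch; script_module would have to be threaded into add_input_quant_params_to_op_inputs, which the stock converter does not do, and attr_name is just an illustrative variable name):

if "quantized::conv" in operator or "quantized::linear" in operator:
    scale_node = input_scales[0].node()
    if scale_node.kind() == "prim::Constant":
        # The resnet50 case: the scale is a constant and can be read directly.
        scale = scale_node.f("value")
    else:
        # The MobileBERT case: a prim::GetAttr, so read the attribute off the ScriptModule.
        attr_name = scale_node.s("name")
        scale = float(getattr(script_module, attr_name))
    input_scales_for_bias[node.inputsAt(1).debugName()] = scale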

Any suggestions for this problem? Thanks : )

@masahi @lhutton1 @comaniac

The best solution would be to “inline” such GetAttr nodes, so that qparams are always directly accessible as constants. I believe such a transformation is not hard to do in FX.
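For example, something along these lines should do it in FX (an untested sketch; fold_qparam_getattrs and _fetch_attr are made-up names, and matching on “scale” / “zero_point” in the attribute name is only a heuristic):

import torch
import torch.fx as fx
from torch.fx.node import map_arg

def _fetch_attr(gm: fx.GraphModule, target: str):
    # Walk a dotted attribute path such as "encoder.cat_output_scale_0".
    obj = gm
    for name in target.split("."):
        obj = getattr(obj, name)
    return obj

def fold_qparam_getattrs(gm: fx.GraphModule) -> fx.GraphModule:
    # Replace get_attr nodes that fetch scale / zero_point values with literal
    # constants, so scripting later produces prim::Constant instead of prim::GetAttr.
    for node in list(gm.graph.nodes):
        if node.op != "get_attr":
            continue
        if "scale" not in node.target and "zero_point" not in node.target:
            continue
        val = _fetch_attr(gm, node.target)
        if isinstance(val, torch.Tensor) and val.numel() == 1:
            val = val.item()
        elif not isinstance(val, (int, float)):
            continue  # leave anything unexpected untouched
        # Bake the literal value into every consumer's args, then drop the get_attr.
        for user in list(node.users):
            user.args = map_arg(user.args, lambda n: val if n is node else n)
            user.kwargs = map_arg(user.kwargs, lambda n: val if n is node else n)
        gm.graph.erase_node(node)
    gm.graph.lint()
    gm.recompile()
    return gm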


Hi masahi, thanks for the advice! I “inlined” all constants on the torch FX end and that problem is solved. However, I ran into some new errors at the new_mod = transform.InferType()(new_mod) step after converting a linear layer op.

The error says:

The Relay type checker is unable to show the following types match:
  Tensor[(384), int32]
  Tensor[(512), int32]
In particular:
  dimension 0 conflicts: 384 does not match 512.

Since it is a quantized BERT model, the input tensor of this linear layer has shape (1, 384, 384) and the weight has shape (384, 512), which is slightly different from what a computer vision model’s linear input looks like. Could this mess up the default quantization operations and thus cause the error?

Our dense op expects the weight to be transposed, i.e. (512, 384) in your case. You can add op.transpose on your weight.

Ah sorry, there’s a typo there: the weight’s shape is already (512, 384) and I’m still getting this error. My guess is that the quantized dense operator expects a 2-dimensional input tensor and a 2-dimensional weight here; since the input tensor in the BERT case is 3-dimensional, it can’t deal with it properly. I’m thinking of using the quantized batch_matmul operator as an alternative and reshaping the weight tensor to (1, 512, 384) to match the input tensor’s (1, 384, 384). Do you think this is doable?

In the non-quantized case, there is an adaptive mapping strategy:

def linear(self, inputs, input_types):
    # https://pytorch.org/docs/stable/nn.functional.html#linear
    # 0 - input
    # 1 - weight
    bias = inputs[2]
    a_shape = self.infer_shape_with_prelude(inputs[0])
    b_shape = self.infer_shape_with_prelude(inputs[1])
    if len(a_shape) == 2 and len(b_shape) == 2:
        mm_out = _op.nn.dense(inputs[0], inputs[1])
    elif len(b_shape) == 1:
        mm_out = self.matmul([inputs[0], inputs[1]], input_types[:2])
    else:
        mm_out = self.matmul(
            [inputs[0], _op.transpose(inputs[1], axes=(1, 0))], input_types[:2]
        )
    if isinstance(bias, _expr.Expr):
        bias_ndims = len(self.infer_shape_with_prelude(bias))
        if bias_ndims == 1:
            return _op.nn.bias_add(mm_out, bias, axis=-1)
        mm_dtype = self.infer_type_with_prelude(mm_out).dtype
        return self.add([mm_out, bias], [mm_dtype, input_types[2]])
    return mm_out

I guess that’s what we are missing in the quantized case in order to support BERT.

Do you think there’s a workaround for this case?

Given the error message above, I think it is complaining about the shape of the bias tensor.

I reshaped the input tensor from (1, 384, 384) to (384, 384) so that qnn.dense can take it in properly, and it seems to work for now… I still need to check the final correctness of the whole model. However, if I end up having to do this, the batch size cannot be larger than 1.
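In Relay terms the workaround is roughly the following (a sketch with hard-coded shapes; inp, weight, and the scale / zero-point variables are placeholders):

from tvm import relay

# qnn.dense only takes 2-D data, so collapse the batch dim and restore it afterwards.
inp_2d = relay.reshape(inp, (384, 384))              # (1, 384, 384) -> (384, 384)
dense_out = relay.qnn.op.dense(
    inp_2d, weight,                                  # weight: (512, 384)
    input_zero_point, kernel_zero_point,
    input_scale, kernel_scale,
    units=512,
)
dense_out = relay.reshape(dense_out, (1, 384, 512))  # put the batch dim back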

Just a quick update, I solved this problem.

It turns out that you are correct. The default requantize and bias_add axis is set to 1, whereas it should be axis=-1 (axis=2 in my case), since the output of my dense operator is 3-dimensional, (1, 384, 512), and the bias, as well as the requantize factors, should be applied along the last axis.

The problem is solved by changing the axis param of the requantize and bias_add operators to -1.
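In code, that amounts to roughly the following (a sketch; dense_out, bias, and the scale / zero-point variables are placeholders):

from tvm import relay

# With a 3-D dense output of shape (1, 384, 512), the bias and the requantize
# parameters apply along the last axis, not axis=1.
out = relay.nn.bias_add(dense_out, bias, axis=-1)
out = relay.qnn.op.requantize(
    out,
    input_scale=requant_in_scale,
    input_zero_point=relay.const(0, "int32"),
    output_scale=out_scale,
    output_zero_point=out_zero_point,
    axis=-1,
    out_dtype="int8",
)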

Hi @mgeek, great to hear that you managed to get the quantized MobileBERT compiled by TVM. Could you please share your setup and scripts? I was searching for quantized PyTorch model compilation, and there are way more questions than answers out there. It’d help the community greatly if we can learn from your investigation. Thanks.

Sure, I guess I can write a post about it as soon as I get some time. However, it’s not easy to put all the changes into one script, since it involves some minor changes here and there…

If you are stuck somewhere as well when trying to import quantized MobileBERT, I’d be more than glad to take a look and see if there’s anything I can do to help.

Hi @mgeek, thanks. I can try to go through this process with your help, summarize what I learn in a post, and share it. I’ll start with a DM first.