At the TVM Community Meeting we discussed that in the TVM quantization flow, batch_norm is legalized rather than directly instrumented and quantized. @mbaret asked if we could provide a reproducer for this.
This post is that reproducer. It uses the current main TVM distribution with small changes given in the provided patch (tvm_code_changes.patch), a simple ONNX model (model.onnx) with three convolutions and a batch normalization after the second convolution, and an input file (input_10.json). Further, a Jupyter notebook (test_quantized.ipynb) is provided which runs the quantization flow. This reproduces the issues observed by Nikhil.
The complete logfile (output_bn_reproducer.txt) of the run is provided.
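For reference, a minimal sketch of the flow the notebook runs is shown below. This is our reconstruction, not the notebook's exact contents; in particular the input tensor name and shape are placeholders, since they depend on model.onnx.

```python
# Sketch of the quantization flow (assumed to match test_quantized.ipynb):
# import the ONNX model, build a one-sample calibration set from
# input_10.json, and run TVM's quantization pass.
import json
import numpy as np
import onnx
from tvm import relay

onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 32, 32)}  # placeholder input name and shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

with open("input_10.json") as f:
    sample = np.asarray(json.load(f), dtype="float32").reshape(shape_dict["input"])

def calibrate_dataset():
    # relay.quantize consumes an iterable of {input_name: array} records
    yield {"input": sample}

with relay.quantize.qconfig(calibrate_mode="kl_divergence"):
    qmod = relay.quantize.quantize(mod, params, dataset=calibrate_dataset())
print(qmod)
```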
The code changes instrument the batch_norm, print the new_args[0] of each conv layer, and set the diagnostics level to kHelp (most detailed output).
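The patch itself is attached; for orientation, a sketch of what the Python side of such instrumentation can look like is given below. It is modeled on the existing conv2d_rewrite in python/tvm/relay/quantize/_annotate.py; the helper names used (register_annotate_function, attach_simulated_quantize, QAnnotateExpr, QAnnotateKind, _get_expr_kind, _forward_op, quantize_context) exist in that module, but the batch_norm rewrite body is our assumption, and the C++ change that raises the diagnostics level to kHelp is not shown.

```python
# Sketch of an annotate rewrite for nn.batch_norm, placed in
# python/tvm/relay/quantize/_annotate.py where the helpers below are in
# scope. The handling of batch_norm is our assumption, not upstream code.

@register_annotate_function("nn.batch_norm")
def batch_norm_rewrite(ref_call, new_args, ctx):
    """Annotate batch_norm: data goes to the INPUT field, the four
    per-channel parameters (gamma, beta, mean, var) to WEIGHT."""
    if quantize_context().check_to_skip(ref_call):
        return None
    data_expr, data_kind = _get_expr_kind(new_args[0])
    if data_kind is None or data_kind == QAnnotateKind.ACTIVATION:
        data_expr = attach_simulated_quantize(data_expr, QAnnotateKind.INPUT)
    param_exprs = [
        attach_simulated_quantize(_get_expr_kind(arg)[0], QAnnotateKind.WEIGHT)
        for arg in new_args[1:]
    ]
    expr = _forward_op(ref_call, [data_expr] + param_exprs)
    return QAnnotateExpr(expr, QAnnotateKind.ACTIVATION)

# The debug print added to the existing conv2d_rewrite, which produces
# the marker sections visible in output_bn_reproducer.txt:
#   print("DEBUG_BN_QUANT_conv_begin_new_args[0]")
#   print(new_args[0])
#   print("DEBUG_BN_QUANT_conv_end_new_args[0]")
```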
In the logfile you see:

- For each conv layer, a print section marked with DEBUG_BN_QUANT_conv_begin_new_args[0] and DEBUG_BN_QUANT_conv_end_new_args[0].
- The third conv layer, which comes after the batch_norm, results in an error after its new_args[0] is printed.
- In the new_args[0] output you can see that simulated_quantize was added to the Relay graph after batch_norm:
```
%14 = relay.op.annotation.simulated_quantize(%13, %dom_scale6, %clip_min6, %clip_max6, kind=1);
%15 = relay.op.annotation.simulated_quantize(meta[relay.Constant][4] /* ty=Tensor[(64), float32] */, %dom_scale7, %clip_min7, %clip_max7, kind=2);
%16 = relay.op.annotation.simulated_quantize(meta[relay.Constant][5] /* ty=Tensor[(64), float32] */, %dom_scale8, %clip_min8, %clip_max8, kind=2);
%17 = relay.op.annotation.simulated_quantize(meta[relay.Constant][6] /* ty=Tensor[(64), float32] */, %dom_scale9, %clip_min9, %clip_max9, kind=2);
%18 = relay.op.annotation.simulated_quantize(meta[relay.Constant][7] /* ty=Tensor[(64), float32] */, %dom_scale10, %clip_min10, %clip_max10, kind=2);
%19 = nn.batch_norm(%14, %15, %16, %17, %18, epsilon=0.01f);
free_var %dom_scale11;
free_var %clip_min11;
free_var %clip_max11;
%20 = relay.op.annotation.simulated_quantize(%19, %dom_scale11, %clip_min11, %clip_max11, kind=1);
%21 = %20.0;
nn.relu(%21)
```
The error messages include:

```
The type inference pass was unable to infer a type for this expression.
This usually occurs when an operator call is under constrained in some way, check other reported errors for hints of what may have happened.
```

The failure appears to stem from %20 above: simulated_quantize is applied to %19, the tuple returned by nn.batch_norm, before that tuple is indexed in %21, so type inference cannot resolve a call that expects a tensor argument.
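To illustrate this failure mode in isolation, here is a small standalone construction of ours (not part of the reproducer): feeding the tuple returned by nn.batch_norm directly into a tensor op makes type inference fail with an analogous diagnostic.

```python
# Standalone demonstration (our construction): a tuple-typed expression
# passed where a tensor is expected breaks Relay type inference.
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 64, 8, 8))
gamma, beta, mean, var = (relay.var(n, shape=(64,)) for n in ("gamma", "beta", "mean", "var"))
bn = relay.nn.batch_norm(data, gamma, beta, mean, var, epsilon=0.01)

# bn is a TupleWrapper; bn.astuple() is the raw tuple-typed call, which
# is what ended up feeding simulated_quantize in %20 above.
bad = relay.nn.relu(bn.astuple())
func = relay.Function(relay.analysis.free_vars(bad), bad)
mod = tvm.IRModule.from_expr(func)
mod = relay.transform.InferType()(mod)  # fails: tensor op applied to a tuple
```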
Please let us know if you need any other material or if you run into any issues running this reproducer.