I’m working on int8 calibration and have some observations on the choice of bias precision.
I implemented per-layer activation scales (similar to this PR: https://github.com/dmlc/tvm/pull/2753) and used a simple calibration method: take a power2 scale based on the maximum absolute value of each layer’s output over a small calibration dataset. This approach works well (ImageNet absolute top-1 accuracy drop ~1%) on some models such as resnet-50 and vgg-16.
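A minimal sketch of that calibration step, assuming numpy and a hypothetical helper name (not the exact patch):

```python
import math
import numpy as np

def power2_scale(calib_outputs, num_bits=8):
    """Pick a power-of-2 scale from one layer's outputs over the calibration set."""
    max_abs = max(float(np.abs(out).max()) for out in calib_outputs)
    qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
    if max_abs == 0.0:                          # degenerate all-zero layer
        return 1.0
    # Round the real-valued scale max_abs / qmax up to the nearest power of two,
    # so the whole observed range stays representable after quantization.
    return 2.0 ** math.ceil(math.log2(max_abs / qmax))
```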
My patch changes the scale selected in AddRealize. Previously the lhs scale was selected in almost all cases, because the lhs scale comes from conv2d and is the product of the input and weight scales (smaller than the rhs scale), so the bias was shifted left in that case.
After my patch, the bias scale is selected and the conv2d result is shifted left instead.
A left shift in either case does not overflow and should not cause precision loss. But it is possible that the bias is shared per channel, so more bits would help.
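To make the shift concrete, here is a rough numpy sketch of the scale unification (not the actual AddRealize code; the helper name is made up). Whichever operand has the larger power-of-2 scale gets shifted left into the smaller scale’s domain, which is exact as long as the shifted values still fit in int32:

```python
import numpy as np

def realize_add(lhs, lhs_scale, rhs, rhs_scale):
    """lhs, rhs: int32 tensors; *_scale: power-of-2 floats (real = q * scale)."""
    if lhs_scale <= rhs_scale:
        # lhs scale selected (e.g. the conv2d output); rhs (e.g. bias) is shifted left
        shift = int(np.log2(rhs_scale / lhs_scale))
        return lhs + (rhs << shift), lhs_scale
    else:
        # rhs scale selected; the conv2d result is shifted left instead
        shift = int(np.log2(lhs_scale / rhs_scale))
        return (lhs << shift) + rhs, rhs_scale
```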
I would like to discuss the observation here and see how we can improve the quantization accuracy.
Some of my experiment numbers:
resnet101 top1/top5 accuracy on Imagenet (first 3k images)
I’m currently working on fully automatic calibration that just does the most commonly used method: picking the domain scale that minimizes the L2 norm. We use this method to implement calibration for per-layer and optionally per-channel scales, and potentially weights as well. This also provides some improvement over the current quantization accuracy results, so I am interested to see what we can get with the combination of all the improvements.
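The search itself is simple; roughly something like this sketch (my prototype may differ in the candidate set and other details, and the helper name is only for illustration):

```python
import numpy as np

def choose_scale_l2(activations, num_bits=8, num_candidates=10):
    """Pick the power-of-2 scale minimizing the L2 error of a quantize/dequantize round trip."""
    qmax = 2 ** (num_bits - 1) - 1
    base = np.ceil(np.log2(np.abs(activations).max() / qmax))
    best_scale, best_err = None, np.inf
    for k in range(num_candidates):
        scale = 2.0 ** (base - k)                  # smaller scales clip more outliers
        q = np.clip(np.round(activations / scale), -qmax, qmax)
        err = np.linalg.norm(activations - q * scale)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```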
By the way, have you tried your pass on the v2 versions of resnet in the mxnet model zoo? I see catastrophic accuracy drops with the current pass on those models, and I wonder if it is due to problems with the bias.
Yes, the resnets mentioned above are v2. Resnet-50/101 v2 have a ~1% drop.
I observed a significant accuracy drop on resnet18 v2 when using power2 activation scales instead of a global scale, but the accuracy is normal after setting skip_k_conv = 2.
By the way, I’m also working on a KL-divergence-based scale. I assume it will give similar results to the L2-norm-based one.
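For reference, what I have in mind follows the usual TensorRT-style recipe; below is a very rough sketch with a simplified bin-redistribution step (the real procedure only spreads mass over non-empty bins), and the function name is just for illustration:

```python
import numpy as np
from scipy.stats import entropy

def choose_scale_kl(activations, num_bits=8, num_bins=2048):
    """Pick a clipping threshold (hence scale) minimizing KL divergence."""
    qmax = 2 ** (num_bits - 1) - 1
    nlevels = qmax + 1
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_scale, best_kl = None, np.inf
    for i in range(nlevels, num_bins + 1):          # candidate clipping bins
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()                     # fold the clipped tail into the last bin
        # Simulate int8 quantization of the first i bins: merge them into
        # nlevels groups, then spread each group's mass evenly back out.
        q = np.zeros(i)
        for idx in np.array_split(np.arange(i), nlevels):
            q[idx] = p[idx].sum() / len(idx)
        kl = entropy(p + 1e-12, q + 1e-12)          # KL(p || q); entropy normalizes both
        if kl < best_kl:
            best_kl, best_scale = kl, edges[i] / qmax
    return best_scale
```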
The use case is quantized resnet.
Consider a residual add case:

```
 data
 |   \
 bn   |
 |    |
relu  |
 |    |
conv  |
  \  /
  add
```
After quantization, there will be a duplicated simulated_quantize(data, INPUT) in each of the two branches.
Writing and reading int32 results to/from global memory can be slow, so we use stop_fusion to ensure that the subfunction outputs int8 results. We don’t combine the cast(i32), so that cast(i32) is done in each consumer subfunction.
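Concretely, the cut point looks roughly like this hand-written sketch using the relay Python API (for illustration only, not the actual realize pass; the shift amount and helper name are arbitrary):

```python
from tvm import relay

def requantize_and_cut(x, shift_bits=7):
    """Requantize to int8 and stop fusion, so the int8 value is what hits global memory."""
    x = relay.right_shift(x, relay.const(shift_bits, "int32"))
    x = relay.cast(x, "int8")
    # stop_fusion marks the subfunction boundary here; each consumer
    # subfunction then does its own cast back to int32.
    return relay.annotation.stop_fusion(x)
```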
I am not sure I understand the situation, can you annotate where the casts are in your example?
Right now I think the current fskip implementation is too brittle and has consequences downstream. In my quantization use case, skipping cast(i32) causes every identity branch to be exhaustively recomputed because CSE stops at that step.
data is usually the output of the previous conv2d. There are duplicated simulated_quantize ops, and the add that follows in both branches converts the int8 back to int32. So there is a simulated_quantize + add in both branches, which gets translated to right_shift + cast(i8) + cast(i32).
We use stop_fusion to ensure that the previous conv2d result is cast to int8 before being stored to global memory.
You will see the difference running quantized ResNet-50 v2.
So the issue is, I think, that we have somewhat different use cases :). I am prototyping per-channel quantization on CPU, where the compute-to-bandwidth ratio is lower, so the difference is probably not as apparent. However, in my situation, preventing the casts from being removed also explodes even resnet-18 to over 3000 intermediate values, which is far worse than the bandwidth overhead. I wonder if modifying the annotate pass to treat the adds differently here would work.