Background: There have been many papers in the academic literature on quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM also recently gained the ability to quantize weights (https://github.com/dmlc/tvm/pull/2116)
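For concreteness, the core idea behind post-training weight quantization can be sketched in a few lines of NumPy. This is an illustrative affine (asymmetric) int8 scheme, not the exact algorithm used by any of the frameworks below:

```python
import numpy as np

def quantize_int8(w):
    """Affine post-training quantization of a float tensor to int8."""
    lo, hi = float(w.min()), float(w.max())
    lo, hi = min(lo, 0.0), max(hi, 0.0)            # keep 0.0 exactly representable
    scale = (hi - lo) / 255.0 or 1.0               # avoid zero scale for constant tensors
    zero_point = int(round(-lo / scale)) - 128     # maps lo -> -128, hi -> 127
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximate float tensor from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

# Demo: quantize a random weight matrix and check the reconstruction error.
w = np.random.RandomState(0).randn(64, 64).astype(np.float32)
q, s, z = quantize_int8(w)
err = np.abs(dequantize(q, s, z) - w).max()
```

The int8 tensor plus one scale and one zero point per tensor is roughly a 4x size reduction over float32; the maximum reconstruction error is about half the scale.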
I am currently working on a systematic benchmark of existing frameworks for (post-training) quantization. A rigorous benchmark will help machine learning practitioners make informed decisions. Any suggestions are welcome.
Frameworks:
- TVM
- MXNet: quantization example
- TensorFlow Lite: quantization tutorial
Models: for now, only Image Classification.
- Inception V3
- ResNet-50 V1
- ResNet-152 V1
- DenseNet201
- MobileNet V2_1.0
Metrics to measure:
- Top-1 / Top-5 Accuracy on ImageNet Validation Set
- Inference time per image
- Model size in memory (MB)
- Model size as a serialized file (MB)
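The latency and size metrics can be collected with framework-agnostic helpers along these lines. The `predict` callable and the dummy input below are hypothetical placeholders standing in for a real model's inference function:

```python
import os
import statistics
import time

def topk_correct(scores, label, k):
    """True if `label` is among the k highest-scoring classes (Top-1 / Top-5)."""
    topk = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
    return label in topk

def benchmark(predict, images, warmup=5):
    """Mean/stdev per-image latency in ms for a single-image inference callable."""
    for img in images[:warmup]:                    # warm-up runs, excluded from timing
        predict(img)
    times = []
    for img in images:
        t0 = time.perf_counter()
        predict(img)
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(times), statistics.stdev(times)

def file_size_mb(path):
    """Model size as a serialized file, in MB."""
    return os.path.getsize(path) / 2 ** 20

# Demo with a dummy predictor standing in for a real model.
mean_ms, std_ms = benchmark(lambda img: sum(img), [[0.0] * 1000] * 20)
```

Per-image timing with batch size 1 matches a latency-oriented deployment; throughput-oriented batched inference would need a separate measurement.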
Benchmark environment: AWS EC2 C5d.18xlarge (72 vCPUs)
- To account for variance due to virtualization and shared hardware, I will perform multiple trials by launching new instances, and run statistical tests to check whether an apparent difference in a metric is significant or due to chance.
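One simple, distribution-free way to run such a test is a two-sided permutation test on the difference in mean latency between two trials. This is a sketch; a Welch's t-test (e.g. via scipy.stats) would be a reasonable alternative:

```python
import random
import statistics

def permutation_test(a, b, n=10000, seed=0):
    """Two-sided permutation test on the difference in means.

    Returns a p-value: the fraction of random relabelings whose mean gap is
    at least as large as the observed one, under the null hypothesis that
    both samples come from the same latency distribution.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n):
        rng.shuffle(pooled)
        gap = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if gap >= observed:
            hits += 1
    return hits / n

# Demo: two trials from the same distribution vs. a clearly slower one.
rng = random.Random(42)
trial_a = [rng.gauss(10.0, 1.0) for _ in range(30)]
trial_b = [rng.gauss(12.0, 1.0) for _ in range(30)]
p_shifted = permutation_test(trial_a, trial_b)
```

With a handful of instances per framework, this gives an honest answer to "is this 2% latency difference real?" without assuming normality of the latency samples.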
To ensure that we are comparing apples to apples, the benchmark code (with links to the models used) will be published as a GitHub repository.