Currently, TVM lacks an up-to-date and reproducible benchmark. The only benchmark is hosted at tvm/apps/benchmark, but it is outdated and has several flaws:
- The results were obtained two years ago.
- The deep learning models are old; it does not include newer models (e.g., BERT, EfficientNet).
- The input format is TVM's internal Relay format. It does not accept models from high-level frameworks (e.g., PyTorch, MXNet) or an open exchange format (e.g., ONNX).
- It does not cover Intel CPUs.
- It only provides pre-tuned configurations from TopHub, without the scripts used to generate them.
This RFC aims to build a new open, reproducible benchmark for TVM. Once the new benchmark is ready, we can run evaluation nightly and run auto-tuning weekly or monthly.
Approach
As the first step, we target three models, three hardware platforms and four code generation strategies.
To make comparisons with other frameworks easier, we choose ONNX as the input model format (a minimal sketch of the intended flow is shown after the list below).
- models: ResNet-50, MobileNet v2, and BERT, all with batch size 1
- hardware platforms: NVIDIA GPU, Intel CPU, ARM CPU
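To make that concrete, here is a minimal sketch of what a benchmark script for the ONNX path could look like (the model path, input name/shape, and target string are placeholders, and exact APIs can vary slightly across TVM versions):

```python
import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Placeholder model and input; real scripts would iterate over resnet-50, mobilenet v2, and BERT.
onnx_model = onnx.load("resnet50.onnx")
shape_dict = {"data": (1, 3, 224, 224)}  # assumed input name and shape

# Import the ONNX model into Relay and compile it for the chosen target.
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
target = "llvm -mcpu=skylake-avx512"  # Intel CPU example; use "cuda" or an ARM target instead
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Run the compiled module and report the end-to-end latency.
dev = tvm.cpu(0)  # use tvm.cuda(0) for the GPU targets
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("data", np.random.uniform(size=shape_dict["data"]).astype("float32"))
ftimer = module.module.time_evaluator("run", dev, number=1, repeat=50)
latencies_ms = np.array(ftimer().results) * 1000
print("Mean inference time: %.2f ms (std %.2f ms)" % (np.mean(latencies_ms), np.std(latencies_ms)))
```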
Glad to see this is being planned! I am happy to help with this as much as I can.
One question/suggestion: if we are going to have such a formal benchmarking approach, maybe we can make it MLPerf-friendly, so that everyone can use this TVM utility to run these models on their target platform and submit the results to MLPerf.
It would be really nice to add regression tests against a selected set of models, since downstream users usually have to spend quite a lot of time finding the root cause once there is a regression, or they have to sync with the upstream codebase as frequently as possible and test for regressions locally.
cc @jroesch, you may have some comments about the output format or the UX of the test infra.
One question about performance regressions: how do we judge normal fluctuation, especially on CPU? For example, ResNet-50 may measure 20.00 ms but become 20.88 ms after one PR.
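One idea would be to compare medians over many repeated runs and only flag a regression when the slowdown exceeds the observed run-to-run noise. A rough sketch of what I mean (the thresholds and repeat counts here are made up):

```python
import numpy as np

def is_regression(baseline_ms, new_ms, rel_threshold=0.05):
    """Flag a regression only if the median slowdown exceeds both the
    relative threshold and the run-to-run noise of the baseline."""
    base_med, new_med = np.median(baseline_ms), np.median(new_ms)
    noise = 3 * np.std(baseline_ms)  # assumed noise band: 3 sigma of the baseline runs
    slowdown = new_med - base_med
    return slowdown > max(rel_threshold * base_med, noise)

# e.g. 50 timed runs before and after a PR (values in milliseconds)
baseline = np.random.normal(20.00, 0.3, 50)
candidate = np.random.normal(20.88, 0.3, 50)
print(is_regression(baseline, candidate))
```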
Do you think that packaging is a requirement for this work? The point is that if we don’t have binary packages available (so users can simply run `pip install tvm`), it will be hard for external users to reproduce the benchmark results.
Are you planning to use tvmc under the hood? I believe that using tvmc would greatly simplify scripts like benchmark_autotvm.py, etc.
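For example, something along these lines (a rough sketch assuming the Python-level tvmc API available in more recent TVM versions; argument names may differ) could replace much of the per-backend boilerplate:

```python
from tvm.driver import tvmc

# Load the ONNX model, then tune, compile, and benchmark it through the unified tvmc flow.
model = tvmc.load("resnet50.onnx")  # placeholder model path
tvmc.tune(model, target="llvm", tuning_records="resnet50-autotvm.json")
package = tvmc.compile(model, target="llvm", tuning_records="resnet50-autotvm.json")
result = tvmc.run(package, device="cpu")
print(result)
```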
Why add a separate repository? Why not add it directly to the main one?
Why focus on ONNX? I believe that ONNX support in TVM lacks quantization (correct me if I am wrong), which would be nice to test as well.
I have uploaded some initial results to https://github.com/tlc-pack/TLCBench.
They include the performance of the auto-scheduler and AutoTVM on AWS c5.9xlarge and g4dn.4xlarge instances.
Contributions are welcome for more model and backend coverage.