[RFC] Building a new reproducible benchmark for TVM


Currently, TVM lacks an up-to-date and reproducible benchmark. The only existing benchmark is hosted at tvm/apps/benchmark, but it is outdated and has several flaws.

  1. The results were obtained two years ago.
  2. The deep learning models are old. It does not include newer models (e.g., BERT, EfficientNet).
  3. The input format is TVM's internal Relay format. It does not accept models from high-level frameworks (e.g., PyTorch, MXNet) or open exchange formats (e.g., ONNX).
  4. It does not cover Intel CPUs.
  5. It only provides pre-tuned configurations from TopHub, but does not provide the scripts used to generate these configurations.

This RFC aims to build a new open, reproducible benchmark for TVM. Once the new benchmark is ready, we can run evaluation nightly and run auto-tuning weekly or monthly.


As the first step, we target three models, three hardware platforms, and four code generation strategies. To make comparison with other frameworks easier, we choose ONNX as the input model format.

  • models: resnet-50, mobilenet v2 and BERT with batch size 1
  • hardware platforms: NVIDIA GPU, Intel CPU, ARM CPU
  • code generation strategies: autotvm, auto-scheduler, TVM + manual libraries, ONNX Runtime
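The benchmark matrix above can be sketched as a small configuration that enumerates every run; all names below are hypothetical placeholders, not an actual TLCbench API:

```python
import itertools

# Hypothetical benchmark matrix; the names are placeholders, not TLCbench API.
MODELS = ["resnet-50", "mobilenet-v2", "bert"]          # batch size 1, ONNX format
TARGETS = ["nvidia-gpu", "intel-cpu", "arm-cpu"]
STRATEGIES = ["autotvm", "auto-scheduler", "tvm-with-libs", "onnxruntime"]

def benchmark_jobs():
    """Enumerate every (model, target, strategy) combination to run."""
    return [
        {"model": m, "target": t, "strategy": s}
        for m, t, s in itertools.product(MODELS, TARGETS, STRATEGIES)
    ]

jobs = benchmark_jobs()
print(len(jobs))  # 3 models x 3 targets x 4 strategies = 36 runs
```

Enumerating the full cross product up front makes it easy for a nightly driver to shard runs across machines and to spot holes in the uploaded results.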

All logs generated during auto-tuning should be uploaded for future reference.

I created a repo, TLCbench, and opened a roadmap. I am looking for contributors who are interested in helping.


Great suggestion!

Can we make it a nightly/weekly regression test utility, and also consider adding accuracy evaluation for quantized models into this loop?

Yeah, a performance regression test would be very nice. Many times we have had to binary-search to find the commit that caused a regression.

Glad to see this is being planned! I will help with this as much as I can.

One question/suggestion: if we are going to have such a formal benchmarking approach, maybe we can make it MLPerf-friendly, so that everyone can use this TVM utility to run these models on their target platform and submit the results to MLPerf.


It would also be great to consider the output format at https://tvm.apache.org/docs/dev/benchmark.html and iterate on a common log format.
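As a starting point for such a common log format, each nightly run could emit one JSON record per (model, target, strategy) triple; the schema and field names below are only a suggestion, not an existing TVM format:

```python
import json
import statistics

def make_record(model, target, strategy, commit, latencies_ms):
    """Build one benchmark log record; the field names are a suggested schema only."""
    return {
        "model": model,
        "target": target,
        "strategy": strategy,
        "commit": commit,               # TVM commit hash the run was built from
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "std_ms": statistics.stdev(latencies_ms),
        "num_repeats": len(latencies_ms),
    }

record = make_record("resnet-50", "intel-cpu", "auto-scheduler",
                     "abc1234", [20.1, 19.9, 20.0, 20.2, 19.8])
print(json.dumps(record, indent=2))
```

Recording the raw repeat count and dispersion (not just a single mean) is what later makes automated regression detection possible.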

It would be really nice to add regression tests against a selected set of models, since downstream users usually have to spend quite a lot of time finding the root cause once there is a regression, or else they have to sync with the upstream codebase as frequently as possible and test for regressions locally.

cc @jroesch, you may have some comments about the output format or the UX of the test infra.

One question about the performance regression: how do we judge normal fluctuation, especially on CPU? For example, resnet-50 may run at 20.00 ms but become 20.88 ms after one PR.
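One common way to handle this is to repeat each measurement and flag a regression only when the median latency shift exceeds a relative threshold; the 5% value below is an arbitrary placeholder, not a tuned number:

```python
import statistics

def is_regression(baseline_ms, current_ms, threshold=0.05):
    """Flag a regression only if median latency grows by more than `threshold`.

    baseline_ms / current_ms: repeated latency samples (in ms) for the same
    model and target before and after a PR. The 5% default is a placeholder;
    a real noise budget should be calibrated per model and per machine.
    """
    base = statistics.median(baseline_ms)
    cur = statistics.median(current_ms)
    return (cur - base) / base > threshold

# A 20.0 ms -> 20.9 ms shift is 4.5%: within the 5% noise budget.
print(is_regression([20.0, 19.9, 20.1], [20.9, 20.8, 20.9]))  # False
# A 25% jump would be flagged.
print(is_regression([20.0, 19.9, 20.1], [25.1, 24.9, 25.0]))  # True
```

Using the median rather than the mean makes the check robust to an occasional outlier run, which is common on shared CPU machines.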