[RFC] Building a new reproducible benchmark for TVM

merrymercy · November 21, 2020, 2:32am

Motivation

Currently, TVM lacks an up-to-date and reproducible benchmark. The only benchmark is hosted at tvm/apps/benchmark. However, this benchmark is too old and has several flaws.

The results were obtained 2 years ago.
The deep learning models are old. It does not include newer models (e.g., BERT, EfficientNet).
The input format is TVM’s internal Relay format. It does not use formats from high-level frameworks (e.g., PyTorch, MXNet) or open exchange formats (e.g., ONNX).
It does not cover Intel CPUs.
It only provides pre-tuned configurations from tophub, and does not provide the scripts to generate these configurations.

This RFC aims to build a new open, reproducible benchmark for TVM. When the new benchmark is ready, we can run evaluation nightly and run auto-tuning weekly or monthly.

Approach

As the first step, we target three models, three hardware platforms and four code generation strategies. To make the comparison with other frameworks easier, we choose ONNX as the input model format.

models: resnet-50, mobilenet v2 and BERT with batch size 1
hardware platforms: NVIDIA GPU, Intel CPU, ARM CPU
code generation strategies: autotvm, auto-scheduler, tvm + manual library, ONNX-runtime.
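
For illustration, the target matrix above could be enumerated as follows. This is only a sketch: the label strings are hypothetical and are not identifiers used by TLCBench, and some combinations may not be meaningful on every platform.

```python
import itertools

# Hypothetical labels for the targets listed above (not TLCBench identifiers).
MODELS = ["resnet-50", "mobilenet-v2", "bert"]
PLATFORMS = ["nvidia-gpu", "intel-cpu", "arm-cpu"]
STRATEGIES = ["autotvm", "auto-scheduler", "tvm+manual-library", "onnx-runtime"]
BATCH_SIZE = 1


def benchmark_matrix():
    """Return every (model, platform, strategy) combination at batch size 1."""
    return [
        {"model": m, "platform": p, "strategy": s, "batch_size": BATCH_SIZE}
        for m, p, s in itertools.product(MODELS, PLATFORMS, STRATEGIES)
    ]


if __name__ == "__main__":
    configs = benchmark_matrix()
    print(len(configs), "configurations")
    for cfg in configs[:3]:
        print(cfg)
```

Keeping the matrix in one place like this would make it easy to add a model or backend later without touching the scripts that run each configuration.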

All logs generated during auto-tuning should be uploaded for future reference.

I created a repo, TLCBench, and opened a roadmap. I am looking for contributors who are interested in helping.

ziheng · November 21, 2020, 2:54am

Great suggestion!

Can we make it a nightly/weekly regression test utility, and also consider adding accuracy evaluation for quantized models to this loop?

kevinthesun · November 21, 2020, 5:02am

Yeah. A performance regression test would be very nice. There are many times when we need to do a binary search to find the commit causing a regression.
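
The binary search mentioned here can be sketched as below. The function name and the `is_bad` callback are hypothetical; in practice `is_bad` would check out a commit, build it, and run the benchmark. The sketch assumes the oldest commit is good, the newest is bad, and every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is True.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and once a commit is bad all later ones are bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


if __name__ == "__main__":
    # Toy history with made-up latencies in ms; the slowdown starts at "c5".
    latency_ms = {"c0": 20.0, "c1": 20.1, "c2": 19.9, "c3": 20.0,
                  "c4": 20.1, "c5": 23.0, "c6": 23.1, "c7": 22.9}
    history = list(latency_ms)
    print(first_bad_commit(history, lambda c: latency_ms[c] > 21.0))
```

With a nightly benchmark, the search range would usually shrink to a single day of commits, which is the main saving.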

comaniac · November 22, 2020, 1:18am

Glad to see this is being planned! I will help with this as much as I can.

One question/suggestion: if we are going to have such a formal benchmarking approach, maybe we can make it MLPerf-friendly, so that everyone can use this TVM utility to run these models on the target platform and submit the results to MLPerf.

tqchen · November 22, 2020, 2:33am

It would also be great to consider the output format described at https://tvm.apache.org/docs/dev/benchmark.html and iterate on a common log format.
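
As a rough sketch of what one record in a common log format might look like, written as one JSON object per line: the field names below are assumptions loosely modeled on the linked page, which remains the authoritative schema, and all values are made-up placeholders.

```python
import json
import time


def make_record(workload, engine, hardware, runtime_ms_mean, runtime_ms_std):
    """Build one benchmark record as a flat, JSON-serializable dict.

    Field names are illustrative assumptions; check the linked page for
    the actual schema before relying on them.
    """
    return {
        "timestamp": int(time.time()),
        "schema_version": "0.1",  # placeholder version string
        "workload": workload,
        "engine": engine,
        "hardware_config": hardware,
        "runtime_ms_mean": runtime_ms_mean,
        "runtime_ms_std": runtime_ms_std,
    }


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    record = make_record("resnet-50", "tvm", "c5.9xlarge", 20.0, 0.1)
    print(json.dumps(record, sort_keys=True))
```

One record per line keeps the log easy to append to during nightly runs and easy to load for comparison across dates.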

zhiics · November 22, 2020, 3:18am

It would be really nice to add regression tests against a selected set of models, since downstream users usually have to spend quite a lot of time finding the root cause once there is a regression. Otherwise, they have to sync with the upstream codebase as frequently as possible and test for regressions locally.

cc @jroesch, you may have some comments about the output format or the UX of the test infra.

FrozenGene · November 22, 2020, 7:09am

One question about performance regression: how do we distinguish a real regression from normal fluctuation, especially on CPU? For example, resnet50 may take 20.00 ms, but 20.88 ms after one PR.
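
One possible way to frame this, as a sketch only: repeat each measurement several times and flag a slowdown only when it exceeds both a relative tolerance and a multiple of the baseline's run-to-run standard deviation. The function name and the default thresholds (3% and 3 sigma) are assumptions, not values agreed on in this thread.

```python
import statistics


def is_regression(baseline_ms, candidate_ms, rel_tol=0.03, n_sigma=3.0):
    """Flag a slowdown only if it exceeds both a relative tolerance and
    n_sigma times the run-to-run standard deviation of the baseline."""
    base_mean = statistics.mean(baseline_ms)
    cand_mean = statistics.mean(candidate_ms)
    noise = statistics.stdev(baseline_ms)  # needs at least two baseline runs
    slowdown = cand_mean - base_mean
    return slowdown > rel_tol * base_mean and slowdown > n_sigma * noise


if __name__ == "__main__":
    # Made-up samples in ms, around the 20.00 -> 20.88 example above.
    after_pr = [20.88, 20.90, 20.85, 20.92, 20.85]
    quiet_machine = [20.00, 20.10, 19.95, 20.05, 19.90]
    noisy_machine = [19.0, 21.0, 20.5, 19.5, 20.0]
    print(is_regression(quiet_machine, after_pr))
    print(is_regression(noisy_machine, after_pr))
```

Under these thresholds the same 0.88 ms slowdown counts as a regression on the quiet machine but not on the noisy one, which is why the measured variance of the test machine matters as much as the mean.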

giuseros · November 27, 2020, 6:36pm

Hi @merrymercy,

I think that this is a great idea! A few questions:

Do you think that packaging is a requirement for this work? The point is that if we don’t have binary packages available (where users can simply do pip install tvm), it would be hard for external users to reproduce the benchmark results.
Are you planning to use tvmc under the hood? I believe that using tvmc would greatly simplify scripts like benchmark_autotvm.py, etc.
Why add a separate repository? Why not add it directly to the main one?
Why focusing on ONNX? I believe that ONNX support in TVM lacks quantization (correct me if I am wrong), which would be nice to test as well.

Thanks,

merrymercy · January 12, 2021, 6:39pm

Hi @giuseros,

Thanks for the good suggestions.

This is already addressed by TLC pack, which lets you install tvm via pip.

No, I think scripts are more flexible. But I will investigate tvmc later.

We can move it to mainline and replace the old one when it is mature. For now, we keep it as a separate repo for faster development.

What other formats do you think are better?

merrymercy · January 12, 2021, 6:41pm

I uploaded some initial results to https://github.com/tlc-pack/TLCBench. They include the performance of auto-scheduler and autotvm on AWS c5.9xlarge and g4dn.4xlarge.

Contributions are welcome for more model and backend coverage.