[RFC] Building a new reproducible benchmark for TVM

One question about performance regressions: how do we judge normal fluctuation, especially on CPU? For example, ResNet-50 might take 20.00 ms, but then take 20.88 ms after one PR. Is that a real regression or just noise?
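To make the question concrete, here is a rough sketch of the kind of check I have in mind. It is not anything TVM provides: the function names, the threshold `k`, and the idea of using the baseline's run-to-run standard deviation as the noise estimate are all my own assumptions for illustration.

```python
import statistics


def relative_change(baseline_ms, candidate_ms):
    """Fractional slowdown of the candidate relative to the baseline."""
    return (candidate_ms - baseline_ms) / baseline_ms


def exceeds_noise(baseline_runs, candidate_runs, k=3.0):
    """Return True if the candidate's mean latency is slower than the
    baseline's mean by more than k standard deviations of the baseline runs.

    Assumption: repeated baseline runs on the same machine give a usable
    noise estimate. The threshold k=3.0 is an arbitrary illustrative value.
    """
    base_mean = statistics.mean(baseline_runs)
    cand_mean = statistics.mean(candidate_runs)
    noise = statistics.stdev(baseline_runs)  # sample standard deviation
    return (cand_mean - base_mean) > k * noise
```

Going from 20.00 ms to 20.88 ms is a relative change of about 4.4%. Whether that counts as a regression under a check like this depends entirely on how much the repeated baseline runs vary, which is exactly what I am unsure how to decide on CPU.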