[RFC] Building a new reproducible benchmark for TVM

Great suggestion!

Can we make it as a nightly/weekly regression test utils and also consider adding accuracy evaluation for quantization model into this loop?