[RFC] Building a new reproducible benchmark for TVM

Yeah, a performance regression test would be very nice. We often have to binary-search the commit history to find the commit that caused a regression.
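For anyone who hasn't done this by hand, here is a rough sketch of that binary search. It is only an illustration, not anything that exists in TVM: `first_regressed_commit`, `is_regressed`, the commit names, and the latency numbers are all made up, and it assumes the oldest commit is good, the newest is regressed, and the slowdown persists once introduced.

```python
from typing import Callable, Sequence


def first_regressed_commit(
    commits: Sequence[str],
    is_regressed: Callable[[str], bool],
) -> str:
    """Return the earliest commit for which is_regressed() is True.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is regressed, and the regression never disappears again.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] regressed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_regressed(commits[mid]):
            hi = mid  # regression was introduced at or before mid
        else:
            lo = mid  # regression was introduced after mid
    return commits[hi]


# Hypothetical, made-up latencies (ms) standing in for real benchmark runs.
latency_ms = {"c0": 10.0, "c1": 10.1, "c2": 10.0, "c3": 13.2, "c4": 13.1, "c5": 13.3}
commits = list(latency_ms)
baseline = latency_ms[commits[0]]

# Treat anything more than 10% slower than the baseline as a regression.
culprit = first_regressed_commit(commits, lambda c: latency_ms[c] > 1.1 * baseline)
print(culprit)
```

In practice `is_regressed` would check out the commit, rebuild, and run the benchmark, which is the expensive part. `git bisect run` automates the same search given a script that exits non-zero on a regressed commit.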