For auto-scheduler/autotvm, did you re-tune the model or did you directly reuse the tuning logs provided in TLCBench?
In TLCBench, I also compared auto-scheduler vs. "llvm -libs=cblas" and found that auto-scheduler is comparable to or slightly better than "llvm -libs=cblas".
If you have access to AWS, could you also try running the benchmark on c5.9xlarge? On that instance type, you can directly reuse the log files in TLCBench and see whether you can reproduce the latency numbers listed there. This is a good sanity check for your setup.
There could also be bugs in the conversion of models from PyTorch, but I am not sure.