[Performance] TVM - pytorch BERT on CPU

Thanks for the detailed information.

For Q1, when you extract tasks with the target llvm -mcpu=skylake-avx512 -libs=cblas, some operators (e.g., dense) are offloaded to cblas. Those operators are not compiled by the TVM codegen, so AutoScheduler won’t see them and cannot tune them.
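
If you do want AutoScheduler to see dense, one option is to extract tasks with a target string that has no -libs option. Here is a minimal sketch of that idea. Note that strip_libs is a hypothetical helper written for this post, not a TVM API; you would pass its result wherever you currently pass the target for task extraction.

```python
def strip_libs(target: str) -> str:
    """Return the target string with any -libs=... option removed.

    Hypothetical helper (not part of TVM): it only edits the string,
    so operators are no longer requested to be offloaded to an
    external library such as cblas.
    """
    tokens = target.split()
    kept = [tok for tok in tokens if not tok.startswith("-libs=")]
    return " ".join(kept)


tuning_target = strip_libs("llvm -mcpu=skylake-avx512 -libs=cblas")
print(tuning_target)  # llvm -mcpu=skylake-avx512
```

Whether the tuned dense schedules end up faster than cblas is something you would have to measure on your machine.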

For Q2, the two differences you pointed out don't seem to have much impact. You could try using the debugger to compare the per-operator latency breakdown of the two models; see the Debugger page in the tvm 0.8.dev0 documentation.
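
Once you have a per-operator time for each model, a small script can rank the operators by how much they differ. This is only a sketch: compare_breakdown is a hypothetical helper, and the operator names and numbers below are made-up placeholders, not measurements. It assumes you have already copied the timings out of the debugger output into two dicts that use the same time unit.

```python
def compare_breakdown(a, b):
    """Rank operators by the absolute time difference between two models.

    a and b map an operator name to its measured time (same unit in both).
    An operator missing from one model counts as 0.0 there.
    Returns rows of (op, time_a, time_b, time_b - time_a).
    """
    rows = []
    for op in set(a) | set(b):
        time_a = a.get(op, 0.0)
        time_b = b.get(op, 0.0)
        rows.append((op, time_a, time_b, time_b - time_a))
    # Largest absolute difference first; operator name breaks ties.
    return sorted(rows, key=lambda r: (-abs(r[3]), r[0]))


# Placeholder inputs for illustration only (not real measurements).
model_a = {"dense": 10.0, "softmax": 2.0}
model_b = {"dense": 4.0, "softmax": 2.5, "layout_transform": 1.0}

for op, time_a, time_b, diff in compare_breakdown(model_a, model_b):
    print(f"{op:20s} a={time_a:8.2f} b={time_b:8.2f} diff={diff:+8.2f}")
```

The operators at the top of the list are the first place to look when explaining the gap between the two models.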