TVM and BLAS libraries

It depends. What's your workload? Is it just a dense op or an entire network? If you profile an entire network, dense may not be the performance bottleneck, so the overall impact of using CBLAS would be smaller. I did an experiment months ago using a 512x512 matrix to run dense with and without CBLAS. The one with CBLAS was ~1.25x faster.
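To make the "moderated impact" point concrete, here is a rough back-of-the-envelope sketch using Amdahl's law. The 1.25x figure is the dense speedup from the experiment above; the fractions of runtime spent in dense (100%, 50%, 20%) are hypothetical values picked for illustration, not measurements from any real network.

```python
def overall_speedup(dense_fraction, dense_speedup):
    """Estimate end-to-end speedup via Amdahl's law.

    dense_fraction: share of total runtime spent in dense ops (0.0 to 1.0).
                    Hypothetical input; measure it by profiling your network.
    dense_speedup:  how much faster dense alone becomes (e.g. 1.25 with CBLAS).
    """
    if not 0.0 <= dense_fraction <= 1.0:
        raise ValueError("dense_fraction must be between 0 and 1")
    if dense_speedup <= 0:
        raise ValueError("dense_speedup must be positive")
    # Time not spent in dense is unchanged; dense time shrinks by dense_speedup.
    return 1.0 / ((1.0 - dense_fraction) + dense_fraction / dense_speedup)


# Illustrative fractions only: a pure dense workload vs. networks where
# dense accounts for half or a fifth of the runtime.
for fraction in (1.0, 0.5, 0.2):
    print(f"dense = {fraction:.0%} of runtime -> "
          f"overall speedup {overall_speedup(fraction, 1.25):.3f}x")
```

So if dense is only about 20% of the runtime, a 1.25x faster dense gives roughly a 1.04x faster network, which is why profiling the whole workload first is worthwhile.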

If no external library is specified (e.g., the target is just llvm), TVM generates LLVM IR and then lowers it to machine code directly. In this case, it's suggested to specify at least the CPU model so that TVM can use AVX instructions. For example: llvm -mcpu=skylake-avx512.
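As a small sketch of how these target strings fit together: the helper below (llvm_target is a hypothetical name, not part of TVM) just assembles the string. The -mcpu option is the one mentioned above; -libs=cblas is the target option I'd expect to use to route dense to CBLAS, assuming your TVM build was compiled with BLAS support enabled.

```python
def llvm_target(mcpu=None, libs=()):
    """Build a TVM-style LLVM target string (hypothetical helper).

    mcpu: CPU model passed as -mcpu, e.g. "skylake-avx512".
    libs: external libraries passed as -libs, e.g. ["cblas"].
    """
    parts = ["llvm"]
    if mcpu:
        parts.append(f"-mcpu={mcpu}")
    if libs:
        parts.append("-libs=" + ",".join(libs))
    return " ".join(parts)


# Plain LLVM codegen, no CPU model given.
generic = llvm_target()
# LLVM codegen with the CPU model specified so AVX-512 can be used.
tuned = llvm_target(mcpu="skylake-avx512")
# Same CPU model, with dense offloaded to CBLAS (assumes a BLAS-enabled build).
with_cblas = llvm_target(mcpu="skylake-avx512", libs=["cblas"])

for target in (generic, tuned, with_cblas):
    print(target)
```

Each of these strings can then be passed wherever TVM expects a target, so it's easy to build the same workload three ways and compare timings on your own machine.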