@comaniac The workload is a complete ResNet-like network, with only one dense layer at the very top. When I build TVM with OpenBLAS/ATLAS and enable -libs=cblas, performance gets worse by ~3%. When I build TVM with MKL, there’s no difference in performance with or without -libs=cblas. This behaviour is kind of unexpected imho.
By the way, I’m already specifying -mcpu in the target.
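
To make the comparison concrete, here is a minimal sketch of the two target strings being compared and of how the relative slowdown is computed. It assumes an llvm CPU target; the -mcpu value (skylake-avx512) and the helper names make_target and slowdown_pct are placeholders for illustration, not my actual setup, and no real timings are included.

```python
def make_target(mcpu, use_cblas):
    """Assemble a TVM-style target string.

    `mcpu` is a placeholder value; substitute the CPU you actually target.
    """
    parts = ["llvm", f"-mcpu={mcpu}"]
    if use_cblas:
        # Ask TVM to offload supported ops (e.g. dense) to the BLAS library
        # it was built against.
        parts.append("-libs=cblas")
    return " ".join(parts)


def slowdown_pct(baseline_ms, candidate_ms):
    """Relative change of candidate vs. baseline, in percent (positive = slower)."""
    return (candidate_ms - baseline_ms) * 100.0 / baseline_ms


# The two configurations under comparison (hypothetical -mcpu value).
baseline_target = make_target("skylake-avx512", use_cblas=False)
cblas_target = make_target("skylake-avx512", use_cblas=True)
print(baseline_target)
print(cblas_target)
```

Each string would be passed as the target when building the same model, and slowdown_pct would then be applied to the two measured mean run times.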