Strassen Algorithm for Dense

In your case, the current code will use 4 cores (ids 0 to 3), so running in parallel gives you better performance.

As for the time-consuming functions: do you use AutoTVM? If so, note that by default TVM uses a big core (index 7). If you decide to run on the 4 little cores, you should make AutoTVM tune on those same 4 little cores too; otherwise the tuned schedules are measured on a different core type than the one you deploy on. The elegant solution would be a thread_mod option that users can set (see link: autotvm.RPCRunner and TVM_NUM_THREADS). The current workaround is to temporarily disable cores 4, 5, 6 and 7 on the device. (We do need to provide an interface that lets users control big / little cores when tuning.)
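
To make the workaround concrete, here is a rough sketch of the two pieces involved. It assumes your board numbers the little cores 0 to 3 and the big cores 4 to 7, and that the device runs Linux with CPU hotplug exposed through sysfs. The helper names (`tuning_env`, `offline_commands`) are made up for illustration and are not part of TVM. The script only prints what it would do; you would run the printed commands as root on the device and start the RPC server with the printed environment.

```python
import os

# Assumption taken from this thread: ids 0-3 are little cores, 4-7 are big cores.
LITTLE_CORES = [0, 1, 2, 3]
BIG_CORES = [4, 5, 6, 7]


def tuning_env(core_ids, base_env=None):
    """Return a copy of the environment with TVM_NUM_THREADS set to the
    number of cores we want the TVM runtime thread pool to use."""
    env = dict(os.environ if base_env is None else base_env)
    env["TVM_NUM_THREADS"] = str(len(core_ids))
    return env


def offline_commands(core_ids):
    """Build shell commands that take the given cores offline through the
    Linux CPU hotplug sysfs files (must be run as root on the device)."""
    return [
        "echo 0 > /sys/devices/system/cpu/cpu{}/online".format(c)
        for c in core_ids
    ]


if __name__ == "__main__":
    env = tuning_env(LITTLE_CORES)
    print("TVM_NUM_THREADS =", env["TVM_NUM_THREADS"])
    for cmd in offline_commands(BIG_CORES):
        print(cmd)
```

After tuning, write 1 to the same `online` files to bring the big cores back.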