Benchmark results do not behave as expected

Hi all, this post continues the discussion from "Use all cores in a big.LITTLE architecture":

I am working on a HiKey 970, which contains 4 A73 big cores and 4 A53 small cores. I used module.module.time_evaluator("run", dev, number=1, repeat=repeat) to benchmark BERT models from Hugging Face, and config_threadpool to set the number of threads.
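For readers who want to reproduce the setup, the measurement described above could look roughly like the sketch below. It assumes `module` is an already built TVM graph executor module and `dev` is its device. The helper names `summarize_ms` and `bench_cluster` are made up for illustration. The mode values (1 for big cores, -1 for small cores) follow the convention as I understand it from the linked thread, so please check them against your TVM version. When running over RPC, the function would be fetched from the remote session instead.

```python
import statistics


def summarize_ms(results_s):
    """Convert per-repeat timings in seconds to (mean, stdev) in milliseconds."""
    ms = [t * 1000.0 for t in results_s]
    std = statistics.stdev(ms) if len(ms) > 1 else 0.0
    return statistics.mean(ms), std


def bench_cluster(module, dev, mode, nthreads, repeat=10):
    """Time one thread-pool configuration (hypothetical helper).

    Assumption: mode 1 selects the big cores and -1 the small cores.
    """
    import tvm  # imported lazily so summarize_ms works without TVM installed

    # Look up the packed function that reconfigures the runtime thread pool.
    config_threadpool = tvm.get_global_func("runtime.config_threadpool")
    config_threadpool(mode, nthreads)

    # Same call as in the post: one run per measurement, `repeat` measurements.
    timer = module.module.time_evaluator("run", dev, number=1, repeat=repeat)
    return summarize_ms(timer().results)
```

Reporting the standard deviation next to the mean makes it easier to judge whether a gap such as 42.57ms vs 56.21ms is larger than the run-to-run noise.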

However, I found two odd behaviors, as the following figures show:

  1. Running with 4 small cores (42.57ms) outperforms running with 4 big cores (56.21ms).
  2. Running with 4 big and 4 small cores at the same time (37.12ms) outperforms running with only 4 big or only 4 small cores (42.57ms at best).

[Figures: two benchmark result images]

Therefore, I referred to the TVM benchmark wiki to see how other networks behave. I got similar results with squeezenet and mobilenet: running with 4 small cores again outperforms running with 4 big cores, and running with 4 big and 4 small cores outperforms running with either cluster alone.

[Figures: two benchmark result images]

The only exception is benchmarking the "simple multilayer perceptron", whose behavior is more reasonable.

[Figure: one benchmark result image]

With the simple multilayer perceptron, running with 4 big cores (14.42ms) outperforms running with 4 small cores (24.12ms). Running with 4 big and 4 small cores at the same time (14.82ms) does not improve on running with 4 big cores only (14.42ms) and is in fact slightly slower, presumably due to the communication cost between cores.
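To put the numbers reported so far side by side, the ratios can be computed as in the small sketch below. It only uses the latencies quoted in this post; the `speedup` helper is just an illustrative name, and a value above 1 means the second configuration is faster than the 4-big-core baseline.

```python
# Latencies in milliseconds as reported in this post.
bert_ms = {"4 big": 56.21, "4 small": 42.57, "4 big + 4 small": 37.12}
mlp_ms = {"4 big": 14.42, "4 small": 24.12, "4 big + 4 small": 14.82}


def speedup(latencies_ms, baseline, candidate):
    """Return how many times faster `candidate` is than `baseline` (>1 means faster)."""
    return latencies_ms[baseline] / latencies_ms[candidate]


for name, table in (("BERT", bert_ms), ("MLP", mlp_ms)):
    print(
        name,
        "small vs big: %.2fx" % speedup(table, "4 big", "4 small"),
        "all vs big: %.2fx" % speedup(table, "4 big", "4 big + 4 small"),
    )
```

For BERT both ratios come out above 1 (about 1.32x and 1.51x), while for the multilayer perceptron both are below 1 (about 0.60x and 0.97x), which is the contrast I am asking about.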

Other info: I have also run some simulations with the TensorFlow Lite benchmark and saw the performance degrade when using all cores at the same time (Fig2).

Does anyone have any thoughts on this? Thanks in advance for your input :)

I would like to ask again whether anyone has thoughts on this question. Small cores outperforming big cores makes no sense to me.

Thanks to the community in advance.

Something like BERT, or any transformer, is more likely to be memory-bound than compute-bound. My guess is that it has to do with CPU cache behavior or how your system memory works. I’m not sure, though.
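One way to picture the memory-bound argument is a simple roofline model, sketched below. All numbers are hypothetical and chosen only for illustration; they are not measurements of any real board, and the function name `attainable_gflops` is made up for this example.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)


# Hypothetical numbers for illustration only, not measurements of any real board.
shared_bandwidth_gbs = 10.0  # both clusters assumed to share the same memory
big_peak, small_peak = 60.0, 30.0  # assumed peak compute of each cluster

for intensity in (1.0, 10.0):  # FLOPs performed per byte moved from memory
    big = attainable_gflops(big_peak, shared_bandwidth_gbs, intensity)
    small = attainable_gflops(small_peak, shared_bandwidth_gbs, intensity)
    print(f"intensity={intensity}: big={big} GFLOP/s, small={small} GFLOP/s")
```

At low arithmetic intensity both clusters hit the same bandwidth ceiling, so the extra compute of the big cores buys nothing. Note that this only explains parity between the clusters. Small cores actually being faster would need something beyond this model, such as differences in cache behavior.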


@AndrewZhaoLuo Thanks for your reply.

Maybe I can try running on my desktop and see how it behaves. So far only the result of the simple multilayer perceptron looks reasonable. I expected a relatively light CNN model such as mobilenet to show a similar trend, but it does not, which still confuses me.

More thoughts are welcome :) Thanks, community.

My desktop doesn’t have separate big and small cores, so I am not able to reproduce the result. I did see that performance improves as the number of cores increases.

However, the small cluster outperforming the big cluster still makes no sense to me. I would appreciate any thoughts on this question. Thanks.