Benchmark results do not behave as expected

popojames · October 12, 2021, 6:12pm

Hi all, this post continues the discussion from Use all cores in a big.LITTLE architecture:

I am working on a HiKey 970, which contains 4 A73 big cores and 4 A53 small cores. I used module.module.time_evaluator("run", dev, number=1, repeat=repeat) to benchmark BERT models from Hugging Face, and used config_threadpool to set the number of threads.

Use all cores in a big.LITTLE architecture

# `remote` is assumed to be the RPC session connected to the board
config_threadpool = remote.get_function('runtime.config_threadpool')
# affinity_mode: kBig = 1, kLittle = -1, kDefault = 0; pass 1 or -1 to control the cores
config_threadpool(affinity_mode, num_threads)
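A minimal sketch of how a sweep over thread-pool settings could be organized. The helper name `sweep_threadpool` and the `time_once` callable are hypothetical; both callables are passed in as parameters so the sketch does not depend on a TVM install. It assumes `time_once` returns the per-repeat latencies in seconds.

```python
import statistics
from typing import Callable, Dict, Iterable, List, Tuple


def sweep_threadpool(
    config_threadpool: Callable[[int, int], None],
    time_once: Callable[[], List[float]],
    configs: Iterable[Tuple[int, int]],
) -> Dict[Tuple[int, int], Tuple[float, float]]:
    """Return {(affinity_mode, num_threads): (mean_ms, std_ms)}."""
    results = {}
    for affinity_mode, num_threads in configs:
        # Reconfigure the runtime thread pool before each measurement.
        config_threadpool(affinity_mode, num_threads)
        # Convert the per-repeat latencies from seconds to milliseconds.
        samples_ms = [s * 1000.0 for s in time_once()]
        mean_ms = statistics.fmean(samples_ms)
        # Population std dev; with a single repeat this is simply 0.
        std_ms = statistics.pstdev(samples_ms)
        results[(affinity_mode, num_threads)] = (mean_ms, std_ms)
    return results
```

With TVM, `time_once` could plausibly be `lambda: timer().results`, where `timer` is the object returned by `time_evaluator` above; this assumes the result object exposes per-repeat times in seconds as `.results`, which is worth checking against the installed TVM version.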

However, I found two weird behaviors, as the following figures show:

Running with 4 small cores (42.57ms) outperforms running with 4 big cores (56.21ms).
Running with 4 big and 4 small cores at the same time (37.12ms) outperforms running with only 4 big (56.21ms) or only 4 small cores (42.57ms).
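For reference, the ratios implied by these numbers can be worked out with a few lines of arithmetic. The dictionary keys below are illustrative labels only; the latencies are the ones quoted in the list above.

```python
# Latencies in milliseconds, as quoted in the list above.
latency_ms = {"4_big": 56.21, "4_little": 42.57, "4_big_4_little": 37.12}


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster `candidate` is than `baseline` (>1 means faster)."""
    return latency_ms[baseline] / latency_ms[candidate]


little_vs_big = speedup("4_big", "4_little")
all_vs_little = speedup("4_little", "4_big_4_little")
print(f"4 little vs 4 big:      {little_vs_big:.2f}x")
print(f"4+4 cores vs 4 little:  {all_vs_little:.2f}x")
```

In other words, the small cluster appears roughly 1.3x faster than the big cluster here, and using all eight cores gains only about another 1.15x on top of that.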

Therefore, I referred to the TVM benchmark wiki to see how other networks behave. I got similar results for squeezenet and mobilenet: running with 4 small cores even outperforms running with 4 big cores, and running with 4 big and 4 small cores outperforms running with only 4 big or only 4 small cores.

The only exception is the benchmark of a simple multilayer perceptron, whose behavior is more reasonable.

For the simple multilayer perceptron, running with 4 big cores (14.42ms) outperforms running with 4 small cores (24.12ms). Running with 4 big and 4 small cores at the same time (14.82ms) is no faster than running with 4 big cores only (14.42ms), presumably due to the communication cost between cores.

Other info: I have run some simulations with the TensorFlow Lite benchmark and saw performance degrade when using all cores at the same time. (Fig2)

Does anyone have any thoughts on that? Thanks for your input in advance:)

popojames · October 14, 2021, 6:43pm

Are there any thoughts on this question? Small cores outperforming big cores makes no sense to me.

Thanks to the community in advance.

AndrewZhaoLuo · October 15, 2021, 6:01am

Something like BERT, or any transformer, is more likely to be memory-bound than compute-bound. My guess would be that it has to do with CPU cache behavior or how your system memory works. I’m not sure, though.
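To make the memory-bound versus compute-bound distinction concrete, here is a rough roofline-style sketch. All numbers in it are hypothetical and chosen only for illustration; they are not measurements of the board or of BERT.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def is_memory_bound(peak_gflops: float, bandwidth_gbs: float,
                    flops_per_byte: float) -> bool:
    # Below the ridge point (peak / bandwidth) memory traffic is the limit.
    return flops_per_byte < peak_gflops / bandwidth_gbs


# Hypothetical machine: 40 GFLOP/s peak, 10 GB/s memory bandwidth,
# so the ridge point sits at 4 FLOP per byte. NOT real hardware numbers.
PEAK_GFLOPS, BANDWIDTH_GBS = 40.0, 10.0

low_intensity = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 0.5)
high_intensity = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 16.0)
print(low_intensity, is_memory_bound(PEAK_GFLOPS, BANDWIDTH_GBS, 0.5))
print(high_intensity, is_memory_bound(PEAK_GFLOPS, BANDWIDTH_GBS, 16.0))
```

If a workload sits below the ridge point and both clusters share the same memory bandwidth, a higher compute peak on the big cores would not help much under this model, which is one way the guess above could play out.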

popojames · October 15, 2021, 1:07pm

@AndrewZhaoLuo Thanks for your reply.

Maybe I can try running on my desktop and see how it behaves. As we saw, only the result for the simple multilayer perceptron is reasonable, so I was expecting a relatively light CNN model such as mobilenet to show a similar trend, but it does not, which still confuses me.

More thoughts are welcome. Thanks, community.

popojames · October 22, 2021, 2:34pm

My desktop doesn’t have big and small cores, so I am not able to reproduce the result there. I did see that performance improves as the number of cores increases.

However, the small cluster outperforming the big cluster still makes no sense to me. Are there any thoughts on this question? Thanks.

popojames · October 26, 2021, 1:59pm

After rebuilding TVM from scratch and trying again, I now get normal results. This is what I got with the new TVM build:

Note: affinity_mode: kBig = 1, kLittle = -1, kDefault = 0. Pass 1 or -1 to control the cores.

H=512 L=12 BERT

affinity mode is: -1 , core number is: 2

Mean inference time (std dev): 530.63 ms (12.87 ms)

affinity mode is: -1 , core number is: 4

Mean inference time (std dev): 298.80 ms (2.79 ms)

affinity mode is: 1 , core number is: 2

Mean inference time (std dev): 227.32 ms (1.17 ms)

affinity mode is: 1 , core number is: 4

Mean inference time (std dev): 143.89 ms (4.97 ms)

affinity mode is: 0 , core number is: 5

Mean inference time (std dev): 165.22 ms (3.75 ms)

affinity mode is: 0 , core number is: 6

Mean inference time (std dev): 224.16 ms (3.47 ms)

affinity mode is: 0 , core number is: 8

Mean inference time (std dev): 181.72 ms (15.10 ms)

Mobilenet

affinity mode is: -1 , core number is: 2

Mean inference time (std dev): 294.93 ms (2.86 ms)

affinity mode is: -1 , core number is: 4

Mean inference time (std dev): 146.83 ms (1.45 ms)

affinity mode is: 1 , core number is: 2

Mean inference time (std dev): 97.59 ms (0.91 ms)

affinity mode is: 1 , core number is: 4

Mean inference time (std dev): 55.96 ms (1.05 ms)

affinity mode is: 0 , core number is: 5

Mean inference time (std dev): 112.20 ms (2.04 ms)

affinity mode is: 0 , core number is: 6

Mean inference time (std dev): 95.35 ms (1.53 ms)

affinity mode is: 0 , core number is: 8

Mean inference time (std dev): 76.14 ms (4.59 ms)
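For anyone who wants to compare such runs programmatically, here is a small sketch that parses log lines in the format printed above and picks the fastest configuration. The function names `parse_log` and `fastest` are made up for illustration, and the sample string contains only three of the BERT entries.

```python
import re
from typing import List, Tuple

HEADER = re.compile(r"affinity mode is:\s*(-?\d+)\s*,\s*core number is:\s*(\d+)")
TIMING = re.compile(r"Mean inference time \(std dev\):\s*([\d.]+) ms \(([\d.]+) ms\)")

Row = Tuple[int, int, float, float]  # (affinity_mode, cores, mean_ms, std_ms)


def parse_log(text: str) -> List[Row]:
    """Pair each 'affinity mode' line with the timing line that follows it."""
    rows: List[Row] = []
    pending = None
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            pending = (int(header.group(1)), int(header.group(2)))
            continue
        timing = TIMING.search(line)
        if timing and pending is not None:
            rows.append((*pending, float(timing.group(1)), float(timing.group(2))))
            pending = None
    return rows


def fastest(rows: List[Row]) -> Row:
    """Return the row with the lowest mean latency."""
    return min(rows, key=lambda row: row[2])


bert_log = """
affinity mode is: -1 , core number is: 4
Mean inference time (std dev): 298.80 ms (2.79 ms)
affinity mode is: 1 , core number is: 4
Mean inference time (std dev): 143.89 ms (4.97 ms)
affinity mode is: 0 , core number is: 8
Mean inference time (std dev): 181.72 ms (15.10 ms)
"""
rows = parse_log(bert_log)
best = fastest(rows)
print(best)
```

On the numbers posted above, 4 big cores is the fastest configuration for both models, and it is roughly 2.1x (BERT) and 2.6x (Mobilenet) faster than 4 small cores.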