Use all cores in a big.LITTLE architecture

I’ve got 4 LITTLE cores and 4 big cores. In other code I’ve written for these platforms, I’ve been able to use all 8 cores, to observe interesting behaviour.

I’ve looked at this thread, and opinion seems to be mixed, though @eqy seems to think it’s possible. A linked thread suggests that having code such as:

        if self.big_little:
            config_func = self.session.get_function('runtime.config_threadpool')
            # config_func(1, 1) # use 1 big core
            # config_func(1, 2) # use 2 big cores
            # config_func(-1, 1) # use 1 small core
            # config_func(-1, 2) # use 2 small cores
            config_func(4, 4)

might work; however, it has not worked for me.

This thread discusses thread binding, but playing around with the environment variables has not changed the behaviour I observe.
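For reference, this is roughly what I tried (a sketch; I am assuming TVM_NUM_THREADS and TVM_BIND_THREADS are the variables the threading backend reads, and that they need to be set before the runtime is loaded):

    import os

    # Assumed behaviour: the thread pool reads these variables at start-up,
    # so they must be set before the TVM runtime is imported/loaded.
    os.environ["TVM_NUM_THREADS"] = "8"   # ask for 8 worker threads
    os.environ["TVM_BIND_THREADS"] = "0"  # do not pin threads to specific cores

    import tvm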

As the APIs have developed since those prior threads, is there now a more canonical way of doing this? I'm not worried about clever load balancing for now; I would just like to run with 8 threads.

Does anyone have any ideas about this?

I don’t think it is a good way to go. When we use big and LITTLE cores together, we will get worse performance because the work crosses the big.LITTLE architecture.

Hello @FrozenGene. I do agree that we will get worse performance when using all cores.

I have run some simulations on the TensorFlow Lite benchmark and seen the performance when using all cores (Fig. 2).

But I am wondering how I can adjust the number of threads. I have checked the tvm/src/runtime/threading_backend.cc file and found that the default setting uses the 4 big cores. I have tried to adjust the number of threads (e.g. using only 1 small core, or only 3 big cores), but it seems the inference still uses 4 cores.

Thanks.

I assume you have obtained the ‘remote’ handle correctly. Then we can get the function:

config_threadpool = remote.get_function('runtime.config_threadpool')
# affinity_mode: kBig = 1, kLittle = -1, kDefault = 0; pass 1 or -1 to control which cores are used
config_threadpool(affinity_mode, num_threads)
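For example, end to end it looks roughly like this (a sketch; the IP address and port are placeholders for your own device and RPC setup):

    from tvm import rpc

    # Placeholder address/port for the board running the TVM RPC server.
    remote = rpc.connect("192.168.1.100", 9090)

    config_threadpool = remote.get_function('runtime.config_threadpool')
    config_threadpool(1, 4)    # 4 big cores
    config_threadpool(-1, 4)   # 4 LITTLE cores
    config_threadpool(0, 8)    # default affinity, 8 threads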

Hello @FrozenGene

Thanks for your answer; I am now able to set the number of threads through remote. (I use my own desktop and send the network to the HiKey 970 with the remote command, then use “htop” to check the corresponding CPU assignment on the HiKey 970, and everything looks good.)

Code of running_simulation.py on my desktop.

However, I am wondering if it is possible to set the number of cores on the HiKey 970 locally, without using remote (i.e. running locally with remote_info = None in the following function).

Thanks for your help :slight_smile:

Also, I found the performance actually gets better when using big and small cores at the same time. During the simulation, I use “htop” to check the number of threads.

Here are the simulation results with [1/2/3/4 big cores] and [4 big + 1/2/3/4 small cores].

This is different from what I got when benchmarking in TensorFlow Lite.

Any thoughts on that? Thanks for your help :slight_smile:

No. Because the runtime is on the remote device and the default behaviour is to run on all big cores, you should use this function to control the cores. When you deploy it on devices in C++ for production, you can use the C++ API in the app to control it.

Please make sure you really have 8 cores available, because our thread pool checks the number of cores. When the affinity mode is default (0) and your board only has 4 cores, if you set 5, 6, or anything greater than 4, we will still only use 4 cores. The inference time may differ because of unstable measurement, for example other apps using the cores, or because you are running inference too few times. Consider running 200 times; we have the time_evaluator utility function to do it.
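For example (a sketch, assuming “module” is your loaded GraphModule and “dev” is the device context on the board):

    # Run the model many times so the mean is stable.
    ftimer = module.module.time_evaluator("run", dev, number=10, repeat=200)
    results = ftimer().results  # per-repeat mean times in seconds
    print("Mean inference time: %.2f ms" % (1000 * sum(results) / len(results)))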

Hello @FrozenGene, thanks for your quick reply.

I am using a HiKey 970, which contains 8 cores (4 big cores and 4 small cores), so I do have 8 cores. Also, I used “htop” to check the number of threads and make sure my setting is correct. For example, on one side (my own desktop) I run over remote with only 1 big core; on the other side the HiKey 970 receives the command and runs the benchmark, and htop shows the HiKey 970 is indeed using only one big core (core number 6, which is a big core).

I am already using time_evaluator with repeat=50 (I think that is big enough, and it is the same setting I used when I ran inference in TensorFlow Lite), but I still got the following result.

affinity mode is:  1 , core number is: 4
Mean inference time (std dev): 56.44 ms (0.36 ms)

affinity mode is:  0 , core number is: 1
Mean inference time (std dev): 179.23 ms (0.23 ms)
affinity mode is:  0 , core number is: 2
Mean inference time (std dev): 97.08 ms (0.14 ms)
affinity mode is:  0 , core number is: 3
Mean inference time (std dev): 74.11 ms (0.04 ms)
affinity mode is:  0 , core number is: 4
Mean inference time (std dev): 56.69 ms (0.90 ms)
affinity mode is:  0 , core number is: 5
Mean inference time (std dev): 54.50 ms (0.63 ms)
affinity mode is:  0 , core number is: 6
Mean inference time (std dev): 45.72 ms (1.45 ms)
affinity mode is:  0 , core number is: 7
Mean inference time (std dev): 45.23 ms (2.40 ms)
affinity mode is:  0 , core number is: 8
Mean inference time (std dev): 37.28 ms (2.27 ms)

As the results show, running with 4 big and 4 small cores outperforms running with only 4 big cores.

Sorry, I am using the following Python code to run my simulation, and I am not very familiar with the C++ backend settings. I did go over thread_pool.cc and threading_backend.cc, but I am still wondering whether it is possible to call the “TVM_REGISTER_GLOBAL(“runtime.config_threadpool”)” function or “Configure(AffinityMode mode, int nthreads, bool exclude_worker0)” from Python, as we did in the remote setting (i.e. remote.get_function(‘runtime.config_threadpool’)).
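Something like this is what I had in mind (a sketch; I am assuming the function registered with TVM_REGISTER_GLOBAL can be looked up locally with tvm.get_global_func when running directly on the board):

    import tvm

    # Look up the packed function registered as
    # TVM_REGISTER_GLOBAL("runtime.config_threadpool") in the local runtime.
    config_threadpool = tvm.get_global_func("runtime.config_threadpool")
    config_threadpool(1, 4)   # affinity_mode = 1 (big cores), 4 threads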

As shown here: apache/tvm/blob/322aad5b7cee4bcbeaac15de7bd7ac7cec675ee4/src/runtime/threading_backend.cc#L63

we will restrict the number of cores used to the maximum number of cores available. Could you double-check?

Hello @FrozenGene, yes, I know there is a “MaxConcurrency()” function in threading_backend.cc to control the maximum number of cores we can use.

What I did: export TVM_NUM_THREADS=8 on the HiKey 970, since it has 8 cores. Afterwards, I ran the following code:

    config_threadpool(0, 8)
    ftimer = module.module.time_evaluator("run", dev, repeat=50, min_repeat_ms=500)

I use htop, and indeed the benchmark is using all 8 cores (CPU utilization ≈ 800%).

I have run it several times and the results I get are consistent:

  1. Running with 4 small cores even outperforms running with 4 big cores.
  2. Running with 4 big and 4 small cores outperforms running with only 4 big cores.

Both are weird to me…
