How to reduce CPU utilization?

When using the TVM v0.7 C++ API to run inference on the LLVM CPU target, I set TVM_NUM_THREADS=16 (there are 16 logical cores) and then ran a benchmark script that launched 2 std::thread, each running a loop of 1000 synchronous inference calls. All 16 CPUs sit at 100% usage.
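For reference, the benchmark had roughly the structure sketched below. This is not real TVM code: `run_inference_once` is a hypothetical placeholder for the actual GraphExecutor set_input/run/get_output sequence, and the counter only exists so the sketch is checkable.

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> g_calls{0};

// Hypothetical stand-in for one synchronous TVM inference
// (set_input / run / get_output on a GraphExecutor).
void run_inference_once() { g_calls.fetch_add(1, std::memory_order_relaxed); }

// Launch `n_threads` benchmark threads, each issuing `iters` synchronous
// inference calls, and wait for all of them -- the structure described above.
void run_benchmark(int n_threads, int iters) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i) run_inference_once();
        });
    }
    for (auto& w : workers) w.join();
}
```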

But when I run the same case in TensorFlow, each of the 16 CPUs is at about 40% usage.

I also read that TVM_BIND_THREADS=1 sets CPU affinity; however, it seems to have no effect whether I set it (TVM_BIND_THREADS=1) or unset it (TVM_BIND_THREADS=0).

How can I set/tune these parameters: TVM_NUM_THREADS, TVM_BIND_THREADS, OMP_NUM_THREADS?
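For what it's worth, these variables are read when TVM's thread pool starts, so they have to be set before the first inference call, either in the shell before launching the process or early in main(). A POSIX sketch (the values here are examples, not recommendations):

```cpp
#include <cstdlib>

// Sketch: set TVM threading environment variables before the runtime's
// thread pool is created. Values shown are examples only.
void configure_tvm_threading() {
    setenv("TVM_NUM_THREADS", "8", /*overwrite=*/1);   // worker threads in TVM's pool
    setenv("TVM_BIND_THREADS", "1", /*overwrite=*/1);  // ask the pool to pin workers to cores
    setenv("OMP_NUM_THREADS", "8", /*overwrite=*/1);   // only relevant if built with OpenMP
}
```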

I searched some similar topics but didn't find an answer to this problem.

Could you please shed some light here? Thanks.

TensorFlow has the concepts of intra_op_parallelism_threads and inter_op_parallelism_threads.

But TVM seems to support only intra-op threads. So if a bunch of predict requests come in, how can TVM process them in parallel? TVM executes the graph sequentially in GraphExecutor::run(). Will it yield the CPU while executing one graph?

Hyper-threading works well when you have tasks (threads or processes) of a different nature that can occupy different parts of the CPU. If your tasks are all of the same type, it's better to limit the thread count to the number of physical cores, i.e. 8 in your case. You already mentioned the way to do this in your first sentence:


BTW, TVM has this logic by default: on x86 processors it halves the number of worker threads here to account for hyper-threading. So unless you need more advanced logic, you don't have to set TVM_NUM_THREADS at all.
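The default amounts to roughly the following sketch (illustration only; the real heuristic lives in TVM's threading backend, not in this function):

```cpp
#include <thread>

// Sketch of the default described above: on hyper-threaded x86, use
// roughly one worker per physical core, i.e. half the logical cores.
unsigned default_worker_count() {
    unsigned logical = std::thread::hardware_concurrency();  // may be 0 if unknown
    return logical >= 2 ? logical / 2 : 1;
}
```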

The problem still exists. IMHO, TF has inter-op threads, so for a single graph it can run several ops in parallel, while TVM processes ops sequentially and has no inter-op concept. In my case we run 4 user threads, each issuing 1000 predict requests, with the following backend settings: CPU: 16 cores; TF: inter_op=4, intra_op=8; TVM: OMP_NUM_THREADS=8. TF shows much better performance than TVM in both latency (per predict) and CPU usage.

My question is: given that TVM has no inter-op parallelism, how can TVM beat TF, which does have it, on single-graph inference?

Ah, got what you meant. No, TVM does not have out-of-order execution; it executes operations in the single order given in the JSON file. To execute several branches in parallel, TVM would have to be extended.

On the other hand, the situation can be improved for some usage models right now, with current TVM. In fact there are two flows:

  • Best latency: when you want to minimize the time of a single inference
  • Best throughput: when you want to process the maximum amount of input data

The branch-parallelism approach can definitely decrease inference time, but if you want to process the most data, you can instead run several inferences in parallel, limiting the hardware resources of each. The reason several workers can beat one running sequentially is the same as for running several ops in parallel in TF: poor concurrency within individual layers. For example, it is often better to run 2 instances in parallel with 4 cores each than one inference utilizing all 8 cores.
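A minimal sketch of that throughput setup, with placeholder work instead of real inference: each worker stands in for a separate inference instance, which in practice would own its own GraphExecutor and have its engine limited to cores_total / instances threads (e.g. via TVM_NUM_THREADS).

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Process `total_requests` with `instances` independent workers pulling
// from a shared counter; returns the number of requests handled. The
// fetch_add on the placeholder counter stands in for one real inference.
int serve_requests(int instances, int total_requests) {
    std::atomic<int> next{0};
    std::atomic<int> done{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < instances; ++i) {
        workers.emplace_back([&] {
            int r;
            while ((r = next.fetch_add(1)) < total_requests) {
                done.fetch_add(1);  // placeholder for one real inference
            }
        });
    }
    for (auto& w : workers) w.join();
    return done.load();
}
```

Two instances with 4 cores each versus one with 8 is exactly `serve_requests(2, N)` versus `serve_requests(1, N)` with the per-instance thread limit adjusted accordingly.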

The exact number of inference instances and cores per instance can only be chosen experimentally for a given network.
