[MultiThread][ThreadPool] Performance degradation when running relay module in multiple threads

  • Device: Skylake 8163 with 48 physical cores

  • Env Setting: TVM_BIND_THREADS=0 TVM_NUM_THREADS=4

  • Code Snippet:

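from tvm.contrib import graph_executor

module = graph_executor.GraphModule(lib["default"](ctx))

def thread_run():
    # Each thread calls module.run() `repeats` times.
    for i in range(repeats):
        module.run()

# PropagatingThread is a user-defined Thread subclass (definition not shown).
threads = []
for i in range(num_threads):
    threads.append(PropagatingThread(
        target=thread_run,
    ))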

Since each thread still occupies 4 physical cores in a 2-threaded run, the performance is expected to be the same as in a single-threaded run. But each thread in a 2-threaded run actually reaches only 50% of the single-threaded performance, each thread in a 4-threaded run only 25%, and so on…

Any idea about the performance degradation in multi-threaded run?

Please @ anyone who has experience with this.

@tqchen @junrushao @comaniac @FrozenGene :smiley:

How about setting the environment variable TVM_THREAD_POOL_SPIN_COUNT=0? Does it improve your case?

Thanks. It doesn’t improve. Performance in both the single-threaded and multi-threaded runs drops by 10-20%, and CPU utilization drops from nearly 100% to nearly 50% with TVM_THREAD_POOL_SPIN_COUNT=0, which I think is reasonable.

One additional piece of info: in a multi-process run, the performance of module.run() in each process does not drop.

If we change the thread pool to OpenMP, do we have the same problem?

Yes, the same problem occurs with OpenMP as the threading backend. We didn’t see much difference between the TVM ThreadPool and OMP in such cases.

Do you mean that with Python’s multi-processing you get the ideal result, but with Python’s multi-threading you get a bad result? Or do you mean something else?

By “multi-process” I mean running the Python script twice simultaneously, while “multi-thread” is implemented with threading.Thread inside the Python script.
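To make the two setups concrete, here is a minimal sketch that times the same CPU-bound work in N threads and in N separately launched interpreters. The names WORK_SRC, busy_work, time_threads and time_processes are made up for illustration, and the pure-Python loop is only a stand-in for the benchmark loop, not the original script. Note that the process timing also includes interpreter startup.

```python
import subprocess
import sys
import threading
import time

# Hypothetical stand-in workload: a pure-Python CPU-bound loop.
WORK_SRC = "total = 0\nfor i in range(300000):\n    total += i * i\n"


def busy_work():
    # Run the same source the child interpreters run, so both setups do equal work.
    exec(WORK_SRC, {})


def time_threads(num_workers):
    """Wall-clock seconds for num_workers threading.Thread workers in one process."""
    threads = [threading.Thread(target=busy_work) for _ in range(num_workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


def time_processes(num_workers):
    """Wall-clock seconds for num_workers interpreters started at the same time."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen([sys.executable, "-c", WORK_SRC])
        for _ in range(num_workers)
    ]
    exit_codes = [p.wait() for p in procs]
    elapsed = time.perf_counter() - start
    if any(exit_codes):
        raise RuntimeError("a child interpreter exited with an error")
    return elapsed


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "threads:", time_threads(n), "s; processes:", time_processes(n), "s")
```

Comparing how each column grows with n on your own machine shows whether the slowdown comes from running in threads inside one interpreter rather than from the workload itself.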

I found out that it might be a problem with Python’s threading.Thread rather than with TVM.

Threads in Python aren’t actually executed concurrently due to the GIL. So the reason everything is slower is that you aren’t actually doing anything in parallel. Also, it could be that a lot of your time is actually spent in the Python interpreter instead of executing your model. You should try using the time evaluator (tvm.runtime — tvm 0.8.dev0 documentation) instead of your own Python loop.
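For reference, a minimal sketch of the time evaluator approach, assuming graph_module is the graph_executor.GraphModule from the first post and dev is the device it was created on. The helper name mean_run_time_ms and the default number/repeat values are made up for illustration.

```python
def mean_run_time_ms(graph_module, dev, number=10, repeat=3):
    """Return the mean time of one run() call in milliseconds.

    graph_module: a graph_executor.GraphModule (assumed, as in the first post).
    dev: the TVM device the module was created on.
    """
    # time_evaluator returns a timer for the packed function "run"; the
    # number x repeat loop executes inside the TVM runtime, not in Python.
    timer = graph_module.module.time_evaluator(
        "run", dev, number=number, repeat=repeat
    )
    result = timer()
    # result.mean is the mean time per call in seconds.
    return result.mean * 1e3
```

Usage would look like `print(mean_run_time_ms(module, ctx))`. Because the loop runs inside the runtime, Python interpreter overhead is kept out of the measurement.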

As @tkonolige mentioned, never ever assume that Python’s multi-threading will help in most cases >_<

Got it, thanks! @tkonolige @junrushao