[MultiThread][ThreadPool] Performance degradation when running relay module in multiple threads

MinminSun · July 2, 2021, 7:53am

Device: Skylake 8163 with 48 physical cores
Env Setting: TVM_BIND_THREADS=0 TVM_NUM_THREADS=4
Code Snippet:

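from tvm.contrib import graph_executor

module = graph_executor.GraphModule(lib["default"](ctx))

def thread_run():
    # Every thread calls run() on the same shared module.
    for _ in range(repeats):
        module.run()

# PropagatingThread is not shown in this post; presumably a threading.Thread subclass.
threads = []
for _ in range(num_threads):
    threads.append(PropagatingThread(target=thread_run))
for t in threads:
    t.start()
for t in threads:
    t.join()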

When num_threads=1
4 physical cores are occupied by the TVM thread pool. Each module.run() takes 4 ms.
When num_threads=2
8 physical cores are occupied by the TVM thread pools. Each module.run() takes 8 ms.

Since each thread still occupies 4 physical cores in the 2-threaded run, I expected the performance to match the single-threaded run. But each thread in the 2-threaded run actually reaches only 50% of the single-threaded performance, each thread in the 4-threaded run only 25%, and so on…

Any idea what causes the performance degradation in the multi-threaded run?

jcf94 · July 2, 2021, 8:02am

Tagging a few people in case anyone has experience with this:

@tqchen @junrushao @comaniac @FrozenGene

FrozenGene · July 2, 2021, 8:11am

How about setting the environment variable TVM_THREAD_POOL_SPIN_COUNT=0? Does it improve your case?
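For anyone trying this from inside a script rather than the shell, here is a minimal sketch. It assumes the TVM runtime reads these variables when its thread pool is first created, so setting them before importing tvm is the conservative choice; the values are the ones quoted in this thread.

```python
import os

# Values taken from this thread. They are set before tvm is imported,
# on the assumption that the runtime reads them at thread-pool start-up.
os.environ["TVM_NUM_THREADS"] = "4"
os.environ["TVM_BIND_THREADS"] = "0"
os.environ["TVM_THREAD_POOL_SPIN_COUNT"] = "0"

# import tvm  # import only after the environment is configured
```

Exporting the same variables in the shell before launching Python has the same effect and avoids any import-order concerns.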

MinminSun · July 2, 2021, 8:18am

Thanks. It doesn’t improve things. With TVM_THREAD_POOL_SPIN_COUNT=0, performance drops 10-20% in both the single-threaded and multi-threaded runs, and CPU utilization drops from nearly 100% to nearly 50%, which I think is reasonable.

MinminSun · July 2, 2021, 8:21am

One additional piece of info: in a multi-process run, the performance of module.run() in each process does not drop.

FrozenGene · July 2, 2021, 8:24am

If we change the thread pool to OpenMP, do we have the same problem?

MinminSun · July 2, 2021, 8:38am

Yes, the same problem occurs with OpenMP as the threading backend. We didn’t see much difference between the TVM ThreadPool and OpenMP in such cases.

FrozenGene · July 2, 2021, 8:57am

Do you mean that with Python’s multiprocessing you get the ideal result, but with Python’s multithreading you get a bad result? Or do you mean something else?

MinminSun · July 2, 2021, 9:48am

By “multi-process” I mean running the Python script twice simultaneously, while “multi-thread” is implemented with threading.Thread inside the Python script.
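The two setups can be compared side by side with the standard library alone. This is only a sketch: busy_work is a hypothetical pure-Python stand-in for module.run(), so it always holds the GIL, which a native call into TVM may or may not do. The helper names (busy_work, worker, wall_time) are made up for illustration.

```python
import multiprocessing
import threading
import time


def busy_work(n=200_000):
    """Hypothetical CPU-bound stand-in for module.run() (pure Python, holds the GIL)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def worker(repeats):
    # Mirrors the loop in the original post: call the workload `repeats` times.
    for _ in range(repeats):
        busy_work()


def wall_time(worker_cls, num_workers, repeats=20):
    """Start `num_workers` threads or processes and return the total wall time in seconds."""
    workers = [worker_cls(target=worker, args=(repeats,)) for _ in range(num_workers)]
    start = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (1, 2, 4):
        t_threads = wall_time(threading.Thread, n)
        t_procs = wall_time(multiprocessing.Process, n)
        print(f"{n} workers: threads {t_threads:.2f}s, processes {t_procs:.2f}s")
```

With a pure-Python workload like this one, the threaded wall time should grow roughly in proportion to the number of workers, while the process version stays close to flat as long as free cores are available.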

MinminSun · July 5, 2021, 9:20am

I found out that it might be a problem with Python’s threading.Thread rather than with TVM.

tkonolige · July 6, 2021, 7:55pm

Threads in Python aren’t actually executed concurrently due to the GIL. So the reason everything is slower is that you aren’t actually doing anything in parallel. Also, it could be that a lot of your time is actually spent in the Python interpreter instead of executing your model. You should try using the time evaluator (tvm.runtime — tvm 0.8.dev0 documentation) instead of your own Python loop.
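A minimal sketch of the time evaluator approach, assuming graph_module is a graph_executor.GraphModule like the one in the original post and dev is the device it was created on. The helper name mean_run_ms and the default number/repeat values are made up for illustration.

```python
def mean_run_ms(graph_module, dev, number=100, repeat=3):
    """Return the mean latency of one run() call in milliseconds.

    time_evaluator runs the timing loop inside the TVM runtime, so the
    Python interpreter is not part of the measured loop.
    """
    # GraphModule.module is the underlying runtime module; "run" is the
    # packed function that executes the graph once.
    timer = graph_module.module.time_evaluator(
        "run", dev, number=number, repeat=repeat
    )
    result = timer()  # result.mean is reported in seconds
    return result.mean * 1e3
```

Calling mean_run_ms(module, ctx) with the names from the original post would give a per-call latency that is directly comparable with the 4 ms and 8 ms figures measured by the hand-written loop.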

junrushao · July 8, 2021, 10:14pm

As @tkonolige mentioned, never ever assume Python’s multi-threading will help in most cases >_<

MinminSun · July 10, 2021, 1:31pm

Got it, thanks! @tkonolige @junrushao