Hi, I am having a problem with thread pool performance on Windows. My machine is a dual-socket Xeon X5650 (12 cores, 12 threads).
This PR by @yidawang is supposed to improve multi-threading performance. However, I found that although the PR brought CPU usage to 100%, the overall inference time is no faster than with the original thread pool. CPU usage with the original thread pool is around 75%.
Here is the VTune result with the original thread pool. The CPU utilization is not ideal, but there seems to be no overhead from multi-threading.
Here is the VTune result with the thread pool from the latest TVM. Although the CPU utilization is much better, the elapsed time is almost identical to the result above, and VTune reports high multi-threading overhead.
I have the following questions:
- Did the PR by @yidawang actually improve end-to-end performance on high-core-count Linux machines?
- Does similar multi-threading overhead exist on high-core-count Linux machines?
- Can we replace TVM’s thread pool implementation with Microsoft’s concurrency::parallel_for(), which is readily available on Windows?
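
To make the third question concrete, a minimal sketch of the idea might look like the following. It is only an illustration under assumptions: `ParallelLaunch` and `ScaleInPlace` are hypothetical names, not TVM APIs, and the sketch does not show how TVM's actual thread pool is structured. The MSVC branch uses `concurrency::parallel_for` from `<ppl.h>`. The other branch is a plain serial loop that exists only so the code still compiles off Windows.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

#ifdef _MSC_VER
#include <ppl.h>  // Microsoft Parallel Patterns Library (MSVC only)
#endif

// Hypothetical helper (not a TVM API): calls task(i) for every i in [0, num_tasks).
inline void ParallelLaunch(int num_tasks, const std::function<void(int)>& task) {
  assert(num_tasks >= 0);
#ifdef _MSC_VER
  // PPL distributes the iterations over the Concurrency Runtime's worker threads.
  concurrency::parallel_for(0, num_tasks, [&](int i) { task(i); });
#else
  // Serial stand-in so the sketch compiles on non-MSVC toolchains.
  for (int i = 0; i < num_tasks; ++i) task(i);
#endif
}

// Hypothetical workload: scale a buffer, split into num_tasks contiguous chunks.
// Chunks do not overlap, so tasks never write to the same element.
inline void ScaleInPlace(std::vector<float>& data, float factor, int num_tasks) {
  const std::size_t n = data.size();
  const std::size_t parts = static_cast<std::size_t>(num_tasks);
  ParallelLaunch(num_tasks, [&](int task_id) {
    const std::size_t id = static_cast<std::size_t>(task_id);
    const std::size_t begin = n * id / parts;
    const std::size_t end = n * (id + 1) / parts;
    for (std::size_t i = begin; i < end; ++i) data[i] *= factor;
  });
}
```

Calling `ScaleInPlace(buffer, 3.0f, 12)` would split the work into 12 chunks, one per core on the machine described above. Whether this would actually reduce the overhead VTune reports is exactly what the question asks, so the sketch makes no performance claim.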