While debugging, I found that when I load a compiled .so module in C++, tvm::runtime::ThreadPool does automatically create num_workers worker threads. However, any given function in the .so is consistently executed by only one of these threads.
Are there any specific passes that make the kernels generated by TVM run in parallel?
For example, in my scenario, a typical compute kernel function is fused_matmul16_add7_multiply7_add10. Can it automatically use multithreading for parallelism?
Thanks!