How to use CPU multithreading for data parallelism?

Through debugging, I found that when I load a compiled .so module in C++, tvm::runtime::ThreadPool does automatically create num_workers worker threads. However, only one of those threads ever seems to execute a given function in the .so.

Are there any specific Passes that can make the kernel generated by TVM run in parallel?

For example, in my scenario a typical compute kernel is fused_matmul16_add7_multiply7_add10. Can TVM automatically parallelize a kernel like this across multiple threads?
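
To make the question concrete, here is a minimal sketch of the kind of parallelization I am hoping to get automatically, done by hand with tir.Schedule on a toy element-wise prim_func (the module, block, and buffer names are illustrative, not my actual fused kernel):

```python
import numpy as np
import tvm
from tvm.script import tir as T


# Toy stand-in for a fused kernel; names and shapes are illustrative only.
@tvm.script.ir_module
class ToyModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, (1024, 1024), dtype="float32")
        B = T.match_buffer(b, (1024, 1024), dtype="float32")
        for i, j in T.grid(1024, 1024):
            with T.block("B"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj] * 2.0


# Mark the outer loop as parallel so codegen dispatches it onto the
# runtime thread pool instead of running everything on a single worker.
sch = tvm.tir.Schedule(ToyModule)
i, j = sch.get_loops(sch.get_block("B"))
sch.parallel(i)

lib = tvm.build(sch.mod, target="llvm")
a = tvm.nd.array(np.random.rand(1024, 1024).astype("float32"))
b = tvm.nd.array(np.zeros((1024, 1024), dtype="float32"))
lib["main"](a, b)
```

With the parallel annotation in place, the generated code calls into the runtime thread pool, which is the multi-threaded behaviour I am looking for.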

Thanks!

I’d like to add that with static_shape_tuning I obtained prim_funcs whose loops carry many T.parallel annotations. This allowed my .so to benefit from multi-threading, and I can easily adjust the number of worker threads via the TVM_NUM_THREADS environment variable.
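
For reference, this is roughly the shape of the tuned output (heavily simplified; the loop extents and names are made up): the outer loop is emitted as T.parallel, and that is what gets distributed over the worker threads controlled by TVM_NUM_THREADS.

```python
import tvm
from tvm.script import tir as T


# Heavily simplified sketch of a tuned prim_func: the outer T.parallel loop
# is what gets dispatched onto the runtime thread pool. The worker count is
# taken from the environment of the process that loads the .so, e.g.
#   TVM_NUM_THREADS=4 ./my_cpp_app
@tvm.script.ir_module
class TunedLike:
    @T.prim_func
    def main(a: T.handle, b: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, (256, 256), dtype="float32")
        B = T.match_buffer(b, (256, 256), dtype="float32")
        for i in T.parallel(256):          # executed by the worker threads
            for j in T.serial(256):
                with T.block("B"):
                    vi, vj = T.axis.remap("SS", [i, j])
                    B[vi, vj] = A[vi, vj] + 1.0


lib = tvm.build(TunedLike, target="llvm")
```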

However, my real model has dynamic input shapes, so it cannot use static_shape_tuning directly. How can I enable multi-threading for the model in this situation?
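
For concreteness, below is a toy version of the dynamic-shape case I mean, with a symbolic first dimension n, plus the kind of manual scheduling I am wondering about. I don't know whether sch.parallel on a loop with a symbolic extent is supported or recommended, so please treat this as a sketch of the question rather than a working answer:

```python
import tvm
from tvm.script import tir as T


# Toy dynamic-shape kernel: the first dimension n is only known at runtime.
@tvm.script.ir_module
class DynModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        n = T.int32()
        A = T.match_buffer(a, (n, 128), dtype="float32")
        B = T.match_buffer(b, (n, 128), dtype="float32")
        for i, j in T.grid(n, 128):
            with T.block("B"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj] + 1.0


# Tentatively mark the symbolic outer loop as parallel; is this the intended
# way to get multi-threading when static_shape_tuning cannot be used?
sch = tvm.tir.Schedule(DynModule)
i, j = sch.get_loops(sch.get_block("B"))
sch.parallel(i)

lib = tvm.build(sch.mod, target="llvm")
```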