I’m wondering if we can add an option to TVM to enable the per-thread default CUDA stream, available since CUDA 7 (GPU Pro Tip: CUDA 7 Streams Simplify Concurrency | NVIDIA Developer Blog). It would automatically improve throughput when we run multiple TVM threads concurrently in one process (common in cloud deployments), with no behavior change for single-threaded deployment scenarios. If this makes sense, I’ll send out a PR to enable this option.
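For reference, here is a minimal standalone sketch of what the per-thread default stream looks like in plain CUDA, outside of TVM. The file name, the `scale` kernel, and the `worker` helper are made up for illustration; the only CUDA pieces relied on are the `cudaStreamPerThread` handle and the `--default-stream per-thread` NVCC flag described in the blog post. How TVM would expose this as a build option is not shown here.

```cuda
// Assumed build line: nvcc -std=c++14 per_thread_demo.cu -o per_thread_demo
// Adding --default-stream per-thread makes a plain <<<grid, block>>> launch
// use the same per-thread stream that is named explicitly below.
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

// Hypothetical worker: each host thread owns one device buffer and issues
// all of its work on its own per-thread default stream.
void worker(int id, bool* ok) {
  const int n = 1 << 20;
  const float factor = static_cast<float>(id + 2);
  float* d = nullptr;
  *ok = false;
  if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return;

  std::vector<float> h(n, 1.0f);
  cudaMemcpyAsync(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice,
                  cudaStreamPerThread);
  scale<<<(n + 255) / 256, 256, 0, cudaStreamPerThread>>>(d, n, factor);
  cudaMemcpyAsync(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost,
                  cudaStreamPerThread);

  // Waits only for this thread's stream, not for the other threads' work.
  cudaError_t err = cudaStreamSynchronize(cudaStreamPerThread);
  *ok = (err == cudaSuccess) && h[0] == factor && h[n - 1] == factor;
  cudaFree(d);
}

int main() {
  const int kThreads = 3;
  bool ok[kThreads] = {false, false, false};
  std::vector<std::thread> threads;
  for (int t = 0; t < kThreads; ++t) threads.emplace_back(worker, t, &ok[t]);
  for (auto& th : threads) th.join();
  for (int t = 0; t < kThreads; ++t)
    std::printf("thread %d: %s\n", t, ok[t] ? "ok" : "failed");
  return 0;
}
```

One caveat from the blog post: with NVCC, defining `CUDA_API_PER_THREAD_DEFAULT_STREAM` inside the source file has no effect because NVCC includes `cuda_runtime.h` implicitly first, so the macro has to come from the command line (`-DCUDA_API_PER_THREAD_DEFAULT_STREAM`) or the `--default-stream per-thread` flag.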
Are you saying the stream number for the per-thread CUDA default stream is 1? I ran the TVM runtime with 3 concurrent threads and stream-per-thread enabled, and each thread seems to get a different stream number. Or do you have documentation for it?
I still think it would be useful to keep following the legacy NVCC behavior by default, mainly to stay consistent with existing software, since the choice can affect behavior during, say, data exchange.
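To make the data exchange concern concrete, here is a small sketch in plain CUDA (not TVM code; the `produce` and `consume` kernels and the `CHECK` macro are invented for the example). As far as I understand the CUDA stream rules, a stream created with `cudaStreamCreate` implicitly synchronizes with the legacy default stream, but not with a per-thread default stream. So a producer/consumer pair that was implicitly ordered under the legacy behavior needs an explicit event once the producer runs on the per-thread stream.

```cuda
// Assumed build line: nvcc -std=c++14 exchange_demo.cu -o exchange_demo
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
  do {                                                                \
    cudaError_t e = (call);                                           \
    if (e != cudaSuccess) {                                           \
      std::fprintf(stderr, "%s failed: %s\n", #call,                  \
                   cudaGetErrorString(e));                            \
      std::exit(1);                                                   \
    }                                                                 \
  } while (0)

__global__ void produce(int* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = i;
}

__global__ void consume(const int* in, int* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2;
}

int main() {
  const int n = 1 << 20;
  const int block = 256, grid = (n + block - 1) / block;
  int *buf = nullptr, *out = nullptr;
  CHECK(cudaMalloc(&buf, n * sizeof(int)));
  CHECK(cudaMalloc(&out, n * sizeof(int)));

  cudaStream_t other;  // stands in for a stream owned by another library
  CHECK(cudaStreamCreate(&other));
  cudaEvent_t ready;
  CHECK(cudaEventCreateWithFlags(&ready, cudaEventDisableTiming));

  // Producer runs on the per-thread default stream.
  produce<<<grid, block, 0, cudaStreamPerThread>>>(buf, n);
  // Explicit dependency: without these two calls, `consume` could start
  // before `produce` finishes, because `other` does not implicitly wait
  // on the per-thread stream.
  CHECK(cudaEventRecord(ready, cudaStreamPerThread));
  CHECK(cudaStreamWaitEvent(other, ready, 0));
  consume<<<grid, block, 0, other>>>(buf, out, n);
  CHECK(cudaStreamSynchronize(other));

  int sample = 0;
  CHECK(cudaMemcpy(&sample, out + 5, sizeof(int), cudaMemcpyDeviceToHost));
  std::printf("out[5] = %d\n", sample);

  CHECK(cudaEventDestroy(ready));
  CHECK(cudaStreamDestroy(other));
  CHECK(cudaFree(buf));
  CHECK(cudaFree(out));
  return 0;
}
```

If the producer were launched on the legacy default stream instead, the event pair would be redundant, which is exactly the kind of silent difference that existing software may be relying on.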