Hi,
I’m wondering whether we could add an option to TVM to enable the per-thread default stream, which has been available since CUDA 7 (see "GPU Pro Tip: CUDA 7 Streams Simplify Concurrency" on the NVIDIA Developer Blog). It should improve throughput automatically when we run multiple TVM processes (common in cloud deployments), and it would not change behavior for single-process deployments. If this makes sense, I’ll send a PR to add the option.
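For context, here is a minimal standalone sketch of what the CUDA feature does. It is not TVM code: the kernel `scale`, the `worker` function, and the thread count are made up for illustration. The only real switch involved is nvcc's `--default-stream per-thread` flag (or, equivalently, defining `CUDA_API_PER_THREAD_DEFAULT_STREAM` before including any CUDA header). How TVM would expose this as a build option is still open.

```cuda
// Build (assuming nvcc is available):
//   nvcc --default-stream per-thread example.cu -o example
// Without that flag, all host threads share the single legacy default stream.
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to give each thread some GPU work.
__global__ void scale(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

// Each host thread launches on "the default stream" (no stream argument).
// With per-thread default streams, that is a separate stream per host
// thread, so work from different threads may overlap on the GPU.
void worker(int n) {
  float* d = nullptr;
  if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return;
  cudaMemset(d, 0, n * sizeof(float));
  scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
  // Stream 0 means whichever default stream the build selected.
  cudaStreamSynchronize(0);
  cudaFree(d);
}

int main() {
  const int kThreads = 4;   // arbitrary, for illustration
  const int kN = 1 << 20;
  std::vector<std::thread> threads;
  for (int t = 0; t < kThreads; ++t) threads.emplace_back(worker, kN);
  for (auto& th : threads) th.join();
  std::printf("all workers finished\n");
  return 0;
}
```

The source is the same in both modes. Only the compile flag changes which stream the unqualified launches use, which is why the proposal could be an opt-in build option rather than a code change for users.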