Add option to enable CUDA stream per thread in tvm runtime

May I ask if there is a follow-up PR for this feature?

I would also like to ask: what is the easiest way to run two kernels concurrently on a single GPU in TVM?

Thanks a lot!