CUDA streams per thread

I mainly need to run lightweight inference tasks in a multi-threaded environment. Triggered from different CPU threads, I would like to execute these inference tasks concurrently on my CUDA device. CUDA supports so-called per-thread streams (CU_STREAM_PER_THREAD). I tried to configure TVM to use per-thread streams, but as I learned from NVIDIA's CUDA profiling, I failed: all CUDA calls are executed sequentially in a single CUDA stream. This is what I tried at the very beginning of my C++ main function:

    DLDevice tvm_device_id = {kDLCUDA, 0};
    auto tvm_device = tvm::runtime::DeviceAPI::Get(tvm_device_id);
    auto cvThreadDefaultStream = CU_STREAM_PER_THREAD;
    tvm_device->SetStream(tvm_device_id, &cvThreadDefaultStream);

Did I get something wrong? What is the right solution?

Thanks ds

After quite some research, I understood what was missing:

    DLDevice tvm_device_id = {kDLCUDA, 0};
    auto tvm_device = tvm::runtime::DeviceAPI::Get(tvm_device_id);
    cudaSetDevice(tvm_device_id.device_id);
    // m_cudaStream is a cudaStream_t member; the CU_STREAM_PER_THREAD
    // assignment below is immediately overwritten by cudaStreamCreate
    m_cudaStream = CU_STREAM_PER_THREAD;
    cudaStreamCreate(&m_cudaStream);
    //cudaStreamCreateWithFlags(&m_cudaStream, cudaStreamNonBlocking);
    // SetStream takes the stream handle by value, not its address
    tvm_device->SetStream(tvm_device_id, m_cudaStream);

I recommend not using cudaStreamNonBlocking in this context, since I encountered a problem with it that is still open. Please note that the data transfer from CPU to GPU and back still relies on the default CUDA stream; that is another issue.
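To make the per-thread intent explicit, the snippet above can be extended so that each CPU thread creates and registers its own stream before running inference. The following is a minimal, untested sketch of that pattern, not the original poster's code: the worker function, the thread count, and the omitted module invocation are placeholder assumptions, and error checking is left out for brevity.

```cpp
#include <cuda_runtime.h>
#include <tvm/runtime/device_api.h>
#include <thread>
#include <vector>

// Hypothetical worker: each CPU thread creates its own CUDA stream and
// registers it with TVM's DeviceAPI, so kernels launched by TVM from
// this thread are issued into that stream rather than the default one.
void InferenceWorker() {
  DLDevice dev = {kDLCUDA, 0};
  auto* device_api = tvm::runtime::DeviceAPI::Get(dev);

  cudaSetDevice(dev.device_id);
  cudaStream_t stream;
  cudaStreamCreate(&stream);       // one dedicated stream per thread

  // SetStream takes the stream handle by value (TVMStreamHandle is void*)
  device_api->SetStream(dev, stream);

  // ... run the inference task here (module invocation omitted) ...

  cudaStreamSynchronize(stream);   // wait for this thread's work only
  cudaStreamDestroy(stream);
}

int main() {
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i) workers.emplace_back(InferenceWorker);
  for (auto& t : workers) t.join();
  return 0;
}
```

With one stream per thread, kernels from different threads can overlap on the device, which is what the profiler should then show instead of one serialized stream.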