I mainly need to run lightweight inference tasks in a multi-threaded environment: each CPU thread triggers its own inference task, and I would like these tasks to execute concurrently on my CUDA device. CUDA supports so-called per-thread streams, and I tried to set TVM up to use one stream per thread. But as I learned from CUDA profiling, I failed: all CUDA calls are executed sequentially in a single CUDA stream. This is what I tried at the very beginning of my C++ main function:
DLDevice tvm_device_id = {kDLCUDA, 0};
auto tvm_device = tvm::runtime::DeviceAPI::Get(tvm_device_id);
auto cvThreadDefaultStream = CU_STREAM_PER_THREAD;
tvm_device->SetStream(tvm_device_id, &cvThreadDefaultStream);
Did I get something wrong? What is the right solution?
Thanks ds