I’m trying to use CUDA pinned memory, allocated through cudaMallocHost(), with the graph executor in order to limit memory-copy overhead.
I’m allocating the tensors with device type tvm.runtime.Device.kDLCUDAHost and then setting inputs and outputs with GraphModule.set_{in,out}put_zero_copy(). When running the code, I get an assertion error from GraphExecutor::CheckExternalDLTensor(), specifically where the alignment is checked. Even if I get past this assertion, I expect the next few assertions to fail when the device type of the tensor is found not to match the device type of the model.
Is there a recommended way to allow pinned tensors to be used for inputs and outputs? I’m thinking I may modify GraphExecutor::CheckExternalDLTensor() to allow tensors with device type tvm.runtime.Device.kDLCUDAHost regardless of their device id or alignment, as long as the model is on a CUDA device.
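To make the intended condition concrete, here is a hypothetical Python-side mirror of the relaxed check (the function name is mine, not TVM API; the 64-byte constant is my reading of kAllocAlignment in the TVM runtime):

```python
import tvm

# Hypothetical mirror of the relaxed acceptance rule: a kDLCUDAHost tensor is
# accepted for a CUDA model regardless of device id or alignment; anything
# else keeps the current requirements (matching device and a data pointer
# aligned to kAllocAlignment, which is 64 in the TVM runtime).
def would_accept_external(tensor: tvm.nd.NDArray, model_dev: tvm.runtime.Device) -> bool:
    if (tensor.device.device_type == tvm.runtime.Device.kDLCUDAHost
            and model_dev.device_type == tvm.runtime.Device.kDLCUDA):
        return True  # proposed relaxation for pinned host memory
    dl = tensor.handle.contents  # underlying DLTensor (assumes non-null data)
    aligned = (dl.data + dl.byte_offset) % 64 == 0
    same_dev = (tensor.device.device_type == model_dev.device_type
                and tensor.device.device_id == model_dev.device_id)
    return aligned and same_dev  # roughly the current behavior
```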
On the other hand, I thought memory allocated through cudaMallocHost() would necessarily be page-aligned, so I’m not sure why it’s reported as only 16-byte aligned. See the following traceback:
```
Traceback (most recent call last):
  File "/opt/app/benchmark.py", line 69, in <module>
    module.set_input_zero_copy(key='input_tensor:0', value=input)
  File "/usr/local/lib/python3.10/dist-packages/tvm/contrib/graph_executor.py", line 228, in set_input_zero_copy
    self._set_input_zero_copy(key, value)
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 476, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::GraphExecutor::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#2}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  1: tvm::runtime::GraphExecutor::SetInputZeroCopy(int, DLTensor*)
  0: tvm::runtime::GraphExecutor::CheckExternalDLTensor(DLTensor const*, unsigned int) const
  File "/opt/tvm/src/runtime/graph_executor/graph_executor.cc", line 194
InternalError: Check failed: reinterpret_cast<size_t>(static_cast<char*>(external->data) + external->byte_offset) % kAllocAlignment == 0 (16 vs. 0) :
```
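As a sanity check on the alignment question, the raw data pointer can be inspected from Python (continuing the sketch above; kAllocAlignment is 64 in the TVM runtime, and cudaMallocHost() pointers are normally page-aligned):

```python
# Inspect the data pointer of the pinned array from the earlier sketch.
# A buffer starting exactly where cudaMallocHost() placed it should satisfy
# both checks; a non-zero remainder against 64 is what trips the assertion
# in CheckExternalDLTensor().
dl = input_arr.handle.contents
addr = dl.data + dl.byte_offset
print("addr % 64   =", addr % 64)    # kAllocAlignment in the TVM runtime
print("addr % 4096 =", addr % 4096)  # typical page alignment
```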