Graph Executor Zero Copy with CUDA Pinned Memory

I’m trying to use CUDA pinned memory allocated through cuMemAllocHost() with the graph executor, to reduce memory copy overhead.

I’m allocating the tensors with device type tvm.runtime.Device.kDLCUDAHost and then setting inputs and outputs with GraphModule.set_{in,out}put_zero_copy(). When I run the code, I get an assertion error from GraphExecutor::CheckExternalDLTensor(), specifically from the alignment check. Even if I got past this assertion, I expect the next few checks to fail, since the tensor’s device type won’t match the model’s device type.
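
For concreteness, here’s roughly what the failing setup looks like (a minimal sketch; the module path, input name, and shapes are placeholders standing in for my actual model):

```python
import numpy as np
import tvm
from tvm.contrib import graph_executor

# Load a model compiled for CUDA (path and input name are placeholders).
lib = tvm.runtime.load_module("model.so")
dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))

# Allocate the input in pinned host memory (device type kDLCUDAHost);
# TVM's CUDA device API backs this allocation with cudaMallocHost().
pinned_dev = tvm.device(tvm.runtime.Device.kDLCUDAHost, 0)
pinned_in = tvm.nd.empty((1, 224, 224, 3), "float32", pinned_dev)
pinned_in.copyfrom(np.zeros((1, 224, 224, 3), dtype="float32"))

# This call triggers the assertion in CheckExternalDLTensor().
module.set_input_zero_copy("input_tensor:0", pinned_in)
```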

Is there a recommended way to allow pinned tensors to be used for inputs and outputs? I’m thinking I may modify GraphExecutor::CheckExternalDLTensor() to accept tensors with device type tvm.runtime.Device.kDLCUDAHost regardless of their device id or alignment, as long as the model runs on a CUDA device.
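
In Python terms, the relaxation I have in mind would behave something like this (an illustration only, not TVM code; the real change would go into the C++ CheckExternalDLTensor(), and the constants mirror dlpack.h and TVM’s kAllocAlignment):

```python
KDL_CUDA = 2            # kDLCUDA in dlpack.h
KDL_CUDA_HOST = 3       # kDLCUDAHost in dlpack.h
K_ALLOC_ALIGNMENT = 64  # TVM's kAllocAlignment

def external_tensor_ok(tensor_dev_type, data_addr, byte_offset, model_dev_type):
    # Proposed relaxation: pinned host memory is addressable from the GPU,
    # so accept it whenever the model runs on CUDA, ignoring device id
    # and alignment.
    if tensor_dev_type == KDL_CUDA_HOST and model_dev_type == KDL_CUDA:
        return True
    # Otherwise keep the existing behavior: device types must match and the
    # effective address must satisfy the allocator alignment.
    return (tensor_dev_type == model_dev_type
            and (data_addr + byte_offset) % K_ALLOC_ALIGNMENT == 0)
```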

On the other hand, I thought memory allocated through cudaMallocHost() was guaranteed to be page-aligned, so I’m not sure why it shows up as only 16-byte aligned.
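
To see what alignment the buffer actually has, the DLTensor fields can be read through the NDArray’s ctypes handle (a sketch, with pinned_in from the snippet above; 64 is TVM’s kAllocAlignment):

```python
# Effective address that CheckExternalDLTensor() tests: data + byte_offset.
dlt = pinned_in.handle.contents
addr = dlt.data + dlt.byte_offset
for align in (4096, 64, 16):
    print(f"{align}-byte aligned: {addr % align == 0}")
```

Here’s the traceback: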

Traceback (most recent call last):
  File "/opt/app/benchmark.py", line 69, in <module>
    module.set_input_zero_copy(key='input_tensor:0', value=input)
  File "/usr/local/lib/python3.10/dist-packages/tvm/contrib/graph_executor.py", line 228, in set_input_zero_copy
    self._set_input_zero_copy(key, value)
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 476, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::GraphExecutor::GetFunction(tvm::runtime::String const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#2}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  1: tvm::runtime::GraphExecutor::SetInputZeroCopy(int, DLTensor*)
  0: tvm::runtime::GraphExecutor::CheckExternalDLTensor(DLTensor const*, unsigned int) const
  File "/opt/tvm/src/runtime/graph_executor/graph_executor.cc", line 194
InternalError: Check failed: reinterpret_cast<size_t>(static_cast<char*>(external->data) + external->byte_offset) % kAllocAlignment == 0 (16 vs. 0) : 

I think I’ve solved the issue.

Using set_{in,out}put_zero_copy() with pinned host memory requires CUDA Unified Memory (a unified host/device address space). While recent NVIDIA devices support this, TVM doesn’t take advantage of it yet. For my use case I don’t strictly need the zero-copy interfaces; the copy to device memory is acceptable. My code succeeded once I switched to the basic set_input() and get_output() interfaces.
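
For reference, the working version looks roughly like this (same placeholder names as in the sketch above):

```python
# Stage the data through the regular interfaces and accept the extra copies.
module.set_input("input_tensor:0", pinned_in)  # pinned host -> device copy
module.run()
out = module.get_output(0)  # NDArray on the CUDA device
result = out.numpy()        # device -> host copy
```

The pinned buffer isn’t wasted here either: host-to-device copies out of pinned memory are still faster than copies out of ordinary pageable memory.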