[RFC] Direct Memory Access of Data from Backend for iGPU

wkwchau · March 15, 2021, 7:54pm

Based on my understanding of the TVM Runtime, the data of input and parameter nodes is copied from host memory to device memory before inference, and the output data is copied back to host memory afterwards. With an iGPU, such as on AMD APU devices, memory for the CPU and GPU comes from the same pool. It is possible to eliminate the copying by directly accessing the host memory from the iGPU device. To do so, we need to modify the TVM Runtime APIs for setting inputs and params and for getting output data. I am thinking of two options:

1. Define new functions that allow the backend to access the host memory directly, for example:

graph_runtime.GraphModule.set_hostptr_input(key=None, value=None, **params)
graph_runtime.GraphModule.get_hostptr_output(index, out=None)

2. Modify the existing set_input() and get_output() APIs by adding a “mem” parameter that takes the value “host” or “device”. With the default value of “device”, no change is required for any existing application.

graph_runtime.GraphModule.set_input(key=None, value=None, mem="device", **params)
graph_runtime.GraphModule.get_output(index, mem="device", out=None)

The current approach of copying data to device memory has call-by-value semantics, i.e. backend operations have no impact on the original data. In contrast, direct access to host memory from the backend has call-by-reference semantics: any modification of the data in the backend can be seen by the frontend.
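The difference between the two semantics can be illustrated without TVM at all. The following is a minimal sketch using only the Python standard library; `backend_scale` is a hypothetical stand-in for a backend operation, not a TVM function, and `memoryview` merely plays the role of a shared host pointer.

```python
from array import array


def backend_scale(buf, factor):
    """Hypothetical backend op: scales every element of buf in place."""
    for i in range(len(buf)):
        buf[i] = buf[i] * factor


host_data = array("f", [1.0, 2.0, 3.0])

# Copy path (call-by-value): the backend works on its own copy,
# so the frontend's data is untouched.
device_copy = array("f", host_data)
backend_scale(device_copy, 2.0)
copy_path_result = host_data.tolist()

# Direct-access path (call-by-reference): the memoryview shares the
# buffer with host_data, so the backend's writes are visible to the frontend.
host_view = memoryview(host_data)
backend_scale(host_view, 2.0)
direct_path_result = host_data.tolist()
host_view.release()
```

After the copy path `host_data` is unchanged, while after the direct-access path it holds the scaled values. This is the side effect an application would need to be aware of when opting in to host-memory access.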

Since only systems with unified memory can benefit from the host memory approach, and the data could potentially be modified by backend operations, it may make sense to have special APIs that allow direct memory access to the data. However, adding new APIs to the TVM Runtime for passing data to the backend may not be desirable. I am seeking comments and suggestions on how to proceed to support direct access to host memory from the backend.

BTW, please let me know if there is any on-going effort already working on this area.

tqchen · March 15, 2021, 10:40pm

Thanks @wkwchau! Unified memory is indeed something that can be quite interesting. It might be useful to discuss this at the level of the base runtime (NDArray).

In particular, we could think about introducing a runtime API such as TVMDeviceGetHostPtr that returns the corresponding host pointer for a given device pointer (if available), and then build an NDArray that is a host “view” of the device array.

In this way we do not need to change the graph runtime API. It might also be useful to think about possible implications for heterogeneous execution, where some of the ops are placed on the device and others on the host. We might also need to think about the case where cache memory is not coherent, in which case we would need explicit read/write barriers (or cache flush/invalidation operations) for writes on the CPU to be seen on the GPU side.

wkwchau · March 16, 2021, 6:47pm

@tqchen Thanks for your comment.

The reason I suggested changes to the graph runtime API is that the data transfers are done in set_input() and get_output(). Introducing a new runtime API makes sense. Note that we have to find a way to tell the graph runtime to use host memory pointers instead of copying data to device memory for the input/param nodes, and to allocate host memory for the output nodes. Do you think adding an attribute, e.g. use_host_ptr, to the runtime TVMContext is a viable solution?

tqchen · March 16, 2021, 6:57pm

Assuming a coherent cache setting, I can imagine we have the following API:


data_on_device : tvm.NDArray = grt.get_input("data")
# another array where the pointer is the host version of the ptr, API is tentative
data_host_view = data_on_device.get_host_view()
assert data_host_view.device == cpu(0)
# we can then write directly to the data inside the host view if needed
# we can further export to dlpack and import it as another array type that
# supports dlpack exchange
# This is a cpu memcpy
input_data.copyto(data_host_view)

I am not too sure about the possible gains we can get through this kind of API in the input-only setting, though. It could save one memory copy.

wkwchau · March 16, 2021, 9:43pm

@tqchen I am new to TVM and not sure I understand your comment correctly.

Do you mean to use data_on_device.get_host_view() to get the host pointer associated with the “data” object in device memory? If so, the input data has already been copied to device memory, which is what I am trying to avoid. I am thinking of a way to pass the host pointers of input/param nodes to the device so that it can directly access the host memory without any copying. I know that this can be done in the Vulkan backend by using the VK_EXT_external_memory_host extension; I am not sure whether it is possible for other backends.

Please correct me if my interpretation is incorrect.

tqchen · March 16, 2021, 11:43pm

Thanks @wkwchau, I get what you mean.

The above example would allow us to write directly into data_host_view's memory, and the writes would be reflected in the graph runtime’s input. We can then, for example, run

# writes to CPU memory
data_host_view[0] = some_value

or call

some_opaque_host_func(host_ptr=data_host_view.data)

We can also directly get the output as a host array, then perform follow-up reads.

The current graph runtime interface is indeed designed in a way such that a copy or a direct write into the input data is preferred. However, for most models this should be OK, as the one-time copy cost is not that expensive.

Host-device unified memory would bring more benefit if there are frequent CPU-GPU interactions, e.g. in the middle of the graph, invoking some ops on the GPU and then immediately feeding the outcome of that op into a CPU op (without a copy). The HostPtr mechanism should enable that.

wkwchau · March 17, 2021, 10:21pm

@tqchen Thanks for the clarification. I agree with you that the host pointer mechanism could provide more benefit for model execution on heterogeneous devices. This is a strategy we are interested in as well.

Because our time budget is tight, we would like to save as much time as possible. Based on a preliminary study of our application, we can save about 1-2 ms and 7+ ms of data transfer time by avoiding the copying for the input and output nodes, respectively.

Let me explain the use case that drove us to look into the direct memory access approach. We have an application that owns the input and output data buffers, which reside in host memory, and we would like to pass them to TVM for inference. To avoid data transfer, we are looking for a way to directly access the input/output data buffers in the backend, i.e. the backend takes the pointers to the pre-allocated host memory and uses them for the input/output nodes. Since the current graph runtime interface does not support this, we have to make some changes. The idea in my first post is based on this usage.
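To make the use case concrete, here is a minimal sketch of the ownership pattern described above, written with only the Python standard library. `ToyBackend`, `bind`, and `run` are hypothetical names invented for illustration and are not part of TVM; the "inference" is a placeholder that adds one to each element.

```python
from array import array


class ToyBackend:
    """Hypothetical backend standing in for a device that shares host memory."""

    def __init__(self):
        self.input_buf = None
        self.output_buf = None

    def bind(self, input_buf, output_buf):
        # Keep references to the caller's buffers instead of copying them.
        if len(input_buf) != len(output_buf):
            raise ValueError("input and output sizes must match")
        self.input_buf = memoryview(input_buf)
        self.output_buf = memoryview(output_buf)

    def run(self):
        # Placeholder "inference": read from the application's input buffer
        # and write results straight into the application's output buffer.
        for i, x in enumerate(self.input_buf):
            self.output_buf[i] = x + 1.0


# The application owns both buffers and binds them once.
app_input = array("f", [0.5, 1.5])
app_output = array("f", [0.0, 0.0])
backend = ToyBackend()
backend.bind(app_input, app_output)

backend.run()
first_result = app_output.tolist()

# A later frame only updates the input in place; no re-binding or copy.
app_input[0] = 10.0
backend.run()
second_result = app_output.tolist()
```

The point of the pattern is that, after the one-time bind, each inference reads and writes the application's own buffers, so the per-frame input and output copies disappear.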

tqchen · March 17, 2021, 11:33pm

Got it. We already have set_input_zero_copy:

github.com/apache/tvm/blob/main/src/runtime/graph/graph_runtime.cc#L118


void GraphRuntime::SetInput(int index, DLTensor* data_in) {
  ICHECK_LT(static_cast<size_t>(index), input_nodes_.size());
  uint32_t eid = this->entry_id(input_nodes_[index], 0);
  data_entry_[eid].CopyFrom(data_in);
}
/*!
 * \brief set index-th input to the graph without copying the data.
 * \param index The input index.
 * \param data_ref The input data that is referred.
 */
void GraphRuntime::SetInputZeroCopy(int index, DLTensor* data_ref) {
  ICHECK_LT(static_cast<size_t>(index), input_nodes_.size());
  uint32_t eid = this->entry_id(input_nodes_[index], 0);
  const DLTensor* old_t = data_entry_[eid].operator->();
  // check the consistency of input
  ICHECK_EQ(data_alignment_[eid], details::GetDataAlignment(*data_ref));
  ICHECK_EQ(reinterpret_cast<size_t>(data_ref->data) % kAllocAlignment, 0);
  ICHECK_EQ(old_t->ndim, static_cast<size_t>(data_ref->ndim));
  ICHECK_EQ(old_t->ctx.device_type, data_ref->ctx.device_type);
  ICHECK_EQ(old_t->ctx.device_id, data_ref->ctx.device_id);

That means that if DeviceAPI had an API that can turn a host memory pointer into a device memory pointer, we should be able to make use of it. So perhaps introduce a TVMDevicePtrFromHost function for the devices that support it. We can then construct a host-side DLTensor, turn it into a device one, and pass it to the graph runtime using the zero-copy API.
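The flow suggested here can be modelled in a few lines. This is only a toy sketch under stated assumptions: `ToyTensor` stands in for a DLTensor, `device_ptr_from_host` is a hypothetical analogue of the proposed TVMDevicePtrFromHost (which does not exist yet), the alignment constant is an assumed value, and the checks loosely mirror the consistency checks in the SetInputZeroCopy excerpt rather than reproducing them.

```python
from dataclasses import dataclass

ALIGNMENT = 64  # assumed value, for illustration only


@dataclass
class ToyTensor:
    address: int      # stand-in for the DLTensor data pointer
    ndim: int
    device_type: str
    device_id: int


def device_ptr_from_host(host, device_type, device_id):
    """Hypothetical analogue of the proposed TVMDevicePtrFromHost.

    Assumes the simplest unified-memory case: the address is reused
    and only the device tag changes.
    """
    return ToyTensor(host.address, host.ndim, device_type, device_id)


def set_input_zero_copy(entries, index, ref):
    """Rebind an input entry to ref without copying, after consistency checks."""
    old = entries[index]
    if ref.address % ALIGNMENT != 0:
        raise ValueError("data pointer is not aligned")
    if old.ndim != ref.ndim:
        raise ValueError("ndim mismatch")
    if (old.device_type, old.device_id) != (ref.device_type, ref.device_id):
        raise ValueError("device mismatch")
    entries[index] = ref


# The runtime's existing input entry lives on the GPU.
entries = [ToyTensor(0x1000, 2, "gpu", 0)]

# The application's buffer is a host tensor; convert it, then bind it.
host_tensor = ToyTensor(0x2000, 2, "cpu", 0)
device_tensor = device_ptr_from_host(host_tensor, "gpu", 0)
set_input_zero_copy(entries, 0, device_tensor)
```

Passing `host_tensor` directly would fail the device check, which is why a conversion step of some kind is needed before the zero-copy call.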

wkwchau · March 22, 2021, 8:51pm

Thanks for the pointer. I came across SetInputZeroCopy before, but did not know how to use it. I am going to think about how to follow your suggestion with a new API in DeviceAPI and avoid modifying the graph runtime APIs.