Based on my understanding of the TVM Runtime, input and parameter data are copied from host memory to device memory before inference, and output data are copied back to host memory afterwards. On iGPU devices, such as AMD APUs, the CPU and GPU share the same memory pool, so it should be possible to eliminate these copies by letting the iGPU access host memory directly. To do so, we would need to modify the TVM Runtime APIs for setting inputs and params and for getting output data. I am considering two options:
- Define new functions that allow the backend to access host memory directly, for example:
  graph_runtime.GraphModule.set_hostptr_input(key=None, value=None, **params)
  graph_runtime.GraphModule.get_hostptr_output(index, out=None)
- Modify the existing set_input() and get_output() APIs by adding a "mem" parameter that takes the value "host" or "device". With the default value of "device", no change is required for existing applications.
  graph_runtime.GraphModule.set_input(key=None, value=None, mem="device", **params)
  graph_runtime.GraphModule.get_output(index, mem="device", out=None)
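To make the second option more concrete, here is a minimal sketch of how a "mem" parameter could switch between copying and sharing. It is not TVM code: `ToyGraphModule` and `get_input` are hypothetical stand-ins written with NumPy, and a NumPy copy only simulates the host-to-device transfer.

```python
import numpy as np


class ToyGraphModule:
    """Hypothetical stand-in for graph_runtime.GraphModule (not TVM code)."""

    def __init__(self):
        self._inputs = {}

    def set_input(self, key=None, value=None, mem="device", **params):
        if mem not in ("host", "device"):
            raise ValueError("mem must be 'host' or 'device'")
        if key is not None:
            params = {key: value, **params}
        for name, array in params.items():
            if mem == "device":
                # Existing behaviour (simulated): the runtime keeps its own copy.
                self._inputs[name] = np.array(array, copy=True)
            else:
                # Proposed behaviour (simulated): keep a reference to the
                # caller's host buffer instead of copying it.
                self._inputs[name] = np.asarray(array)

    def get_input(self, key):
        # Hypothetical helper, only used here to inspect what was stored.
        return self._inputs[key]


module = ToyGraphModule()
data = np.zeros(4, dtype="float32")
module.set_input("x", data)              # default mem="device": copied
module.set_input("y", data, mem="host")  # shares the caller's buffer
print(np.shares_memory(module.get_input("x"), data))  # False
print(np.shares_memory(module.get_input("y"), data))  # True
```

Because the default stays "device", callers that never pass `mem` keep the copy behaviour, which is the backward-compatibility argument for this option.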
The current approach of copying data to device memory has call-by-value semantics: backend operations have no effect on the original data. In contrast, direct access to host memory from the backend has call-by-reference semantics: any modification the backend makes to the data is visible to the frontend.
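The difference in semantics can be illustrated with plain NumPy buffers. This is only an analogy, assuming that a NumPy copy behaves like the current host-to-device transfer and a NumPy view behaves like a backend that reads and writes host memory directly; no TVM calls are involved.

```python
import numpy as np

host_data = np.ones(3, dtype="float32")

# Copy path (call-by-value): the "backend" works on its own buffer.
device_copy = host_data.copy()
device_copy *= 2
after_copy_path = host_data.tolist()    # [1.0, 1.0, 1.0], original untouched

# Shared path (call-by-reference): the "backend" writes into host memory.
shared = host_data.view()
shared *= 2
after_shared_path = host_data.tolist()  # [2.0, 2.0, 2.0], frontend sees the change
```

With the shared path, a frontend that reuses its input buffer after inference can no longer assume the contents are unchanged.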
Since only systems with unified memory can benefit from the host memory approach, and the data could potentially be modified by backend operations, it may make sense to have dedicated APIs for direct memory access. However, adding new APIs to the TVM Runtime for passing data to the backend may not be desirable. I am seeking comments and suggestions on how to proceed with supporting direct access to host memory from the backend.
BTW, please let me know if there is already an ongoing effort in this area.