CUDA async cudaMemcpyAsync/cudaMallocHost

This is an interesting proposal. We can certainly introduce a cpu_pinned device. Perhaps we can give it its own device type, as in https://github.com/dmlc/dlpack/blob/0acb731e0e43d15deee27b66f10e4c5b4e667913/include/dlpack/dlpack.h#L47, and add a separate DeviceAPI for these types of memory.
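
Roughly, the dispatch could look like the sketch below. This is only an illustration under stated assumptions, not the actual runtime code: the `DeviceType` enum, the `DeviceAPI` method names, `GetDeviceAPI`, and `AllocOn` are all hypothetical, and the numeric device codes are assumed to match the linked DLPack header. The pinned backend uses `std::malloc` as a stand-in so the sketch compiles without CUDA; a real build would call `cudaMallocHost`/`cudaFreeHost` there, which is what lets `cudaMemcpyAsync` overlap with host work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical device codes; values assumed to mirror the linked DLPack enum.
enum DeviceType { kCPU = 1, kGPU = 2, kCPUPinned = 3 };

// Hypothetical per-device allocation interface.
class DeviceAPI {
 public:
  virtual ~DeviceAPI() = default;
  virtual void* AllocDataSpace(std::size_t nbytes) = 0;
  virtual void FreeDataSpace(void* ptr) = 0;
};

// Ordinary pageable host memory.
class CPUDeviceAPI : public DeviceAPI {
 public:
  void* AllocDataSpace(std::size_t nbytes) override { return std::malloc(nbytes); }
  void FreeDataSpace(void* ptr) override { std::free(ptr); }
};

// Page-locked host memory. std::malloc/std::free are stand-ins here;
// with CUDA available these would be cudaMallocHost and cudaFreeHost.
class CPUPinnedDeviceAPI : public DeviceAPI {
 public:
  void* AllocDataSpace(std::size_t nbytes) override { return std::malloc(nbytes); }
  void FreeDataSpace(void* ptr) override { std::free(ptr); }
};

// Look up the backend for a device type; returns nullptr if none is registered.
DeviceAPI* GetDeviceAPI(DeviceType type) {
  static CPUDeviceAPI cpu;
  static CPUPinnedDeviceAPI pinned;
  switch (type) {
    case kCPU: return &cpu;
    case kCPUPinned: return &pinned;
    default: return nullptr;  // e.g. kGPU is not covered by this sketch
  }
}

// Convenience helper: allocate nbytes on the given device type.
void* AllocOn(DeviceType type, std::size_t nbytes) {
  DeviceAPI* api = GetDeviceAPI(type);
  assert(api != nullptr && "no DeviceAPI registered for this device type");
  return api->AllocDataSpace(nbytes);
}
```

With this shape, a caller that wants a staging buffer for async copies would request `kCPUPinned` instead of `kCPU`, and the rest of the runtime would not need to know how the memory was obtained.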