Can BYOC module and GPU module share buffers?

yogeesh · August 29, 2023, 6:40am

Hi, I have a question regarding heterogeneous execution in TVM, mainly focused on BYOC experimentation.
Let's assume I have a BYOC module that will be executed on an accelerator residing on the GPU itself, which means the accelerator can use GPU memory.
As far as I can tell, the BYOC module is executed from the host side, so all inputs and outputs are transferred back to the host once the module has executed. But consider a scenario like this:

GPU module ----> BYOC module

Currently, the output of the GPU module is first copied to the host, and the BYOC module then uses this host buffer. Since the accelerator is part of the GPU, the BYOC module could instead use the GPU output directly as its input, avoiding the host/device copies entirely.
After following a lot of the code in TVM, I am still unable to figure out whether this is possible with the current BYOC support, and if it is, how we could achieve it.
@sanirudh @tqchen @comaniac @zhiics any insights on this?
Any help is very much appreciated.

tqchen · August 29, 2023, 1:45pm

Yes, I think so. BYOC effectively offloads to an external packed func, and that function can take an NDArray that is also used by the GPU module.

The latest TVM Unity BYOC under Relax should be able to do this. Also cc @sunggg

yogeesh · August 30, 2023, 6:23am

Hi @tqchen, thanks for your reply and insights on this.
However, I would also like @sunggg and @vinx13 to comment on my understanding below and on whether we can achieve this in TVM Unity BYOC.

According to my current understanding, all inputs and outputs of a BYOC module are picked up from the host side, even if the original input and output arguments reside on the GPU on the Relay side of the code. Please correct me if I am wrong. Even after a very thorough walk through the code, though, I wasn't able to figure out where this device-to-host copy of the BYOC inputs/outputs happens, because I couldn't find any internal cross-device copy calls.
My primary goal is to figure out how to optimize memory transactions in my BYOC module.
Right now we need to copy the BYOC input arguments from the host to the custom hardware's memory (which is the same as GPU memory) and then transfer the output back from device to host. This could be avoided if we could use the GPU-resident arguments directly, so I would like to know two things:

Is this achievable in the current TVM Relay BYOC?
As @tqchen mentioned, this is doable in TVM Unity BYOC, so I would like a general direction for achieving it there.

tqchen · August 30, 2023, 1:33pm

The inputs are picked up from the host side, but these variables are likely NDArrays whose data pointers refer to device-side memory. So we are not really copying data through the host.
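
To make the point above concrete, here is a minimal toy sketch in plain Python. It is not TVM code: `ArrayHandle`, `DEVICE_MEMORY`, `device_alloc`, `gpu_module` and `byoc_module` are all hypothetical names made up for illustration, and the dictionary only stands in for GPU memory. It shows the idea that the host passes around a small descriptor (device plus data pointer), while the payload stays where it was allocated.

```python
from dataclasses import dataclass

# Toy stand-in for GPU memory: address -> buffer contents (hypothetical, not TVM).
DEVICE_MEMORY = {}


@dataclass(frozen=True)
class ArrayHandle:
    """Host-side descriptor: it records where the data lives, not the data itself."""
    device: str    # e.g. "cuda:0"
    data_ptr: int  # address inside DEVICE_MEMORY
    shape: tuple


def device_alloc(values, device="cuda:0"):
    """Place values in the toy device memory and return a host-side handle."""
    ptr = 0x1000 + len(DEVICE_MEMORY) * 0x100
    DEVICE_MEMORY[ptr] = list(values)
    return ArrayHandle(device, ptr, (len(values),))


def gpu_module(x):
    """Pretend GPU kernel: doubles each element into a new device buffer."""
    return device_alloc([2 * v for v in DEVICE_MEMORY[x.data_ptr]], x.device)


def byoc_module(x):
    """Pretend external function: reads its input through the device pointer."""
    assert x.device == "cuda:0", "this accelerator shares GPU memory"
    return device_alloc([v + 1 for v in DEVICE_MEMORY[x.data_ptr]], x.device)


inp = device_alloc([1, 2, 3])
mid = gpu_module(inp)   # only a handle comes back to the host
out = byoc_module(mid)  # the same handle is passed on; no host buffer involved
print(out.device, DEVICE_MEMORY[out.data_ptr])  # cuda:0 [3, 5, 7]
```

In this toy model the only things crossing between the two modules are the handles. Whether a real BYOC runtime behaves this way depends on it reading the device and data pointer from the array it receives rather than assuming host memory.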

yogeesh · August 30, 2023, 6:11pm

Hi @tqchen,
thanks for the clarification. I will go through the code flow once again and confirm the above understanding. If I have any doubts, I will post them here.