I have an abstract question about optimizing BYOC-generated C code for a custom codegen.
Essentially, I need to optimize memory buffer management for the custom hardware. However, the current BYOC implementation in TVM (as far as I know) assumes that the custom codegen module runs on the host (which is correct),
but this makes TVM assume that the input params and results also reside on the host.
This assumption leads to unoptimized buffer management if the custom hardware and the GPU (AMD, NVIDIA) use the same GPU memory.
For example, if there is a runtime module with a scenario such as:
byoc_module (HOST) ---> AMD GPU module (GPU) ---> byoc_module (HOST)
then there will be a device_copy call inserted between each byoc_module and the AMD GPU module. Assuming the custom hardware and the GPU share the same memory space, we could effectively avoid:
- Unnecessary device_copy calls from byoc to GPU modules and vice versa.
- Unnecessary host->device copy calls for the BYOC module's input params and device->host copy calls for its result params.
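
To make the cost concrete, here is a toy model in plain Python. It does not use any TVM API; `count_device_copies` and the stage tuples are hypothetical names made up for illustration. It only assumes that a copy is inserted wherever two adjacent modules are placed on different devices, which is my understanding of the behaviour described above, not a statement of how PlanDevices is actually implemented.

```python
from typing import List, Tuple

# A stage is (module name, device the planner believes its tensors live on).
# These names are illustrative only, not TVM identifiers.
Stage = Tuple[str, str]


def count_device_copies(stages: List[Stage]) -> int:
    """Count the copies a naive planner would insert: one for every pair
    of adjacent stages whose assumed devices differ."""
    devices = [device for _, device in stages]
    return sum(1 for a, b in zip(devices, devices[1:]) if a != b)


# Current situation: BYOC modules are assumed to keep their data on the host.
host_placement = [
    ("byoc_module", "HOST"),
    ("amd_gpu_module", "GPU"),
    ("byoc_module", "HOST"),
]

# Desired situation: BYOC buffers are treated as living in GPU memory.
shared_placement = [
    ("byoc_module", "GPU"),
    ("amd_gpu_module", "GPU"),
    ("byoc_module", "GPU"),
]

print(count_device_copies(host_placement))    # 2
print(count_device_copies(shared_placement))  # 0
```

In this model the two copies around the GPU module disappear once the BYOC stages are placed on the same device as the GPU module, which is the effect I am hoping to get from the real device planning.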
So basically, is there a way in TVM to annotate Relay calls for a custom BYOC target so that TVM treats all the function calls in the BYOC module as GPU calls rather than host calls?
This is strictly about buffer management, and I believe it essentially involves the PlanDevices pass.
Thanks