Thank you, I also followed up on the PR.
We cannot do it for all backends, but we can afford to do it for a subset of backends, e.g. CUDA and ROCm, where the data pointer corresponds to the VRAM address. For backends like OpenCL and Metal, data-pointer arithmetic won't work because the pointer does not correspond to an address but to an opaque buffer object on the host.
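For concreteness, here is a minimal sketch of why this works on CUDA/ROCm: a sub-allocation from a pool is just the base device address plus a byte offset, so the result is itself a valid device pointer. Names here are illustrative, not our actual allocator:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative pool: one large device allocation; sub-allocations are
// plain pointer offsets. Valid on CUDA/ROCm because the data pointer
// is a real, byte-addressable device address.
struct DevicePool {
  char* base = nullptr;
  size_t used = 0;
  size_t capacity = 0;

  void Init(size_t bytes) {
    cudaMalloc(reinterpret_cast<void**>(&base), bytes);
    capacity = bytes;
  }

  // Returns base + aligned offset: still a valid device pointer that
  // can be passed to a kernel as-is. (No capacity check, for brevity.)
  void* Alloc(size_t bytes, size_t align = 256) {
    used = (used + align - 1) / align * align;
    void* ptr = base + used;
    used += bytes;
    return ptr;
  }
};
```

With an opaque `cl_mem` or Metal buffer handle, the `base + used` step above has no meaning on the host, which is exactly where this scheme breaks down.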
It is possible to generate kernels that take the offset into account, which requires us to explicitly construct buffers whose elem_offset is an explicit variable instead of zero (the zero is what causes such specialization). But to make that efficient, we also need elem_offset_factor to be a multiple of certain values. Additionally, it does cost us an extra parameter to GPU kernels.
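Roughly, the difference in the generated kernel looks like this (hypothetical kernels, just to illustrate the extra parameter and why an alignment guarantee like elem_offset_factor matters):

```cpp
// Specialized form: elem_offset is baked in as zero, no extra argument.
__global__ void add_one_specialized(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

// Generic form: elem_offset becomes an explicit kernel parameter.
// Note the extra argument and the extra indexing work. If elem_offset
// is known to be a multiple of a fixed factor (elem_offset_factor),
// the compiler can still keep aligned/vectorized access paths.
__global__ void add_one_offset(float* data, int elem_offset, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[elem_offset + i] += 1.0f;
}
```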
As of now, our memory allocator can try to allocate without offset and only enable this behavior for a subset of the backends (CUDA/ROCm). We also do best-effort allocation without offset for backends like Metal and OpenCL by creating multiple buffers.
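A rough sketch of that policy (the helper names and signatures are made up for illustration, not our actual interfaces):

```cpp
#include <cstddef>

// Hypothetical hooks into the two allocation paths.
void* PoolAllocWithOffset(size_t bytes);   // carve from a shared pool
void* AllocDedicatedBuffer(size_t bytes);  // fresh buffer, offset == 0

enum class Backend { kCUDA, kROCm, kOpenCL, kMetal };

// Offset-based reuse only where the data pointer is a real device
// address; otherwise fall back to a dedicated buffer per allocation,
// so every tensor starts at offset zero.
bool SupportsPointerOffset(Backend b) {
  return b == Backend::kCUDA || b == Backend::kROCm;
}

void* AllocTensor(Backend b, size_t bytes) {
  return SupportsPointerOffset(b) ? PoolAllocWithOffset(bytes)
                                  : AllocDedicatedBuffer(bytes);
}
```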
True for the OpenCL backend, where we have a mix of plain memory (clBuffer) and opaque objects (clObject). This offset approach may not generalize here.
We had a similar (not identical) requirement in OpenCL, where reuse requires the creation of a device-specific object that reuses the underlying physical memory with a different spec. In this case we introduced a new device API interface to handle the use case.
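For reference, the closest standard OpenCL mechanism for this kind of reuse is clCreateSubBuffer, which creates a new cl_mem aliasing a region of the parent buffer. Sketch below; the actual device API we added may differ:

```cpp
#include <CL/cl.h>
#include <cstddef>

// Reuse underlying OpenCL memory under a new handle: clCreateSubBuffer
// returns a cl_mem that aliases a region of the parent buffer. The
// region origin must satisfy the device's base address alignment
// (CL_DEVICE_MEM_BASE_ADDR_ALIGN), the same kind of constraint that
// elem_offset_factor expresses on the kernel side.
cl_mem MakeAliasedRegion(cl_mem parent, size_t origin, size_t size) {
  cl_buffer_region region{origin, size};
  cl_int err = CL_SUCCESS;
  cl_mem sub = clCreateSubBuffer(parent, CL_MEM_READ_WRITE,
                                 CL_BUFFER_CREATE_TYPE_REGION,
                                 &region, &err);
  return (err == CL_SUCCESS) ? sub : nullptr;
}
```

This covers plain clBuffer reuse; the opaque clObject case is what pushed us toward the dedicated device API interface instead.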