Thank you, I also followed up on the PR.
We cannot do it for all backends, but we can afford to do it for a subset of backends, e.g. CUDA and ROCm, where the data pointer corresponds to the VRAM address. For backends like OpenCL and Metal, data-pointer arithmetic won't work because the pointer does not correspond to an address but to an opaque buffer object on the host.
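For concreteness, here is a minimal sketch of why this works on CUDA/ROCm: a sub-allocation from a pool is just the base device address plus a byte offset, so the result is itself a valid device pointer. Names here are illustrative, not our actual allocator:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative pool: one large device allocation; sub-allocations are
// plain pointer offsets. Valid on CUDA/ROCm because the data pointer
// is a real, byte-addressable device address.
struct DevicePool {
  char* base = nullptr;
  size_t used = 0;
  size_t capacity = 0;

  void Init(size_t bytes) {
    cudaMalloc(reinterpret_cast<void**>(&base), bytes);
    capacity = bytes;
  }

  // Returns base + aligned offset: still a valid device pointer that
  // can be passed to a kernel as-is. (No capacity check, for brevity.)
  void* Alloc(size_t bytes, size_t align = 256) {
    used = (used + align - 1) / align * align;
    void* ptr = base + used;
    used += bytes;
    return ptr;
  }
};
```

With an opaque `cl_mem` or Metal buffer handle, the `base + used` step above has no meaning on the host, which is exactly where this scheme breaks down.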
It is possible to generate kernels that take the offset into account, which requires us to explicitly construct buffers whose elem_offset is an explicit variable instead of zero (the zero is what causes such specialization). But to make that efficient, we also need elem_offset_factor to be a multiple of certain values. Additionally, it does cost us an extra parameter to GPU kernels.
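Roughly, the difference in the generated kernel looks like this (hypothetical kernels, just to illustrate the extra parameter and why an alignment guarantee like elem_offset_factor matters):

```cpp
// Specialized form: elem_offset is baked in as zero, no extra argument.
__global__ void add_one_specialized(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

// Generic form: elem_offset becomes an explicit kernel parameter.
// Note the extra argument and the extra indexing work. If elem_offset
// is known to be a multiple of a fixed factor (elem_offset_factor),
// the compiler can still keep aligned/vectorized access paths.
__global__ void add_one_offset(float* data, int elem_offset, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[elem_offset + i] += 1.0f;
}
```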
As of now, our memory allocator can try to allocate without offset and only enable this behavior for a subset of the backends (CUDA/ROCm). We also do best-effort allocation without offset for backends like Metal and OpenCL by creating multiple buffers.
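A rough sketch of that policy (the helper names and signatures are made up for illustration, not our actual interfaces):

```cpp
#include <cstddef>

// Hypothetical hooks into the two allocation paths.
void* PoolAllocWithOffset(size_t bytes);   // carve from a shared pool
void* AllocDedicatedBuffer(size_t bytes);  // fresh buffer, offset == 0

enum class Backend { kCUDA, kROCm, kOpenCL, kMetal };

// Offset-based reuse only where the data pointer is a real device
// address; otherwise fall back to a dedicated buffer per allocation,
// so every tensor starts at offset zero.
bool SupportsPointerOffset(Backend b) {
  return b == Backend::kCUDA || b == Backend::kROCm;
}

void* AllocTensor(Backend b, size_t bytes) {
  return SupportsPointerOffset(b) ? PoolAllocWithOffset(bytes)
                                  : AllocDedicatedBuffer(bytes);
}
```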
True for the OpenCL backend, where we have a mix of plain memory (clBuffer) and opaque objects (clObject). This offset approach may not generalize here.
We had a similar (not identical) requirement in OpenCL, where reuse requires the creation of a device-specific object that reuses the underlying physical memory with a different spec. In this case we introduced a new device API interface to handle the use case.
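For reference, the closest standard OpenCL mechanism for this kind of reuse is clCreateSubBuffer, which creates a new cl_mem aliasing a region of the parent buffer. Sketch below; the actual device API we added may differ:

```cpp
#include <CL/cl.h>
#include <cstddef>

// Reuse underlying OpenCL memory under a new handle: clCreateSubBuffer
// returns a cl_mem that aliases a region of the parent buffer. The
// region origin must satisfy the device's base address alignment
// (CL_DEVICE_MEM_BASE_ADDR_ALIGN), the same kind of constraint that
// elem_offset_factor expresses on the kernel side.
cl_mem MakeAliasedRegion(cl_mem parent, size_t origin, size_t size) {
  cl_buffer_region region{origin, size};
  cl_int err = CL_SUCCESS;
  cl_mem sub = clCreateSubBuffer(parent, CL_MEM_READ_WRITE,
                                 CL_BUFFER_CREATE_TYPE_REGION,
                                 &region, &err);
  return (err == CL_SUCCESS) ? sub : nullptr;
}
```

This covers plain clBuffer reuse; the opaque clObject case is what pushed us toward the dedicated device API interface instead.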