RFC: Relax - Static Memory Planning and Discontiguous Tensors

Hi All,

In relax, it is possible to create a storage object and let multiple contiguous tensors (i.e., physically 1D tensors, without axis separators) share that storage object, by specifying offsets into the storage object when allocating the tensors.

This facilitates implementation of static memory planners that completely take over memory management - e.g., allocate one huge memory block (potentially spanning most of the available memory) using R.vm.alloc_storage at the beginning of the program, and subsequently use only R.vm.alloc_tensor with appropriate offsets.

These are some advantages of the above scheme (for cases where tensor sizes are statically known):

  1. It avoids multiple calls to alloc_storage, which can be expensive (e.g., mutex acquisition, OS calls, etc.)
  2. It is a natural scheme to model and manage Tightly Coupled Memories, whose capacity is typically much smaller than DDR's.
  3. Since the scheme is fully aware of how much memory is available, it knows exactly when spills to DDR are needed.

Example of usage - note that even the conflicting tensors ‘a’ and ‘b’ (whose live ranges overlap) are created from the same storage object, just at disjoint memory regions within it:


@R.function
def main(
    x: R.Tensor((4, 64), "int32"),
    y: R.Tensor((4, 64), "int32"),
    z: R.Tensor((4, 64), "int32"),
) -> R.Tensor((4, 64), "int32"):
    cls = Module
    vtcm_obj: R.Object = R.vm.alloc_storage(
        R.shape([4096]), runtime_device_index=0, dtype="uint8", storage_scope="global.vtcm"
    )
    a: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_tensor(
        vtcm_obj, offset=0, shape=R.shape([4, 64]), dtype="int32"
    )
    __: R.Tuple = R.vm.copy_tensor(x, a)
    b: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_tensor(
        vtcm_obj, offset=1024, shape=R.shape([4, 64]), dtype="int32"
    )
    _: R.Tuple = R.vm.copy_tensor(y, b)
    c: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_tensor(
        vtcm_obj, offset=2048, shape=R.shape([4, 64]), dtype="int32"
    )
    ___: R.Tuple = cls.compute_add_in_vtcm(a, b, c)
    _t1: R.Tuple = R.vm.kill_object(a)
    _t2: R.Tuple = R.vm.kill_object(b)
    d: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_tensor(
        vtcm_obj, offset=0, shape=R.shape([4, 64]), dtype="int32"
    )
    ___1: R.Tuple = R.vm.copy_tensor(z, d)
    e: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_tensor(
        vtcm_obj, offset=1024, shape=R.shape([4, 64]), dtype="int32"
    )
    ___2: R.Tuple = cls.compute_mul_in_vtcm(c, d, e)
    return e


However, for allocating discontiguous tensors (e.g., a physically 2D tensor, where the first dimension selects an entry in a pointer table pointing to fixed-size blocks, and the second dimension specifies the offset within a block), we currently need to create multiple storage objects for conflicting tensors.
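To make the layout concrete, here is a minimal sketch (plain Python, not TVM API) of the addressing scheme such a physically 2D tensor implies; the table representation and elem_size parameter are illustrative assumptions:

def element_address(ptr_table, i, j, elem_size):
    # ptr_table[i] is the base address of the i-th fixed-size data block;
    # j * elem_size is the byte offset of the element within that block.
    return ptr_table[i] + j * elem_size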

In order to overcome this limitation and enable better memory planning for discontiguous tensors, we propose adding a new interface to relax (we are focusing on the VM target for now):


def alloc_discontiguous_tensor(
    ptr_table_storage: Expr,
    ptr_table_storage_offset: Union[int, Expr],
    data_storage: Expr,
    data_storage_offsets: Expr,
    shape: Expr,
    dtype: Union[str, Expr],
) -> Call:
    """Construct a Call to allocate a discontiguous tensor. The storage for
    the pointer table and the data are specified separately.

    Parameters
    ----------
    ptr_table_storage : Expr
        The storage for the pointer table of the tensor.

    ptr_table_storage_offset : Union[int, Expr]
        The offset within ptr_table_storage at which to allocate the pointer table.

    data_storage : Expr
        The storage for the data of the tensor.

    data_storage_offsets : Expr
        The offsets from data_storage where the actual data will be stored.
        The number of elements in this list should be shape[0].

    shape : Expr
        The physical shape of the tensor to be allocated (2D).

    dtype : Union[str, Expr]
        The datatype of the tensor to be allocated.

    Returns
    -------
    result : Call
        A relax Call, which gets the allocated tensor.
    """

Example of usage - here we can see that even conflicting tensors a and b are created from the same storage objects (global_obj and vtcm_obj):


@R.function
def main(
    x: R.Tensor((4, 64), "int32"),
    y: R.Tensor((4, 64), "int32"),
    z: R.Tensor((4, 64), "int32"),
) -> R.Tensor((4, 64), "int32"):
    cls = Module_2d
    vtcm_obj: R.Object = R.vm.alloc_storage(
        R.shape([4096]), runtime_device_index=0, dtype="uint8", storage_scope="global.vtcm"
    )
    global_obj: R.Object = R.vm.alloc_storage(
        R.shape([64]), runtime_device_index=0, dtype="uint8", storage_scope="global"
    )
    a: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_discontiguous_tensor(
        global_obj, 0, vtcm_obj, data_storage_offsets=R.shape([768, 256, 2304, 3072]), shape=R.shape([4, 64]), dtype="int32"
    )
    __: R.Tuple = R.vm.copy_tensor(x, a)
    b: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_discontiguous_tensor(
        global_obj, 16, vtcm_obj, data_storage_offsets=R.shape([1536, 1280, 3328, 2560]), shape=R.shape([4, 64]), dtype="int32"
    )
    _: R.Tuple = R.vm.copy_tensor(y, b)
    c: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_discontiguous_tensor(
        global_obj, 32, vtcm_obj, data_storage_offsets=R.shape([512, 0, 2048, 3840]), shape=R.shape([4, 64]), dtype="int32"
    )
    ___: R.Tuple = cls.compute_add_in_vtcm(a, b, c)
    return c
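To make the offsets above concrete: each row of a (4, 64) int32 tensor occupies 64 × 4 = 256 bytes, so every entry in data_storage_offsets names a distinct 256-byte block inside the 4096-byte vtcm_obj. Assuming 4-byte pointer entries, each pointer table occupies 4 × 4 = 16 bytes, which is why the tables for a, b, and c sit at offsets 0, 16, and 32 of global_obj.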

Could you please share your feedback on the above interface? Please let me know if I’m missing something.

Thank you!

CC: @tqchen

I too came across a static memory requirement with Adreno, where there is on-chip memory that can be used before spilling to DDR. A similar scenario was faced in relay (graph memory planner) too, while reusing memory across clBuffers and clImages.

Currently, the StaticPlanBlockMemory pass does all the planning.

To make things generic, we could probably have device-specific planners with a fallback to the default planner. Basically, the storage token generator can be part of device_api, with a fallback to the current default storage token generator.

Here the token may hold device-private data containing offsets, storage specifiers, etc.
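If it helps, a rough sketch of this idea (all names below are hypothetical, not existing TVM API):

class StorageToken:
    def __init__(self, size, dtype, storage_scope, device_data=None):
        self.size = size
        self.dtype = dtype
        self.storage_scope = storage_scope
        # Device-private data, e.g. offsets or storage specifiers.
        self.device_data = device_data

def make_storage_token(device_api, size, dtype, storage_scope):
    # Prefer a token generator supplied by the device API, if any;
    # otherwise fall back to the default generator.
    generator = getattr(device_api, "storage_token_generator", None)
    if generator is not None:
        return generator(size, dtype, storage_scope)
    return StorageToken(size, dtype, storage_scope)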

I think this is related to the heterogeneous execution work that @yongwww is working on.

I think this scenario relates to a single target where tensor memory allocation has a hierarchy (or a priority order used until a level is full). TVM's heterogeneous execution (for example BYOC or Collage) brings in data copies, since mixing different memory objects within one kernel execution is assumed to be impossible - not sure if something has improved in recent times. But in this case (at least for on-chip memory and DDR allocations on Adreno) we hint the device alloc API about placement. The hint may be an outcome of a device-specific storage planner.

Thank you for your helpful inputs, @srkreddy1238; I agree with you that we could have a device-specific storage planner.

Just to clarify, this interface came out of a need to support discontiguous tensors on a single target with different types of memories.

Summarizing Existing Relax Support:

  1. It allows programmers to create storage objects (with mem_scope) - implemented using DeviceAPI services - and to allocate tensors from storage objects.
  2. Programmers can specify an offset into the storage object from which a tensor should be allocated.

With these, we can express memory planning at the relax IR level itself, using regular relax operations (in contrast to a memory planning scheme that uses tokens/private data not visible at the relax IR level).

The proposed interface to allocate discontiguous tensors from storage objects is a natural extension/generalization of this existing support: it allows memory planning for discontiguous tensors to be expressed with regular relax operations as well, and it additionally enables static planning for the pointer tables.

If we want to support n-dimensional physical tensors, we can further generalize the interface so that the second argument is also a vector of offsets, representing the offsets of the pointer tables (e.g., in breadth-first order, starting from the most-indirect level):

vtcm_obj: R.Object = R.vm.alloc_storage(
    R.shape([256]), runtime_device_index=0, dtype="uint8", storage_scope="global.vtcm"
)
global_obj: R.Object = R.vm.alloc_storage(
    R.shape([48]), runtime_device_index=0, dtype="uint8", storage_scope="global"
)
a: R.Tensor([4, 2, 16], dtype="int16") = R.vm.alloc_discontiguous_tensor(
    global_obj, pointer_table_offsets=R.shape([0, 16, 24, 32, 40]),
    data_storage=vtcm_obj, data_storage_offsets=R.shape([0, 32, 64, 96, 128, 160, 192, 224]),
    shape=R.shape([4, 2, 16]), dtype="int16"
)

In the above example, there are 5 pointer tables: one 2-indirect table with 4 entries (table size 16 bytes), four 1-indirect tables with 2 entries each (8 bytes per table), and eight data blocks, each containing 16 "int16" elements - assuming the pointer size (for both types of memory) is 4 bytes.

The data for the discontiguous tensor spans the entire vtcm_obj, and the pointer tables together span the entire global_obj.
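A small sanity check of that arithmetic (plain Python; 4-byte pointers as assumed above):

shape, elem_size, ptr_size = [4, 2, 16], 2, 4

top_table_bytes = shape[0] * ptr_size                # 16 bytes, 1 table
mid_table_bytes = shape[0] * (shape[1] * ptr_size)   # 4 tables * 8 bytes = 32
assert top_table_bytes + mid_table_bytes == 48       # size of global_obj

num_blocks = shape[0] * shape[1]                     # 8 data blocks
block_bytes = shape[2] * elem_size                   # 32 bytes each
assert num_blocks * block_bytes == 256               # size of vtcm_obj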

Thank you!

BTW, mem_scope is now a String within a VirtualDevice. We could probably realize it as an object holding additional attributes. The memory-planning pass can then manipulate this object, and the same object can be used at runtime for allocating the storage objects and tensors.
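One possible shape for such an object, as a sketch (hypothetical, not existing API):

class MemScope:
    # A structured replacement for the plain mem_scope string: the
    # memory-planning pass can fill in the attributes, and the runtime
    # can consult them when allocating storage objects and tensors.
    def __init__(self, scope, attrs=None):
        self.scope = scope        # e.g. "global.vtcm"
        self.attrs = attrs or {}  # e.g. {"offset": 1024, "storage_specifier": None}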