Hi All,
In Relax, it is possible to create a storage object and let multiple contiguous (i.e. physically 1D, without axis separators) tensors share that storage object by specifying an offset into it when each tensor is allocated.
This facilitates static memory planners that completely take over memory management - e.g. allocate one large memory block (potentially spanning most of the available memory) with R.vm.alloc_storage at the beginning of the program, and subsequently use only R.vm.alloc_tensor with appropriate offsets.
These are some advantages of the above scheme (for the cases where tensor sizes are known):
- Avoids multiple calls to alloc_storage, which can be expensive (e.g. mutex acquisition, OS calls, etc.)
- It is a natural scheme for modeling and managing Tightly Coupled Memories (TCMs), whose capacity is typically much smaller than DDR
- Since the scheme is fully aware of how much memory is available, it knows exactly when spills to DDR are needed (see the sketch after this list)
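To make the last point concrete, here is a minimal sketch in plain Python (not part of the proposal; the names plan_offsets and Placement are made up for illustration) of a planner that assigns offsets within a single pre-allocated VTCM storage and marks tensors that must spill to DDR once the capacity is exhausted:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Placement:
    # Offset into the single VTCM storage object, or None if the tensor
    # has to spill to a DDR-backed storage instead.
    vtcm_offset: Optional[int]

def plan_offsets(tensor_sizes: List[int], vtcm_capacity: int) -> List[Placement]:
    # Greedy bump allocation into one VTCM storage; anything that does not
    # fit is spilled. A real planner would also reuse the offsets of dead
    # tensors, as the kill_object example below does.
    placements, cursor = [], 0
    for size in tensor_sizes:
        if cursor + size <= vtcm_capacity:
            placements.append(Placement(vtcm_offset=cursor))
            cursor += size
        else:
            placements.append(Placement(vtcm_offset=None))  # spill to DDR
    return placements

# Five 1024-byte tensors against a 4096-byte VTCM block: the last one spills.
print(plan_offsets([1024] * 5, vtcm_capacity=4096))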
Example of usage - we can see that even conflicting tensors 'a' and 'b' are created from the same storage object, but they are placed in disjoint memory regions of that storage:
@R.function
def main(
    x: R.Tensor((4, 64), "int32"),
    y: R.Tensor((4, 64), "int32"),
    z: R.Tensor((4, 64), "int32"),
) -> R.Tensor((4, 64), "int32"):
    cls = Module
    # A single 4096-byte VTCM storage object backs all tensors below.
    vtcm_obj: R.Object = R.vm.alloc_storage(
        R.shape([4096]), runtime_device_index=0, dtype="uint8", storage_scope="global.vtcm"
    )
    # Each (4, 64) int32 tensor occupies 1024 bytes of the storage.
    a: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_tensor(
        vtcm_obj, offset=0, shape=R.shape([4, 64]), dtype="int32"
    )
    __: R.Tuple = R.vm.copy_tensor(x, a)
    b: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_tensor(
        vtcm_obj, offset=1024, shape=R.shape([4, 64]), dtype="int32"
    )
    _: R.Tuple = R.vm.copy_tensor(y, b)
    c: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_tensor(
        vtcm_obj, offset=2048, shape=R.shape([4, 64]), dtype="int32"
    )
    ___: R.Tuple = cls.compute_add_in_vtcm(a, b, c)
    # 'a' and 'b' are dead past this point, so their offsets can be reused.
    _t1: R.Tuple = R.vm.kill_object(a)
    _t2: R.Tuple = R.vm.kill_object(b)
    d: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_tensor(
        vtcm_obj, offset=0, shape=R.shape([4, 64]), dtype="int32"
    )
    ___1: R.Tuple = R.vm.copy_tensor(z, d)
    e: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_tensor(
        vtcm_obj, offset=1024, shape=R.shape([4, 64]), dtype="int32"
    )
    ___2: R.Tuple = cls.compute_mul_in_vtcm(c, d, e)
    …
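For reference, the offsets in the example above follow directly from the tensor sizes; a quick check in plain Python (not part of the example) confirms that the live tensors never overlap inside the 4096-byte storage:

# Each tensor is (4, 64) int32, i.e. 4 * 64 * 4 = 1024 bytes.
TENSOR_BYTES = 4 * 64 * 4

def disjoint(offsets, size=TENSOR_BYTES):
    spans = sorted((o, o + size) for o in offsets)
    return all(end <= start for (_, end), (start, _) in zip(spans, spans[1:]))

assert disjoint([0, 1024, 2048])      # a, b, c live together
assert disjoint([2048, 0, 1024])      # after the kills: c, d, e
assert 2048 + TENSOR_BYTES <= 4096    # everything fits in vtcm_obj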
However, to allocate discontiguous tensors (e.g. a physically 2D tensor, where the first dimension indexes into a pointer table that points to fixed-size blocks, and the second dimension specifies an offset within a block), we currently need to create multiple storage objects for conflicting tensors.
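To make that layout concrete, here is a small model in plain Python (only for illustration, not TVM code; the pointer table is modeled as a list of block base addresses) of how element (i, j) of such a physically 2D tensor is located:

# Physically 2D layout: row i resolves through the pointer table to the
# base of a fixed-size block; column j is an offset within that block.
def element_address(ptr_table, ptr_table_index, i, j, dtype_bytes):
    block_base = ptr_table[ptr_table_index + i]   # first dimension
    return block_base + j * dtype_bytes           # second dimension

# Example: four 256-byte blocks (64 int32 elements each) scattered in memory.
ptr_table = [0x9000, 0x8000, 0xA000, 0xB000]
assert element_address(ptr_table, 0, i=1, j=3, dtype_bytes=4) == 0x8000 + 12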
To overcome this limitation and enable better memory planning for discontiguous tensors, we propose adding a new interface to Relax (we are focusing on the VM target for now):
def alloc_discontiguous_tensor(
    ptr_table_storage: Expr,
    ptr_table_storage_offset: Union[int, Expr],
    data_storage: Expr,
    data_storage_offsets: Expr,
    shape: Expr,
    dtype: Union[str, Expr],
) -> Call:
    """Construct a Call to allocate a discontiguous tensor. The storage for the
    pointer table and the storage for the data are specified separately.

    Parameters
    ----------
    ptr_table_storage : Expr
        The storage for the pointer table of the tensor.
    ptr_table_storage_offset : Union[int, Expr]
        The offset into ptr_table_storage at which the pointer table is allocated.
    data_storage : Expr
        The storage for the data of the tensor.
    data_storage_offsets : Expr
        The offsets into data_storage where the actual data will be stored.
        The number of elements in this list should be shape[0] (one per block).
    shape : Expr
        The physical shape of the tensor to be allocated (2D).
    dtype : Union[str, Expr]
        The datatype of the tensor to be allocated.

    Returns
    -------
    result : Call
        A relax Call, which gets the allocated tensor.
    """
Example of usage - here we can see that even conflicting tensors 'a' and 'b' are created from the same storage objects (global_obj for the pointer tables and vtcm_obj for the data):
@R.function
def main(
    x: R.Tensor((4, 64), "int32"),
    y: R.Tensor((4, 64), "int32"),
    z: R.Tensor((4, 64), "int32"),
) -> R.Tensor((4, 64), "int32"):
    cls = Module_2d
    # VTCM storage holds the scattered 256-byte data blocks.
    vtcm_obj: R.Object = R.vm.alloc_storage(
        R.shape([4096]), runtime_device_index=0, dtype="uint8", storage_scope="global.vtcm"
    )
    # Regular global storage holds the pointer tables.
    global_obj: R.Object = R.vm.alloc_storage(
        R.shape([64]), runtime_device_index=0, dtype="uint8", storage_scope="global"
    )
    a: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_discontiguous_tensor(
        global_obj, 0, vtcm_obj, data_storage_offsets=R.shape([768, 256, 2304, 3072]), shape=R.shape([4, 64]), dtype="int32"
    )
    __: R.Tuple = R.vm.copy_tensor(x, a)
    b: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_discontiguous_tensor(
        global_obj, 16, vtcm_obj, data_storage_offsets=R.shape([1536, 1280, 3328, 2560]), shape=R.shape([4, 64]), dtype="int32"
    )
    _: R.Tuple = R.vm.copy_tensor(y, b)
    c: R.Tensor([4, 64], dtype="int32") = R.vm.alloc_discontiguous_tensor(
        global_obj, 32, vtcm_obj, data_storage_offsets=R.shape([512, 0, 2048, 3840]), shape=R.shape([4, 64]), dtype="int32"
    )
    ___: R.Tuple = cls.compute_add_in_vtcm(a, b, c)
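For clarity, this is how I read the intended runtime behavior of the new builtin, again as an illustrative plain-Python sketch (the helper name, the 4-byte pointer-table entries, and the base address are assumptions, not the actual VM implementation):

import struct

def build_pointer_table(ptr_table_buf, ptr_table_offset,
                        data_base_addr, data_offsets, entry_bytes=4):
    # One pointer-table entry per row: base address of data_storage plus the
    # per-row byte offset. len(data_offsets) equals the physical shape[0].
    fmt = "<I" if entry_bytes == 4 else "<Q"
    for i, off in enumerate(data_offsets):
        start = ptr_table_offset + i * entry_bytes
        ptr_table_buf[start:start + entry_bytes] = struct.pack(fmt, data_base_addr + off)
    return ptr_table_buf

# Tensor 'a' from the example: pointer table at byte 0 of global_obj,
# rows at VTCM byte offsets [768, 256, 2304, 3072] (hypothetical VTCM base).
table = bytearray(64)
build_pointer_table(table, 0, 0xD8000000, [768, 256, 2304, 3072])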
Could you please share your feedback on the above interface? Please let me know if I’m missing something.
Thank you!
CC: @tqchen