Motivation
In the current TVM system, we categorize the machine’s memory into several types: global, shared, etc. Based on this, we implicitly allocate and use that memory with the scheduling primitives cache_read / cache_write. However, caching still requires explicit data movement at the launch of every kernel, so it can be beneficial to allocate these special memories explicitly at runtime so that they can be reused across operators. Apart from this, TVM assumes that all memory objects can be manipulated with a plain pointer (void*). The main limitations of the current runtime are:
- L0: Only one kind of memory scope is supported (global on-chip DRAM).
- L1: Memory is assumed to be accessed as a flat 1D space, so memory with a special multi-dimensional layout (e.g. texture) cannot be supported.
This RFC lays out the details of how to address these two limitations.
Case Study: OpenCL Texture
Take the texture memory in OpenCL as an example. A texture can be viewed as a three-dimensional addressable memory:
- A[y, x, k], where k’s maximum size is always 4 (RGBA), and y, x correspond to the spatial location
- Most texture memory is cached by tiles (e.g. a sub-matrix at each location will be cached)
To utilize those features, we can remap the memory access as: A[N, C, H, W] = ATexture[N*C/4* H, W, C % 4]. This brings several challenges for the runtime:
- Texture memory uses 2D loads and stores rather than 1D.
- The specific memory mapping pattern can depend on layout choices.
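To make the remapping concrete, here is a minimal sketch of one possible reading of the formula above. It assumes channels are packed in groups of four and that (n, c / 4, h) are flattened into the texture row; the names TexCoord and NCHWToTexture are hypothetical and not part of TVM.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical coordinate triple for a 2D RGBA texture.
struct TexCoord {
  int64_t y;  // row in the texture
  int64_t x;  // column in the texture
  int64_t k;  // RGBA channel, always in [0, 4)
};

// Assumed mapping: channels are packed in groups of 4, and
// (n, c / 4, h) are flattened into the row index. C and H are the
// full channel count and height of the logical NCHW tensor.
inline TexCoord NCHWToTexture(int64_t n, int64_t c, int64_t h, int64_t w,
                              int64_t C, int64_t H) {
  int64_t c_blocks = (C + 3) / 4;  // number of 4-channel groups, rounded up
  return TexCoord{(n * c_blocks + c / 4) * H + h, w, c % 4};
}
```

Under this assumption, a different layout choice (for example NHWC) would change only the row and column expressions, which is why the mapping cannot be hard-coded in the runtime.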
This RFC aims to explain how to better support such memory in TVM, which involves API changes in the runtime.
Summary of Key Solutions
S0: Introduce Data Object to Support Memory Tag
To support multiple kinds of memory (L0), we can embed the scope information into a Data object like:
enum MemScope {...};
struct Data {
  void* data;           // underlying device handle or pointer
  MemScope mem_scope;   // tag describing which kind of memory this is
};
Key Changes:
- Update kernel function launching to set arguments according to the memory tag type
- Update the device API to introduce allocation with a memory scope
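As an illustration of the first key change, the sketch below shows how a launcher might pick an argument kind from the memory tag. The enum members kGlobal and kTexture and the helper DescribeKernelArg are hypothetical, since the RFC leaves the contents of MemScope unspecified.

```cpp
#include <cassert>
#include <string>

// Hypothetical scope values; the RFC does not list the enum members.
enum class MemScope { kGlobal, kTexture };

struct Data {
  void* data;          // underlying device handle or pointer
  MemScope mem_scope;  // tag describing which kind of memory this is
};

// Returns a label for how a kernel launcher could bind this argument.
// A real backend would call its own argument-setting routine here instead.
inline std::string DescribeKernelArg(const Data& d) {
  switch (d.mem_scope) {
    case MemScope::kGlobal:
      return "buffer";
    case MemScope::kTexture:
      return "image2d";
  }
  return "unknown";
}
```

The point of the tag is that the launcher no longer has to assume every argument is a plain buffer pointer.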
S1: Change DeviceAPI to Support Multi-dimensional Allocation and Copy
To support multi-dimensional layout memory (L1), we need to update the underlying runtime APIs with shape information.
Key Changes:
- Update the device API to support multi-dimensional allocation with shape information
- Update the device API and RPC API to support multi-dimensional data copy based on DLTensor
Runtime API Changes
I attach the proposed API changes here, since they need to be discussed and sent as an RFC.
DeviceAPI
DeviceAPI::AllocDataSpace
// Current API
public:
  virtual void* DeviceAPI::AllocDataSpace(TVMContext ctx,
                                          size_t nbytes,
                                          size_t alignment,
                                          DLDataType type_hint) = 0;
// Proposed API
public:
  virtual void* DeviceAPI::AllocDataSpace(TVMContext ctx,
                                          std::vector<int64_t> shape,
                                          DLDataType dtype,
                                          Optional<String> mem_scope = NullOpt) {
    if (!mem_scope || mem_scope.value() == "global") {
      // Forward to the previous nbytes-based implementation.
      return AllocDataSpace(..., nbytes, ...);
    } else {
      // Backends can override this; the default implementation raises an error.
      LOG(FATAL) << ...
    }
  }
protected:
  // The returned void* may point to a backend-specific structure such as Data.
  virtual void* DeviceAPI::AllocDataSpace(TVMContext ctx,
                                          size_t nbytes,
                                          size_t alignment) = 0;
We should change the public API to the new one with the arguments shape, dtype and mem_scope (because the memory allocation may rely on such layout information), and make the previous nbytes-based implementation protected. By default, an allocation with no scope information is forwarded to the previous implementation, and an error is raised if mem_scope is specified but the device lacks such an implementation.
The return value is kept as void*, but it can be a more complicated data structure depending on the device’s implementation. For example, it can be a Data*, as mentioned above.
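The forwarding behaviour described above can be sketched in a self-contained form. This is only an illustration under stated assumptions: DeviceAPISketch, HostDeviceSketch and DataType are hypothetical stand-ins, std::optional<std::string> (C++17) replaces TVM's Optional<String>, the ctx argument is omitted, and an exception replaces LOG(FATAL).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Local stand-in with the same fields as DLPack's DLDataType.
struct DataType {
  uint8_t code;
  uint8_t bits;
  uint16_t lanes;
};

class DeviceAPISketch {
 public:
  virtual ~DeviceAPISketch() = default;

  // Shape-based entry point: forwards "global" (or unset) scopes to the
  // flat allocator and rejects every other scope by default.
  virtual void* AllocDataSpace(const std::vector<int64_t>& shape, DataType dtype,
                               std::optional<std::string> mem_scope = std::nullopt) {
    if (!mem_scope || *mem_scope == "global") {
      size_t nbytes = (dtype.bits * dtype.lanes + 7) / 8;  // bytes per element
      for (int64_t dim : shape) nbytes *= static_cast<size_t>(dim);
      return AllocDataSpace(nbytes, kAlignment);
    }
    // Backends that support special scopes would override this method.
    throw std::runtime_error("unsupported memory scope: " + *mem_scope);
  }

 protected:
  static constexpr size_t kAlignment = 64;  // assumed default alignment
  virtual void* AllocDataSpace(size_t nbytes, size_t alignment) = 0;
};

// Hypothetical host backend that only implements the flat allocator.
class HostDeviceSketch : public DeviceAPISketch {
 protected:
  void* AllocDataSpace(size_t nbytes, size_t /*alignment*/) override {
    return std::malloc(nbytes);  // alignment is ignored in this sketch
  }
};
```

Calling the shape-based method through a DeviceAPISketch reference with no scope allocates from the flat allocator, while passing a scope such as "texture" throws, mirroring the default error path. A backend with texture support would override the public method and could return a pointer to its own Data structure.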
DeviceAPI::CopyDataFromTo
// Current API
virtual void DeviceAPI::CopyDataFromTo(const void* from, size_t from_offset,
void* to, size_t to_offset,
size_t num_bytes, TVMContext ctx_from,
TVMContext ctx_to, DLDataType type_hint,
TVMStreamHandle stream) = 0;
// Proposed API
virtual void DeviceAPI::CopyDataFromTo(DLTensor* from, DLTensor* to,
TVMStreamHandle stream) = 0;
Instead of adding new arguments like shape and mem_scope to the original API, we can use DLTensor* for this job. In addition, we can embed the memory scope information in the DLTensor’s data field.
RPC
In general, we need to change the data transfer functions (CopyFromRemote, CopyToRemote, etc.) from using nbytes to using DLTensor*. For example:
// Current API
void RPCClientSession::CopyToRemote(void* from, size_t from_offset, void* to, size_t to_offset, size_t nbytes,
                                    TVMContext ctx_to, DLDataType type_hint) final;
// Proposed API
void RPCClientSession::CopyToRemote(void* from, size_t nbytes, DLTensor* to) final;
This part requires a lot of work, since many classes are involved in the data transfer procedure. Taking CopyToRemote as an example, the call chain would be:
RPCClientSession::CopyToRemote
-> EndPoint::CopyToRemote
-> EventHandler::HandleCopyToRemote
-> RPCSession::AsyncCopyToRemote
-> RPCSession::CopyToRemote
-> LocalSession::CopyToRemote
-> DeviceAPI::CopyDataFromTo
We need to update all of those functions, but the idea is the same for each. The RPC version also needs to be upgraded.
NDArray
NDArray Creation
// Current API
TVM_DLL static NDArray Empty(std::vector<int64_t> shape, DLDataType dtype,
DLContext ctx);
// Proposed API
TVM_DLL static NDArray Empty(std::vector<int64_t> shape, DLDataType dtype,
DLContext ctx, Optional<String> mem_scope=NullOpt);
With all the backend changes above, we can now create an NDArray with a specific memory scope.
More details can be found in the PoC PR: [Runtime] Special Memory Scope Support (apache/tvm pull request #7488, by ZihengJiang).