Runtime Support for Special Memory Scope

Motivation

In the current TVM system, we categorize the machine’s memory into several types: global, shared, etc. Based on this, we implicitly allocate and use those memories with the scheduling primitives cache_read / cache_write. However, caching still requires explicit data movement at the launch of every kernel, so it can be beneficial to allocate these special memories explicitly at runtime so they can be reused across operators. Apart from this, TVM assumes that all memory objects can be manipulated through a plain pointer (void*). The main limitations of the current runtime are:

  • L0: Only supports one kind of memory scope (global on-chip DRAM).
  • L1: Assumes flat 1D memory access and cannot support memory with special multi-dimensional layouts (e.g. texture).

This RFC lays out the details of how to address these two limitations.

Case Study: OpenCL Texture

Take the texture memory in OpenCL as an example. A texture can be viewed as a three-dimensional addressable memory:

  • A[y, x, k], where k’s maximum size always equals 4 (RGBA), and y, x correspond to the spatial location
  • Most texture memory is cached by tiles (e.g. a sub-matrix at each location will be cached)

To utilize those features, we can re-map the memory access as A[N, C, H, W] = ATexture[N*C/4*H, W, C % 4] (see the sketch after the list below). This brings several challenges for the runtime:

  • Texture memory uses 2D loads and stores rather than 1D.
  • The specific memory mapping pattern can depend on layout choices.
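
For illustration, here is a minimal sketch of one possible interpretation of the index remapping above. The helper name and the exact NCHW-to-texture convention are assumptions for this example, not part of the proposed runtime API.

#include <cstdint>

// Hypothetical helper: map a logical NCHW index to a (row, col, rgba) texture
// coordinate, assuming the layout
//   A[n, c, h, w] = ATexture[(n * C/4 + c/4) * H + h, w, c % 4]
// with C divisible by 4.
struct TexCoord {
  int64_t row;   // first texture dimension
  int64_t col;   // second texture dimension
  int64_t rgba;  // always in [0, 4)
};

inline TexCoord RemapNCHWToTexture(int64_t n, int64_t c, int64_t h, int64_t w,
                                   int64_t C, int64_t H) {
  TexCoord t;
  t.row = (n * (C / 4) + c / 4) * H + h;
  t.col = w;
  t.rgba = c % 4;
  return t;
}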

This RFC aims to explain how to better support such memory in TVM, which involves API changes in runtime.

Summary of Key Solutions

S0: Introduce Data Object to Support Memory Tag

To support multiple kinds of memory (L0), we can embed the scope information into a Data object like:

enum MemScope {...};

struct Data {
  void* data;
  MemScope mem_scope;
};

Key Changes:

  • Update kernel function launching to set arguments according to the memory tag type (see the sketch after this list)
  • Update the device API to introduce allocation with a memory scope
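
Below is a minimal sketch of how the launcher side might dispatch on the memory tag. The enum values and the SetBufferArg / SetImageArg helpers are placeholders for backend-specific argument setters (e.g. setting a buffer vs. an image argument in OpenCL), not part of the proposed API.

#include <stdexcept>

enum MemScope { kGlobal, kTexture };             // values assumed for this sketch
struct Data { void* data; MemScope mem_scope; }; // tagged object from S0

// Hypothetical backend helpers.
void SetBufferArg(int index, void* buf);
void SetImageArg(int index, void* image);

// Launcher-side handling of one kernel argument: dispatch on the memory tag
// instead of assuming every argument is a flat void* buffer.
inline void SetKernelArg(int index, const Data* arg) {
  switch (arg->mem_scope) {
    case kGlobal:  SetBufferArg(index, arg->data); break;
    case kTexture: SetImageArg(index, arg->data);  break;
    default: throw std::runtime_error("unsupported memory scope");
  }
}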

S1: Change DeviceAPI to Support Multi-dimensional Allocation and Copy

To support multi-dimensional layout memory (L1), we need to update the underlying runtime APIs with shape information.

Key Changes:

  • Update the device API to support multi-dimensional allocation with shape information
  • Update the device API and RPC API to support multi-dimensional data copy based on DLTensor

Runtime API Changes

I attach the proposed API changes here since they need to be discussed and submitted as an RFC.

DeviceAPI

DeviceAPI::AllocDataSpace

// Current API
public:
virtual void* DeviceAPI::AllocDataSpace(TVMContext ctx,
                                        size_t nbytes,
                                        size_t alignment,
                                        DLDataType type_hint) = 0;

// Proposed API
public:
virtual void* DeviceAPI::AllocDataSpace(TVMContext ctx,
                                        std::vector<int64_t> shape,
                                        DLDataType dtype,
                                        Optional<String> mem_scope = NullOpt) {
  if (!mem_scope || mem_scope.value() == "global") {
    AllocDataSpace(..., nbytes, ...);
  } else {
    // Can be overridden by backends; the default implementation raises an error.
    LOG(FATAL) << ...
  }
}
protected:
// The return value may actually point to a Data object (see S0).
virtual void* DeviceAPI::AllocDataSpace(TVMContext ctx,
                                        size_t nbytes,
                                        size_t alignment) = 0;

We should change the public API to the new one with arguments shape, dtype, and mem_scope (because the memory allocation may rely on such layout information), and move the previous nbytes-based implementation to protected. By default, an allocation with no scope information is forwarded to the previous implementation, and an error is raised if mem_scope is specified but the device lacks such an implementation.

The return value is kept as void*, but it can be a more complicated data structure depending on the device’s implementation. For example, it can be a Data*, as mentioned above.
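
As an illustration, here is a minimal sketch of how a backend might override the scoped overload. MyDeviceAPI, AllocTexture, kDefaultAlignment, and the "texture" scope string are placeholders for this example, not part of the proposed API; the nbytes computation assumes a compact tensor.

// Sketch of a backend override: the default "global" path is forwarded to the
// old nbytes-based overload; a special scope allocates a tagged Data object.
void* MyDeviceAPI::AllocDataSpace(TVMContext ctx, std::vector<int64_t> shape,
                                  DLDataType dtype,
                                  Optional<String> mem_scope) {
  if (!mem_scope || mem_scope.value() == "global") {
    size_t nbytes = (dtype.bits * dtype.lanes + 7) / 8;
    for (int64_t dim : shape) nbytes *= static_cast<size_t>(dim);
    return AllocDataSpace(ctx, nbytes, kDefaultAlignment);  // protected overload
  }
  if (mem_scope.value() == "texture") {          // hypothetical scope string
    Data* obj = new Data();
    obj->mem_scope = kTexture;
    obj->data = AllocTexture(ctx, shape, dtype); // hypothetical 2D allocation
    return obj;
  }
  LOG(FATAL) << "Device does not support memory scope " << mem_scope.value();
  return nullptr;
}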

DeviceAPI::CopyDataFromTo

// Current API
virtual void DeviceAPI::CopyDataFromTo(const void* from, size_t from_offset,
                                       void* to, size_t to_offset,
                                       size_t num_bytes, TVMContext ctx_from,
                                       TVMContext ctx_to, DLDataType type_hint,
                                       TVMStreamHandle stream) = 0;

// Proposed API
virtual void DeviceAPI::CopyDataFromTo(DLTensor* from, DLTensor* to,
                                       TVMStreamHandle stream) = 0;

Instead of adding new arguments like shape and mem_scope to the original API, we can use DLTensor* for this job. Likewise, we can embed the memory scope information in the DLTensor’s data field.
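
For plain global tensors, a backend could implement the new signature by forwarding to its existing flat copy routine. Below is a minimal sketch of that idea; it assumes compact (non-strided) tensors and uses the TVMContext-era DLTensor field names that appear elsewhere in this RFC.

// Sketch: when both tensors are compact 1D "global" buffers, fall back to the
// existing byte-wise copy. Special scopes (e.g. texture) would be handled by
// backend-specific overrides instead.
void MyDeviceAPI::CopyDataFromTo(DLTensor* from, DLTensor* to,
                                 TVMStreamHandle stream) {
  size_t nbytes = (from->dtype.bits * from->dtype.lanes + 7) / 8;
  for (int i = 0; i < from->ndim; ++i) {
    nbytes *= static_cast<size_t>(from->shape[i]);
  }
  // Forward to the existing flat copy implementation.
  CopyDataFromTo(from->data, from->byte_offset, to->data, to->byte_offset,
                 nbytes, from->ctx, to->ctx, from->dtype, stream);
}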

RPC

In general, we need to change the data transfer functions (CopyFromRemote, CopyToRemote, etc.) from using nbytes to using DLTensor*. For example:

// Current API
void RPCClientSession::CopyToRemote(void* from, size_t from_offset,
                                    void* to, size_t to_offset, size_t nbytes,
                                    TVMContext ctx_to, DLDataType type_hint) final;
// Proposed API
void RPCClientSession::CopyToRemote(void* from, size_t nbytes, DLTensor* to) final;

The work for this part is heavy, since many classes are involved in the data transfer procedure. Taking CopyToRemote as an example, the calling chain would be:

RPCClientSession::CopyToRemote
-> EndPoint::CopyToRemote
-> EventHandler::HandleCopyToRemote
-> RPCSession::AsyncCopyToRemote
-> RPCSession::CopyToRemote
-> LocalSession::CopyToRemote
-> DeviceAPI::CopyDataFromTo

We need to update all of those functions, but the idea is the same. Also, the RPC protocol version needs to be bumped.
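
To illustrate the last hop of that chain, here is a hedged sketch of how LocalSession might wrap the flat host buffer into a one-dimensional DLTensor before forwarding to the proposed DeviceAPI::CopyDataFromTo. The GetDeviceAPI lookup and field names (TVMContext-era DLTensor) are assumptions for this example, not the PoC code.

void LocalSession::CopyToRemote(void* from, size_t nbytes, DLTensor* to) {
  int64_t shape[1] = {static_cast<int64_t>(nbytes)};
  DLTensor host;                    // view the host buffer as raw bytes
  host.data = from;
  host.ctx = {kDLCPU, 0};
  host.ndim = 1;
  host.dtype = {kDLUInt, 8, 1};
  host.shape = shape;
  host.strides = nullptr;
  host.byte_offset = 0;
  // GetDeviceAPI stands in for the session's device-API lookup.
  GetDeviceAPI(to->ctx)->CopyDataFromTo(&host, to, nullptr);
}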

NDArray

NDArray Creation

// Current API
TVM_DLL static NDArray Empty(std::vector<int64_t> shape, DLDataType dtype,
                             DLContext ctx);

// Proposed API
TVM_DLL static NDArray Empty(std::vector<int64_t> shape, DLDataType dtype,
                             DLContext ctx, Optional<String> mem_scope=NullOpt);

With all of the backend changes above, we can now create an NDArray with a specific memory scope.
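
A brief usage sketch follows; the "global.texture" scope string is illustrative only, since the actual scope names depend on the backend.

// Allocate a 1x64x32x32 fp32 NDArray on an OpenCL device. Without a scope the
// behavior is unchanged; with a scope string, the device-specific allocation
// path is used.
DLContext ctx{kDLOpenCL, 0};
NDArray plain = NDArray::Empty({1, 64, 32, 32}, {kDLFloat, 32, 1}, ctx);
NDArray textured =
    NDArray::Empty({1, 64, 32, 32}, {kDLFloat, 32, 1}, ctx, String("global.texture"));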

More details can be found in the PoC PR: [Runtime] Special Memory Scope Support by ZihengJiang (apache/tvm#7488).
