[RFC] Texture memory support

csullivan · March 19, 2021, 11:49pm

Motivation

High performance inference on mobile phones often relies on the use of heterogeneous memory. This RFC proposes a path to support 2d texture memory in TVM as 1) a schedulable global memory and 2) a first class storage in the graph runtime. Additionally, while this RFC also provides an implementation for targeting OpenCL image2d_t RGBA textures, the foundational changes proposed here should be directly applicable to texture memory in other platforms.

Background on Texture

Texture memories contains 2D load and store rather than 1D,

// Example OpenCL kernel that reads in the texture memory
__kernel void TextureProgram(__read_only image2d_t input_texture) {
	// Ways to access the texture, we don't need very complicated ways
  // as we only want to make use of the texture cache.
  const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
  int2 pos;
  pos.x = 1;
  pos.y = 2;
  // load float4 value from the texture memory.
  float4 value = read_imagef(input_texture, sampler, pos);
}

A common feature of texture memory is the spatial cacheing they can provide. Most texture memory loads are cached by tiles, such that a sub-matrix in each location will be cached. This can decrease the memory latency for loading consecutive elements along both width and height of the 2d texture image, which can be useful for ML operators like convolution which require spatial accesses along two or more dimensions.
The lowest most dimension of a 2d texture, the channel axis, is typically of length 4(RGBA).
Additionally, many GPUs have a special pipeline and cache for texture memory, that can make it useful to shuffle data into texture first before reading it repetitively.

Approach

In general there are two approaches for supporting texture memory in TVM.

A0: Support texture memory as a schedulable global scratch space.
A1: Support texture as a special memory that can be used in the runtime.

This RFC proposes the enablement of both complimentary approaches, the details of which are described below.

A0: Support texture memory as a schedulable global scratch space

As a schedulable memory, texture can be exposed as a fast special memory (like shared-memory in GPU) that can be scheduled via an explicit cache_read. For example,

import tvm
from tvm import te

def addone(shape=(32, 32, 4), dtype="float32"):
  X = te.placeholder(shape, dtype=args.dtype)
  Y = te.compute(shape, lambda i, j, k: X[i, j, k] + 1)
  s = te.create_schedule(Y.op)
  Xt = s.cache_read(X, "texture", [Y])

  # copy to texture stage
  x, y, c = s[Xt].op.axis
  s[Xt].bind(x, te.thread_axis("blockIdx.x"))
  s[Xt].bind(y, te.thread_axis("threadIdx.x"))
  s[Xt].vectorize(c)

  # the compute stage
  x, y, c = s[Y].op.axis
  xo, yo, xi, yi = s[Y].tile(x, y, 4, 4)
  s[Y].bind(xo, te.thread_axis("blockIdx.x"))
  s[Y].bind(yo, te.thread_axis("threadIdx.x"))
  s[Y].vectorize(c)
  return tvm.driver.lower([X, Y])

The above code will likely generate the following two subsequent kernels

// Pseudocode for illustration only

// blockIdx.x in [0, 32]
// threadIdx.x in [0, 32]
__kernel data_shuffle(float* X[32][32][4], texture float* Xt[32][32][4]) {
	Xt[blockIdx.x][threadIdx.x] = (float4*)(X[blockIdx.x][threadIdx.x])[0];
}

// blockIdx.x in [0, 8]
// threadIdx.x in [0, 8]
__kernel compute(float* Y[32][32][4], texture float* Xt[32][32][4]) {
	for (int xi = 0; xi < 4; ++ xi) {
    for (int yi = 0; yi < 4; ++ yi) {
      float4 value = texture2d_load(Xt[blockIdx.x * 4 + xi][threadIdx.x* 4 + yi]);
	    Y[blockIdx.x * 4 + xi][threadIdx.x * 4 + yi] = value + 1;
    }
  }
}

// Mock runtime
void host_main(float* X[32][32][4], float* Y[32][32][4]) {
   // allocate a temp texture(cached in the runtime for reuse)
   texture Xt = AllocaTempTexture(32, 32);
   data_shuffle<<<32,32>>>(X, Xt);
   compute<<<4,4>>>(Y, Xt);
}

From the users’ point of view, texture is a special memory in the memory hierarchy, that can be used as an efficient cached multi dimensional load for repeated vector access.

Limitation:

L0: A potential limitation of this approach is the data movement it requires. The cost of shuffling data from a global buffer to texture memory can nullify a portion of the performance gain achieved from utilizing the texture cache. For this case, the approach A1 approach is described below.

To support texture as a schedulable global memory the following tasks are required:

T0: Texture memory lowering

Introduce a texture memory flattening pass that recognizes memory with texture scope, and lower the BufferRealize/Load/Store of the texture into builtin TIR intrinsics, tir.texture2d_alloca , tir.texture2d_load, and tir.texture2d_store.
Flatten the N-dimension access via to two dimensional access via flattened indices for texture width and height according to one of the defined texture lowering conventions.
Automatically vectorize loop of length 4 along RGBA channel axis, the fastest changing dimension (FCD)

Texture Layout

The current lowering path assumes two dimensional textures will always have the lowest dimension of size 4 (RGBA); in this way, the use of packed layouts for activations and weights are required.

As part of the Nd to 2d flattening process, this RFC proposes the definition of two lowering conventions for activations and weights respectively,

C0: storage_scope == “texture”, [A, B, C, D, 4] → [A * B * C, D, 4]
C1: storage_scope == “texture:weight”, [A, B, C, D] → [A, B * C * D, 4]

Separate Nd to 2d lowering conventions are employed for activation and for weights. For activations all dimensions save the last (excluding the vector length 4 dimension) are packed into columns of the texture image; for weights all dimensions save the first are packed into rows of the texture image.

Note: if any other weight layout lowering is desired, one can apply logical shape changes via a te.compute in the topi compute definition to ensure the flattening occurs as desired.

T1: Code generation support

Provide OpenCL code generation support to lower the load and store texture intrinsics to read_image and write_image.
Lower tir::Vars of a TextureType (or PointerType with scoping support) to OpenCL type image2d_t.
Infer read and write only access qualifiers to texture memory in order for the downstream compiler to best utilize texture cache pipeline.
Support single element indexing into an RGBA type4 texture read to enable outer product style broadcasting.

An implementation of T0 and T1 can be found in PR #7686.

T2: Runtime support for texture scratch allocation

Lower tir.texture2d_alloca to either a runtime packed function or C runtime API call that returns a handle of defined texture.
Generate Free for texture allocation so that workspace recycling can be employed for the texture allocation.
Register both runtime PackedFunc that allocate and free the texture workspace
Build a runtime allocator that allocates texture memory with recycling, creating a large enough texture space for general use.

The OpenCL Device API changes proposed in T2 can be found in PR #7711.

A1: Support texture as a special memory that can be used in the runtime

Supporting texture memory in the runtime has the advantage of enabling operators to consume inputs from texture and write outputs directly to texture without requiring any data movement to or from global memory. For mobile platforms where “global” memory may actually be system memory shared by all platform processing elements, not distinct to the GPU, the latency of reading from global system memory can be quite expensive. In this case it is advantageous to maintain the primary working set of inference intermediate memory in GPU-local texture cache.

In order to use both global and texture memory together in the runtime, TVM must support heterogeneous memory. This can be achieved generically through the use of optional storage scopes in,

L0: The device runtime
L1: Relay graph runtime codegen, compile engine, and memory planning (and the equivalent in the VM executor)
L2: The graph runtime (or VM runtime)

L0: Device runtime support

Device runtime support for special memory scopes were recently added in https://discuss.tvm.apache.org/t/runtime-support-for-special-memory-scope/9187. The goal of this RFC was to allow NDArray allocation with the use of a storage scope. This enables the Device API to allocate memory of a specific type for use in the runtime.

Additionally the AllocDataSpace API was refactored to allocate storage based on an ND shape rather than a flat size. This can allow backends then to target finite shape based allocations of 1 (flat buffer or 1d texture), 2 (2d texture, image2d_t), or 3 (Set of 2d textures, image2d_array_t) dimensions.

This RFC proposes to utilizes the texture layout lowering conventions described above in T0 for lowering Nd-shape allocations with the texture or texture:weight scopes to 2d clCreateImage allocations. An implementation of the proposed changes for L0 (and also T2 above) can be found in PR #7711.

L1: Relay graph runtime codegen, compile engine, and memory planning

Note: As a first implementation the graph runtime will be the focus of this RFC, but a mostly equivalent discussion can be made for the VM runtime.

In addition to low level device runtime support, the upper levels of the graph runtime, as well as the Relay lowering paths must make use of special memory storage scopes. This RFC proposes to support this starting from Relay level memory planning and propagating storage information down to graph runtime codegen, compile engine, and ultimately the graph runtime.

L1.1: Graph memory planner

The memory planner is responsible for planning the runtime memory use. This RFC proposes a key change to allow storage scope information to be derived from the planned relay function based on a specific backend target’s capability.

Introduce a target dependent CollectStorageInfo pass that collects storage scope info for the output each operation, similar in functionality to CollectDeviceInfo for heterogeneous compute,

/*! \return The internal token map */
  std::unordered_map<const ExprNode*, std::vector<StorageToken*> > GetInitTokenMap(
      const Function& func, **const TargetsMap& targets**) {
    node_device_map_ = CollectDeviceInfo(func);
    **node_storage_map_ = CollectStorageInfo(func, node_device_map_, targets);**
    this->Run(func);
    return std::move(token_map_);
  }

CollectStorageInfo will dispatch to a packed function registered for a specific target, for example,

TVM_REGISTER_GLOBAL("relay.backend.opencl.adreno._CollectStorageInfo")
  .set_body_typed(CollectTextureStorage);

and if no such implementation exists, the storage mapping for each operator will remain empty and the memory plan defaults to global scope.

Update memory planner to support storage token optimization based on the collected storage scope information. See addendum below.

L1.2: Graph Runtime Codegen

The Graph Runtime Codegen backend for Relay then uses the new storage scope information provided by the memory planner to:

Annotate the graph JSON with an additional storage_scope field along side the existing fields such as storage_id so that the graph runtime may use this scope to generate a memory scope tag to pass to the device runtime as described in L0.
Introduce the CollectBufferBinds packed function which generates tir::Buffers (with the storage_scope set from the storage info plan) for each primitive function and provides these to the CompileEngine for use in lowering the prim func via the binds field of tvm::lower/build.

L1.3: Compile Engine

Update compile engine APIs to consume the above mentioned tir::Buffers to be used as tensor binds, and also part of the prim func’s CCacheKey so that structurally equivalent primitive functions with distinct function signatures are cached separately.

L1.4: Graph Runtime

Update the graph runtime to parse the new storage_scope field, and utilize it congruously with the storage_ids to build runtime dataspace storage pool allocations of the correct storage scope.

See PR #7688 for foundational changes corresponding to L1.1-4, and Adreno specific target implementations of CollectStorageInfo and CollectStorageBinds can be found in #7689.

Addendum: Storage scope based memory planning

In addition to collecting a target’s supported storage scope information for each operator output, the StorageToken allocation framework provided by the graph memory planner can be extended to support planning based on the storage scope.

This RFC proposes to extend the existing one-dimensional bin packing employed by the memory planner to also support two dimensional storage token optimization for use with storage tokens of texture scope. #7690

GraphMemoryPlanner

Refactor StorageAllocator Request, Alloc, and CheckRelease into a TokenAllocator, which can dispatch to a one or two dimensional token allocator based on the token storage scope.
TokenAllocator1D’s implementation is identical to the current implementation.
TokenAllocator2D utilizes a similar bin packing strategy but also tracks the excess space that increasing a storage pool would incur. The strategy is as follows,
- For a given token allocation request search among the idle texture pools, keeping track of which pool would require the least amount of additional memory. If a requested allocation fits within an existing pool, no additional memory would be required. If not, expand the pool which results in the least amount of added space.
- For any set of free/idle storage pools which do not require expansion for the given allocation request, choose the pool which would minimize the amount of wasted memory.
- Note also that during optimization requests are only queried against candidate pools of the same type as this is a requirement of texture memory alignment.

Graph runtime

The graph runtime PoolEntry previously tracked a single dimension for allocation, size. It is proposed to replace the size field with a std::vector<int64_t> shape. For global scope, this shape is only one dimensional. For texture scope this shape is three dimensional (width, height, channel length 4).
The graph runtime uses the storage ids and scope information to grow pool entries along one or two axes for storage pools of global or texture scope, respectively.
For 1D pools nothing has changed, the size of the allocation is determined from the maximum number of bytes required for all entries in a storage pool, and the resulting pool is of kDLFloat type and byte aligned.
For 2D texture pools the size of an image pool is determined from the maximum number of row and column elements required for all entries of a storage pool, and retains the type of the underlying pool entries.

Addendum: Adreno target and topi schedules

This RFC proposes the introduction of the opencl --device=adreno target and corresponding relay strategies to take advantage of texture support in TVM. The conv2d schedules introduced here utilize spatial and channel-wise packing for the weights (OIHW4o) and activations (NCHW4c), respectively, both with vector length 4 to support lowering to RGBA texture memory. See PR #7687.

Summary

A0 and A1 of this proposal enable the use of texture memory in both the operator and graph runtimes in TVM and together can improve TVM’s ability to generate state of the art inference runtimes for mobile phones. Your comments and review on this proposal are highly appreciated. Relevant PRs have been noted in each section of this RFC, and are reproduced here,

[Texture support][Part 0] Device API and runtime support #7711

[Texture support][Part 1] TIR lowering and OpenCL support #7686

[Texture support][Part 2] Add opencl adreno target, topi schedules, and relay op strategies #7687

[Texture support][Part 3] Support storage scope tag in graph runtime codegen, planning, runtime and compile engine #7688

[Texture support][Part 4] Add CollectStorageInfo and CollectBufferBinds relay passes for Adreno GPU #7689

[Texture support][Part 5] Graph runtime and memory planning support for 2d allocations #7690

masahi · March 20, 2021, 12:40am

I think texture support for Vulkan / SPIR-V would be a great follow up! I’ll put that in my list.