[pre-RFC] Vectorized TIR Buffers

Summary

All pointers and buffer allocations in TIR are strongly typed, and may point to either scalar or vectorized elements (e.g. either float32* or float32x2*). Accessing a pointer or buffer with a single-lane index results in a value with the same element type as the pointer/buffer. Accessing a pointer or buffer with a multi-lane index results in a value of type element_type.with_lanes(element_type.lanes() * index.lanes()).
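For example, the lane arithmetic works out as follows (a toy model, not the real DataType API):

```python
# Toy illustration of the proposed result-type rule; the real rule lives
# in TVM's C++ DataType class, this only models the arithmetic.
def access_lanes(element_lanes, index_lanes):
    # An M-lane element type accessed with an N-lane index yields N*M lanes.
    return element_lanes * index_lanes

assert access_lanes(1, 1) == 1   # float32*   with scalar index  -> float32
assert access_lanes(2, 1) == 2   # float32x2* with scalar index  -> float32x2
assert access_lanes(2, 4) == 8   # float32x2* with int32x4 index -> float32x8
```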

Casts between pointer types with the same base type (e.g. from float32* to float32x2*) or between different base types (e.g. from int32* to float16*) may be present in the TIR graph. Optimization passes that can introduce such casts, such as StorageRewrite, should only do so if the target supports these casts. Codegen should not introduce additional pointer type casts beyond those specified in the TIR graph.
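For intuition, numpy's reinterpreting views can model both kinds of casts (an analogy only, not TVM code):

```python
import numpy as np

# Reinterpreting the same storage under a different element type, as a
# stand-in for the pointer casts described above.
buf = np.zeros(8, dtype="float32")   # models a float32* allocation
as_vec = buf.reshape(-1, 2)          # models float32* -> float32x2* (same base type)
as_f16 = buf.view("float16")         # models float32* -> float16*   (different base type)
assert as_vec.shape == (4, 2)
assert as_f16.shape == (16,)
```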

Vectorized loads/stores should be specified in the TIR graph as accesses into a pointer/buffer whose element type is vectorized. Codegen should assume that the TIR optimization passes have already vectorized any multi-lane indices where possible, and should not apply vectorization beyond what is specified in the TIR graph.

Motivation

TVM PR#8528, which resolved an issue with array access in the Vulkan runtime, led to and/or exposed some inconsistencies in the TIR semantics for buffer indices. Vulkan/SPIR-V require all arrays to be typed, and don't allow type casts that would be permissible in C code, such as casting between float32* and float32x2*. As a result, any vectorized load/store operations must act on an array whose elements are vectorized types. However, this is inconsistent with how stores/loads are expressed in TIR passed to other codegens.

Currently, many places in IR (e.g. Load::Load checking the number of output lanes) and codegen (e.g. CodeGenC's LoadNode visitor looping over output lanes rather than element/index lanes) implicitly assume that all array elements have lanes == 1. This is inconsistent with the type-checking requirements of SPIR-V, which requires vectorized loads/stores to occur on arrays with a multi-lane element type. The TIR semantics should be expanded to cover both use cases, and then be interpreted uniformly across all runtimes.

Guide-level explanation

  • A “vectorized load” or “vectorized store” refers to a memory access that copies several values with a single underlying instruction.

  • A “scalar type” is any TVM DataType with lanes == 1. These represent a single value.

  • A “vectorized type” is any TVM DataType with lanes > 1. These represent several values that can be acted on simultaneously.

By having all vectorized types be explicitly specified in the TIR graph, the logic to identify vectorized access can be moved into an optimization pass, and does not need to be repeated across all runtimes. This will simplify the implementation of new codegen targets, as less logic needs to be included in them.

This also allows additional type-checking to be performed during codegen. Prior to this RFC, if the datatype associated with a store/load doesn’t match the buffer’s type, the codegen is allowed to add a pointer cast. After this RFC, all pointer casts must be explicitly specified, and any type mismatch is an error.

Reference-level explanation

The following items would need to be implemented for this RFC.

  • Support for CastNode to indicate a pointer cast. The dtype of Cast(dtype, ptr_value) should be kHandle, with the type annotation of PointerType(PrimType(dtype)).

  • New optimization pass to identify uses of RampNode with a stride of 1, and to rewrite them as a pointer cast followed by access with a scalar index (see the sketch after this list).

  • Updates to StoreNode/LoadNode visitors in all C-based codegens

    • Removal of checks for a RampNode index. If any remain after the optimization pass, assume they are deliberate non-vectorized accesses.

    • The fallback explicit loop should iterate over both the lanes of the index and the lanes of the element type.

  • Checks in TIR Store::Store and Load::Load should identify the number of lanes in the array elements, asserting that value_lanes == element_lanes * index_lanes.
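A minimal sketch of the RampNode rewrite from the list above, using hypothetical stand-in classes (DataType, Ramp, and rewrite_ramp_access here are illustrative, not TVM's actual API):

```python
from dataclasses import dataclass

@dataclass
class DataType:
    code: str            # base type, e.g. "float32"
    lanes: int

    def with_lanes(self, lanes):
        return DataType(self.code, lanes)

@dataclass
class Ramp:              # stand-in for tir::RampNode
    base: int
    stride: int
    lanes: int

def rewrite_ramp_access(element_type, index):
    """Rewrite access with Ramp(base, 1, N) into a pointer cast to an
    N*M-lane element type plus the scalar index base // N."""
    if (isinstance(index, Ramp) and index.stride == 1
            and index.base % index.lanes == 0):
        cast_type = element_type.with_lanes(element_type.lanes * index.lanes)
        return cast_type, index.base // index.lanes
    # Other indices are left untouched (deliberate non-vectorized access).
    return element_type, index

# Load(float32x4, buf, Ramp(8, 1, 4)) becomes a cast to float32x4*
# (expressed as Cast with dtype kHandle and a PointerType annotation)
# followed by access with the scalar index 2.
print(rewrite_ramp_access(DataType("float32", 1), Ramp(8, 1, 4)))
```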

Drawbacks

This is an explicit change to the semantics of the TIR graph, which may result in unexpected breakage. Previously, all buffers were assumed to have scalar elements, and vectorization was done during the codegen step. Allowing buffers to have vectorized elements may break passes or codegens that implicitly assume all array elements have lanes == 1.

Rationale and alternatives

Possible options for how buffer stores/loads should be specified in the TIR graph, what semantics they carry, and how the codegen should interpret them are listed below.

  1. Buffer/pointer types

    a. Buffers are untyped, and have no type until/unless cast to an appropriate type (memset semantics).

    b. Buffers are typed, and the element type must be scalar (int arr[size]; semantics).

    c. Buffers are typed, and the element type may be either scalar or vectorized (int arr[lanes][size]; semantics).

  2. Casting of pointer types

    a. The codegen must cast all pointer types to the type specified by StoreNode and LoadNode (i.e. store_node->value.dtype() and load_node->dtype), regardless of the type of the buffer. This includes casting to a vectorized type with lanes > 1.

    b. The codegen must cast all pointer types to the type specified by StoreNode and LoadNode, but with the number of lanes set to 1 (i.e. store_node->value.dtype().element_of() and load_node->dtype.element_of()), regardless of the type of the buffer.

    c. Pointer types may be cast from one type to another, but the cast must be explicitly specified in the TIR graph. The dtype of Cast(dtype, ptr_value) should be kHandle, with the type annotation set to PointerType(PrimType(dtype)). Optimization passes that may introduce pointer casts should only do so if the target supports them.

    d. Pointer types may not be cast from one type to another. CastNode applies only to value types, and not to pointer types.

  3. Index values

    a. Indices/offsets are specified as an integer number of bytes.

    b. Indices/offsets are specified as an integer number of array elements.

  4. Index lanes, result type

    a. Indices must always have exactly one lane. The type of the value accessed is the same as the buffer’s element type.

    b. Indices may have more than one lane. The type of the value accessed is the buffer’s element type, but with the same number of lanes as the index. (e.g. When accessing a buffer of type float16x4* with an index of type int32x4, the result has type float16x4.) This is the current behavior when using a RampNode with a stride of 1 as an index.

    c. Indices may have more than one lane. The type of the value accessed is the buffer’s element type, but with the number of lanes equal to the product of the number of lanes in the index and the number of lanes in the buffer’s element type. (e.g. When accessing a buffer of type float16x4* with an index of type int32x4, the result has type float16x16.)

Prior to this RFC, CodeGenC and subclasses assume that buffers have a scalar element type (option 1b), that indices are specified in array elements (option 3b), that indices may have more than one lane (option 4b), and that the codegen may cast pointer types as needed to produce the requested output type (option 2a). To minimize the amount of code change needed, these options should remain the same unless there is a reason to change them.

Typing

Vectorized stores/loads in SPIR-V must occur on an array with a vectorized element type. Therefore, option 1c is preferred.

Pointer casting

The stronger typing required by SPIR-V shaders prevents use of options 2a and 2b: a pointer can only be dereferenced to its exact element type, and cannot even be cast between scalar and vectorized versions of the same base type. However, forbidding pointer casts altogether would prevent possible optimizations in runtimes that support them, so option 2d shouldn’t be used either. Option 2c, allowing pointer type casts that are explicitly specified in TIR, would let runtimes that support pointer casts take advantage of them, while avoiding them in runtimes that don’t. This would also concentrate the logic for enabling vectorized stores/loads in a single optimization pass, rather than repeating similar ramp-node checks in each codegen.

Index values

A byte-offset (option 3a) would make sense for compatibility with option 1a (untyped arrays), but that is not the preferred case. Since byte-offsets are not the current convention, and there is no obvious reason to switch to them, indices will continue to be specified in terms of array elements (option 3b).

Index lanes, result type

Prior to this RFC, codegens that support vectorized load/store have special handling of RampNode indices (e.g. CodeGenC, CodeGenSPIRV). In the case of CodeGenC, this can apply to arrays with either scalar elements (1-lane elements, N-lane indices, result has N lanes) or vectorized elements (M-lane elements, N-lane indices, result has N lanes), by applying pointer casts. In the case of CodeGenSPIRV, it can only apply to arrays with vectorized elements. The proposed types of buffer access are summarized in the tables below.

| Current CUDA behavior | Scalar Index | Vector Index (Ramp with stride=1, N lanes) | Vector Index (other, N lanes) |
|---|---|---|---|
| Scalar Elements | Scalar | Vector, N lanes | Vector, N lanes |
| Vector Elements (M lanes) | Not supported | Vector, N lanes | Not supported |

| Proposed semantics | Scalar Index | Vector Index (Ramp with stride=1, N lanes) | Vector Index (other, N lanes) |
|---|---|---|---|
| Scalar Elements | Scalar | Vector, N lanes | Vector, N lanes |
| Vector Elements (M lanes) | Vector, M lanes | Vector, N*M lanes | Vector, N*M lanes |

This gives consistent semantics: any access of M-lane elements with an N-lane index yields a value with N*M lanes (option 4c). This would be coupled with an optimization pass to rewrite a RampNode with stride=1 into a pointer cast followed by access with a scalar index, so the overall behavior of the RampNode remains the same.
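As a sanity check that the rewrite preserves behavior, here is an illustrative numpy analogue (numpy views stand in for the pointer cast; this is not TVM code):

```python
import numpy as np

arr = np.arange(16, dtype="float32")   # models a float32* buffer
base = 8                               # aligned: a multiple of the lane count

# Current form: Load(float32x4, arr, Ramp(base, 1, 4))
ramp_load = arr[base:base + 4]

# Rewritten form: cast float32* -> float32x4*, then access with a scalar index
vec_view = arr.reshape(-1, 4)          # models the float32x4* view
scalar_load = vec_view[base // 4]

assert (ramp_load == scalar_load).all()
```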

Prior art

Unknown, suggestions would be appreciated.

Unresolved questions

  • Is this new set of semantics internally consistent?

  • Are there incompatibilities between these semantics and other operators?

  • Are there issues that would arise from exposing functionality currently in the codegen to the TIR graph?

  • Which other places in TIR and codegen are likely to break as a result of allowing vectorized elements in an array?

Future possibilities

Having explicit pointer casts would also simplify the handling of boolean arrays. Prior to this RFC, several locations identify buffers that point to boolean tensors, and convert to an int8 backing array (e.g. Buffer::vload, Buffer::vstore, CodeGenSPIRV). As some runtimes do not support the use of int8, pulling this logic into an optimization pass will simplify this future change.
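For illustration, the int8 backing convention amounts to casts like the following numpy stand-in (not the actual Buffer::vload/vstore code):

```python
import numpy as np

# Boolean tensor stored in an int8 backing array, cast back on load.
bools = np.array([True, False, True])
backing = bools.astype("int8")     # store: bool -> int8 backing array
restored = backing.astype("bool")  # load:  int8 -> bool
assert (restored == bools).all()
```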


Thanks @Lunderberg. This is a great topic for clarifying the vector Load/Store semantics.

Specifically, Load/Store currently follow high-level vector load semantics: the original pointer keeps its scalar pointer type, and the pointer is always recast according to the access dtype. To load a vector, vector indices are used (via Ramp and other nodes). The code generator pattern-matches on the generated vector indices (e.g. Ramp) and their alignment to generate the right vector code.

Of course there can be alternative load semantics, e.g. allowing vector pointer types and changing the meaning of indices. Let us discuss the general pros and cons of each setup.

  • C0: vector indices
  • C1: index on vector pointer type

C0 is usually higher level, more flexible, and helps the earlier phases of vectorization. For example, when we start to vectorize a piece of index-access code in VectorizeLoop, it is easy to rewrite the indices to vector indices and simplify as much as possible. The drawback of C0 is that it is further away from the codegen, and not all patterns can be generated. Usually a scalarization (a change back to scalar code) is needed if the vector access pattern does not meet the target's needs.

C1 is closer to the codegen, since it is relatively easier to generate code from, but it is also restrictive enough to harm certain settings. For example, on GPU it is common to do a vectorized load into shared memory but then read from it as scalars. C1 also makes the vectorization rewriting harder in the high-level passes.

Right now TIR follows C0, for the rationale mentioned above. Considering Vulkan specifically, C1 would indeed help more code generate correctly in some cases (by restricting the semantics). However, we can also make C0 work, by inspecting the patterns and choosing to scalarize when not all vector access patterns can be supported.

Learning from C0 and C1, perhaps what we really need is a target-specific LegalizeVectorization pass. While the TIR C0 semantics is richer, a specific target may only support a subset (e.g. Vulkan requires that vector types not alias, or requires aligned load/store).

The LegalizeVectorization pass would try to rewrite high-level vectorized code into a restricted subset (e.g. vector load/store that only involves a Ramp of stride 1), and scalarize the code if it cannot. This pass could then become useful for different backends by configuring the specific subset we are looking for.

Thank you for the C0/C1 definitions. That is exactly the difference I had been thinking of, but I hadn’t been aware of the existing terminology. The pros and cons of C0/C1 also make sense: C0 fits better for high-level descriptions of the computation to be performed, while C1 fits better for low-level codegen.

I think that’s the same way I was leaning: the high-level descriptions are done in either C0 or C1, depending on which is more convenient, and the LegalizeVectorization pass then converts everything into the form required by the capabilities of each target. Whether that means normalizing fully to C1, or having the final result be somewhere in between, I like the method of pulling the logic out into a pass. I’d tend toward having LegalizeVectorization result in pure C1, but that’s at least partly due to me being most familiar with the Vulkan backend.

What is the advantage of leaving the index as a ramp, rather than having the pass convert it to a type cast followed by a scalar index? In my mind, the latter would be a simpler constraint to provide to the codegen (all indices are scalar, and every array access yields either the array element’s type or the explicitly specified cast type), and a simpler constraint for the legalization pass to follow (when legalizing for targets that do not support type casting, no type casts may be added).

@masahi Tagging following comments on [Vulkan] Rewrote PointerValueTypeRewrite transform by Lunderberg · Pull Request #8528 · apache/tvm · GitHub