Thanks for your reply, appreciate the help.
Is it possible to achieve this using TE?
I am trying to implement a matmul schedule as follows:
import tvm
from tvm import te

M = te.var("M", "int32")
K = te.var("K", "int32")
N = te.var("N", "int32")
# Algorithm
k = te.reduce_axis((0, K), "k")
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")
######### START OF SCHEDULE ###########
s = te.create_schedule(C.op)
##### System ####
##### Read (A,B) External -> Internal
AL = s.cache_read(A, "local.vtcm", [C])
BL = s.cache_read(B, "local.vtcm", [C])
##### Write (C) Internal -> External
CC = s.cache_write(C, "local.vtcm_Block_2")
##### For now, let's set tile_k to be 128
tile_k = 128
tile_m = 64
tile_n = 64
##### Tiling for Reading AL buf , BL buf , and for the kernel CC #####
m_a, n_a = s[AL].op.axis
ma_outer, na_outer, ma_inner, na_inner = s[AL].tile(m_a, n_a, tile_m, tile_k)
m_b, n_b = s[BL].op.axis
mb_outer, nb_outer, mb_inner, nb_inner = s[BL].tile(m_b, n_b, tile_k, tile_n)
mc, nc = s[CC].op.axis
mc_outer, nc_outer, mc_inner, nc_inner = s[CC].tile(mc, nc, tile_m, tile_n)
##### Kernel ####
# We want to load 16 units each iteration
no, ni = s[CC].split(nc_inner, factor=16)
(kaxis,) = s[CC].op.reduce_axis
ko, ki = s[CC].split(kaxis, factor=tile_k)
s[CC].reorder(ko, mc_inner, ki, no, ni)
s[CC].vectorize(ni)
##### Tiling for Writing C to external ######
m, n = s[C].op.axis
m_outer, n_outer, m_inner, n_inner = s[C].tile(m, n, tile_m, tile_n)
###### Move the read and write stages to the tiled compute iterators #######
s[AL].compute_at(s[CC], ko)
s[BL].compute_at(s[CC], ko)
s[CC].compute_at(s[C], n_outer)
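For reference, this is roughly how I lower the schedule to inspect the resulting TIR (a minimal sketch; the exact argument list is my assumption):

# Sketch: lower the schedule to TIR to inspect the generated allocations
print(tvm.lower(s, [A, B, C], simple_mode=True))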
The following IR is generated for AL and BL:
A_local_vtcm = T.allocate([8192], "float32", "local.vtcm")
B_local_vtcm = T.allocate([8192], "float32", "local.vtcm")
However, I need the allocation to have the shape [2][4096], i.e. a two-dimensional allocation rather than a flat one.
If this is not possible to achieve in TE, how is axis_separator supported by the DeviceAPI in the AllocDataSpace function?
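To make the question concrete, here is a minimal sketch of what I imagine the TE-side answer might look like, assuming Stage.transform_layout together with te.AXIS_SEPARATOR is the relevant hook for non-flat allocations (the index map itself is illustrative and unverified):

# Assumption: te.AXIS_SEPARATOR in a transform_layout index map splits the
# physical allocation at that point, so the buffer would be allocated as 2-D
# instead of flat. The exact mapping below is only illustrative.
s[AL].transform_layout(
    lambda m, k: [
        m // (tile_m // 2),   # intended first physical dimension
        te.AXIS_SEPARATOR,    # split the allocation into two dimensions here
        m % (tile_m // 2),    # remaining rows
        k,                    # columns
    ]
)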