[DISCUSS] Inplace and Fusion Opportunities in LLM Runtime

Inplace and Fusion Opportunities

This is a quick note about the opportunities we can enable through inplace support. While inplace operations do not necessarily bring better speedup, there are quite a few opportunities in transforming the program to make better use of memory.

One motivating example is the embedding operator in LLM serving. Consider a multimodal LLM where we have vision encoding and text embedding lookups.

class Module:
    def image_encoding(image, params):
        # a simple vision pipeline to project into embedding space
        lv0 = conv2d(image, params[0])
        lv1 = flatten(lv0)
        lv2 = matmul(lv1, params[1])
        return lv2

    def embedding_lookup(token_ids, params):
        # look up embedding for text
        lv0 = embedding_lookup(token_ids, params[3])
        return lv0

In order to make use of both the image and text modalities, we need to run the embedding lookup and the image encoding, then concatenate the results together at runtime.

def runtime_make_embedding(mod, token_ids, image, params):
    """This code runs in the engine to enable flexible
    customization of the batching/chunking strategy."""
    img_embedding = mod["image_encoding"](image, params)
    text_embedding = mod["embedding_lookup"](token_ids, params)
    # assume we have a concat function
    final_embedding = mod["concat"](text_embedding, img_embedding)
    return final_embedding

Note that the above example is somewhat simplified, as there might be interleaving of text before/after images, which would require different ways of concatenating. The main issue here is that we have to support separate allocations for img_embedding, text_embedding and final_embedding. One main challenge is that the sizes of the image and text embeddings can change depending on the scenario. In a serving setting, we would like to have strong control over memory use and keep allocation as static as possible. A typical way to achieve this is to set a maximum chunk size, which corresponds to the maximum sequence length that final_embedding can take. We would then use the remaining memory for other purposes (e.g. maximizing the KV cache).
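
As a rough illustration, here is a minimal sketch (plain Python; the helper name and byte sizes are hypothetical, not part of the proposal) of the budgeting the engine would like to do once the maximum chunk size is fixed:

def plan_engine_memory(total_bytes, max_chunk_size=256, hidden_size=128, dtype_bytes=2):
    # fixed-size buffer for the concatenated embeddings ("f16" => 2 bytes per element)
    embedding_bytes = max_chunk_size * hidden_size * dtype_bytes
    # whatever remains can go to other uses, e.g. maximizing the KV cache
    kv_cache_bytes = total_bytes - embedding_bytes
    return embedding_bytes, kv_cache_bytes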

For this scenario, we would ideally like a different way to call the functions (through destination passing), as shown below:

class ModuleUpdated:
    @T.prim_func
    def copy_into(
      inp: T.Buffer(("m", 128), "f16"),
      out: T.Buffer((256, 128), "f16"),
      offset: T.int32
    ):
        m = T.int32()
        for i, j in T.grid(m, 128):
            out[i + offset, j] = inp[i, j]

    def image_encoding(image, params,
                       final_embedding : R.Tensor((256, 128), "f16"),
                       offset):
        # a simple vision pipeline to project into embedding space
        lv0 = conv2d(image, params[0])
        lv1 = flatten(lv0)
        lv2 = matmul(lv1, params[1])
        # write lv2 into final_embedding starting at row `offset`, in place
        lv3 = R.call_tir_inplace(
          copy_into, [lv2, final_embedding],
          inplace_indices=[1],
          out_sinfo=R.Tensor((256, 128), "f16"),
          tir_vars=R.shape([offset])
        )
        return lv3

    def embedding_lookup(token_ids, params,
                         final_embedding : R.Tensor((256, 128), "f16"),
                         offset):
        # look up embedding for text
        lv0 = embedding_lookup(token_ids, params[3])
        # write lv0 into final_embedding starting at row `offset`, in place
        lv1 = R.call_tir_inplace(
          copy_into, [lv0, final_embedding],
          inplace_indices=[1],
          out_sinfo=R.Tensor((256, 128), "f16"),
          tir_vars=R.shape([offset])
        )
        return lv1

Some remarks:

  • 256 is the maximum sequence length of the final embedding
  • The image encoding and embedding lookup now explicitly take the final output and an offset (which indicates the sequence index at which we should start writing)
    • This can be viewed as a special form of destination passing where we directly write into a region of the final data structure
  • Now the engine can just allocate a fixed-size embedding based on the maximum chunk size, as shown below:

def runtime_make_embedding(mod, token_ids, image, params):
    """This code runs in the engine to enable flexible
    customization of the batching/chunking strategy."""
    final_embedding = pre_allocate((256, 128), "f16")
    # run the inplace operations, each writing into its own region;
    # image_encoding_len is the number of rows written by image_encoding
    final_embedding = mod["image_encoding"](image, params, final_embedding, 0)
    final_embedding = mod["embedding_lookup"](
        token_ids, params, final_embedding, image_encoding_len)
    return final_embedding

Additionally, we can enhance FuseOps and FuseTIR to allow fusion of copy_into with the preceding TensorIR operators. This would give us a directly fused operator and allow the matmul or embedding lookup to write directly into the corresponding region of final_embedding.
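
To make the idea concrete, here is a rough sketch of what such a fused PrimFunc could look like for the embedding lookup path, following the pseudocode style above (the function name, vocabulary size and shapes are illustrative, not the actual output of the passes):

@T.prim_func
def fused_embedding_lookup_copy_into(
    token_ids: T.Buffer(("m",), "int32"),
    weight: T.Buffer((32000, 128), "f16"),
    out: T.Buffer((256, 128), "f16"),
    offset: T.int32
):
    m = T.int32()
    for i, j in T.grid(m, 128):
        # each looked-up row lands directly at its final location,
        # so the intermediate (m, 128) buffer disappears
        out[i + offset, j] = weight[token_ids[i], j]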

Notably, this need arises from an interesting interaction between the runtime and compiled functions: the runtime would like a fixed-size memory allocation for the concatenated values, and we can use compiler techniques to transform functions that previously did not handle inplace updates into an inplace form that is memory efficient.


cc @slyubomirsky @MasterJH5574

Thank you for starting this discussion, Tianqi. I've been looking into doing in-place split and concat, as we've discussed before, which also requires writing into buffers with an offset. Accommodating writes with an offset might require some changes in both Relax and TE: TIR tracks the buffer offset field for tensors, but Relax and TE assume an offset of 0 everywhere. On the Relax side, we would need to track offsets in the type system; in TE, we would have to ensure that we can accommodate nonzero offsets or generate TIR code that checks for a dynamic offset.

The way this can be accomplished right now:

s = R.memory.alloc_storage(R.shape([64]), virtual_device_index=0, storage_scope="global", dtype="float32")
t1 = R.memory.alloc_tensor(s, offset=0, shape=R.shape([2, 3]), dtype="float32")
t2 = R.memory.alloc_tensor(s, offset=0, shape=R.shape([6]), dtype="float32")
t3 = R.memory.alloc_tensor(s, offset=32, shape=R.shape([2, 3]), dtype="float32")

The elements of t1 and t2 are aliased while t3 is farther along in the same allocated storage. Another tensor that overlaps these regions could work similarly. This, however, requires messing with the internals of StaticPlanBlockMemory. We could consider making wrappers that are essentially lowered to code like this by StaticPlanBlockMemory, but we would have to deal with the presence of offsets on the TIR side.
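
For reference, TIR itself can already describe a view that starts at a nonzero element offset into its backing allocation. Below is a small sketch (a hypothetical PrimFunc with illustrative shapes) of the kind of dynamic-offset code the TE/TIR side would need to be able to generate or accept:

from tvm.script import tir as T

@T.prim_func
def read_with_offset(data: T.handle, out: T.Buffer((6,), "float32")):
    # bind `data` to a view that begins at a dynamic element offset
    off = T.int32()
    view = T.match_buffer(data, (6,), "float32", elem_offset=off)
    for i in T.serial(6):
        out[i] = view[i]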

I have talked about this with Eric @Lunderberg; he might like to make some comments about the changes that would be needed in TE.

Ah, I misunderstood; I see you're using a PrimFunc to do the copying, which will not require an offset on the buffer itself, so this doesn't suffer from the issue I mentioned. If we can indeed use fusion to build these offsets into PrimFuncs, that would not be as disruptive.

One thing that might be an issue is that using copy_into is in a sense “invisible” to operators. For example, if you have an output tensor t that is of shape (256,) and write things into it with an offset of 128, other operators will only know that t is of shape (256,) and not that there is new content starting at index 128. Using views or alternative allocations would address this, but would introduce a lot of complexity in other regards.

I think the fusion approach might be easier than messing with memory allocations, though, since it does not require large changes to the language and the worst that could happen is having some leftover intermediate allocations.


Right, in this case we keep all the TE as it is, but would have a TensorIR inplace scatter operation that can be fused. This would interact nicely with the runtime layer, which might want to allocate the bigger chunk of memory.


See [Unity][Transform] Handle `call_tir_inplace` in `FuseTIR` and `FuseOps`. by slyubomirsky · Pull Request #16487 · apache/tvm · GitHub. I will have to experiment to see if this will be able to accommodate the test case listed here.