Unpack API with Relax

We’ve observed significant overhead (approximately 1 million cycles) from PackedFunc calls, which accounts for a substantial percentage of the overall cycles for our models. We would like to find a way to avoid PackedFunc calls and instead pass parameters directly to the subsequent ops, since they will be running on the same processor.

In Relay’s Ahead-of-Time (AOT) workflow, I see that model compilation happens without PackedFunc calls via a TIR pass, MakeUnpackedAPI. However, when I set the target to llvm --host=llvm, this workflow didn’t work directly for me. Instead, I had to add some intrinsic calls in codegen_llvm.cc, which allowed me to generate LLVM IR for a simple example. This approach worked well on hardware.

After going through the Relax VM design, I understand that Relax is primarily designed to work with PackedFunc calls, utilizing the TIR MakePackedAPI pass.
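For reference, here is a minimal sketch of the two calling-convention passes in question, applied manually to an IRModule mod of PrimFuncs (the exact position of these passes inside the real build pipelines may differ):

from tvm import tir

# MakePackedAPI lowers each PrimFunc to the universal packed signature,
# with arguments marshalled through TVMValue arrays; this is the form the
# Relax VM invokes.
packed_mod = tir.transform.MakePackedAPI()(mod)

# MakeUnpackedAPI instead keeps plain typed parameters, avoiding the
# pack/unpack cost at every call site.
unpacked_mod = tir.transform.MakeUnpackedAPI()(mod)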

Now, I’m seeking suggestions on how to effectively incorporate MakeUnpackedAPI into the Relax workflow.

Thank you in advance!

CC: @tqchen @Lunderberg


Thanks for the note. Previously we did not see the PackedFunc as a bottleneck, but would love to learn about the use case and to improve the Relax system’s performance overhead.

Thank you @tqchen for your response!

Our goal is to enhance model performance on Hexagon. To ensure that data fits into VTCM, we need to slice the operators (including memory management), resulting in a higher number of calls (cpacked_lowered and packed_lowered). All of these calls go through PackedFunc, and when I use LWP to profile the overhead introduced by calling TVMBackendAnyListSetPackedArg, TVMBackendAnyListMoveFromPackedReturn, tvm_struct_set, etc., it falls between 1000 and 1500 cycles per op invocation. While this number seems small for a single operator, for a simple model like ResNet50 we have over 800 such calls to cpacked_lowered and packed_lowered, resulting in a significant amount of time spent on packing and unpacking.

Given this, we’re interested in reducing the runtime overhead caused by PackedFunc calls. Would you recommend trying to run MakeUnpackedAPI with the Relax workflow?

As always, thank you for your inputs/suggestions! 🙂

Another option, in addition to MakeUnpackedAPI, would be to generate PrimFunc implementations that do not have the "global_symbol" attribute (tvm::attr::kGlobalSymbol in the C++ API). Granted, this depends on where the caller is located, as it requires the caller and callee to be in the same IRModule at that point. For calls from Relax to TIR (as I’m guessing from the TVMBackendAnyListSetPackedArg), it’s less useful, since an internal TIR function wouldn’t be callable from the Relax VM.
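As a rough sketch of this idea (assuming a recent TVMScript, where private=True is what omits the "global_symbol" attribute; the exact syntax for internal cross-function calls may vary between TVM versions):

from tvm.script import ir as I
from tvm.script import tir as T

@I.ir_module
class Module:
    # Internal function: no "global_symbol" attr, so it is only callable
    # from within this IRModule (and not from the Relax VM).
    @T.prim_func(private=True)
    def add_one(A_data: T.handle("float32"), B_data: T.handle("float32")):
        A = T.decl_buffer((10, 20), "float32", data=A_data)
        B = T.decl_buffer((10, 20), "float32", data=B_data)
        for i, j in T.grid(10, 20):
            B[i, j] = A[i, j] + T.float32(1)

    # Externally visible entry point that calls the internal function
    # directly, without going through the PackedFunc convention.
    @T.prim_func
    def main(A: T.Buffer((10, 20), "float32"), B: T.Buffer((10, 20), "float32")):
        Module.add_one(A.data, B.data)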

There are also a few ways that the number of calls to PackedFunc instances used by the Relax VM could be reduced. While they aren’t (yet) in the default Relax pipeline, the Relax transforms ExpandTupleArguments, RemoveUnusedParameters, RemoveUnusedOutputs, and InlinePrivateFunctions may reduce the number of calls required to prepare arguments for and return values from subroutine calls, either by replacing tuple arguments with directly-representable arguments or by removing the function calls altogether.
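Since these passes aren’t in the default pipeline, they would need to be run explicitly. A hedged sketch, assuming mod is the Relax-level IRModule and that this ordering (just one reasonable choice) is applied before the standard build:

import tvm
from tvm import relax

seq = tvm.transform.Sequential(
    [
        relax.transform.ExpandTupleArguments(),
        relax.transform.RemoveUnusedParameters(),
        relax.transform.RemoveUnusedOutputs(),
        relax.transform.InlinePrivateFunctions(),
    ]
)
mod = seq(mod)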


Great discussion, and sorry for being late on this. I think one way to address this issue would be to have some form of continuous TIR call inlining pass after lowering; let us call it FuseAndUnpackContinuousTIRCalls.

The following code shows one possible Before/After of this transformation. The pass will

  • Recognize a continuous region of low-level TIR calls that only involves types it recognizes, e.g.
    • Storage
    • Tensor
    • Calls into functions that we know won’t retain references to the Tensor memory
  • Lift that continuous region into another TIR function, fused_bulk_call
  • Create unpacked variants of the TIR functions involved and redirect the calls to them as unpacked calls.
@I.ir_module
class Before:
    @T.prim_func
    def add(
        A: T.Buffer((10, 20), "float32"),
        B: T.Buffer((10, 20), "float32"),
    ):
        ...

    @T.prim_func
    def relu(
        A: T.Buffer((10, 20), "float32"),
        B: T.Buffer((10, 20), "float32"),
    ):
        ...

    @R.function
    def main(X: R.Tensor((10, 20), "float32")):
        # code after memory planning
        s0 = R.memory.alloc_storage((200,), "float32")
        s1 = R.memory.alloc_storage((200,), "float32")

        lv0 = R.memory.alloc_tensor(s0, R.Tensor((10, 20), "float32"))
        lv1 = R.memory.alloc_tensor(s1, R.Tensor((10, 20), "float32"))

        R.call_tir_lowered(add, [X, lv0])
        R.call_tir_lowered(relu, [lv0, lv1])
        return lv1


@I.ir_module
class After:
    @T.prim_func
    def add_unpacked(
        Adata: T.handle,
        Bdata: T.handle,
    ):
        A = T.decl_buffer((10, 20), "float32", data=Adata)
        B = T.decl_buffer((10, 20), "float32", data=Bdata)
        ...

    @T.prim_func
    def relu_unpacked(
        Adata: T.handle,
        Bdata: T.handle,
    ):
        A = T.decl_buffer((10, 20), "float32", data=Adata)
        B = T.decl_buffer((10, 20), "float32", data=Bdata)
        ...

    @T.prim_func
    def fused_bulk_call(
        A: T.Buffer((10, 20), "float32"),
        s0: T.Buffer((200,), "float32"),
        B: T.Buffer((10, 20), "float32"),
    ):
        # or have another intrinsic to directly call unpacked
        T.call_extern("int32", "add_unpacked", A.data, s0.data)
        T.call_extern("int32", "relu_unpacked", s0.data, B.data)

    @R.function
    def main(X: R.Tensor((10, 20), "float32")):
        # code after memory planning
        s0 = R.memory.alloc_storage((200,), "float32")
        s1 = R.memory.alloc_storage((200,), "float32")

        lv0 = R.memory.alloc_tensor(s0, R.Tensor((200,), "float32"))
        lv1 = R.memory.alloc_tensor(s1, R.Tensor((10, 20), "float32"))

        R.call_tir_lowered(fused_bulk_call, [X, lv0, lv1])
        return lv1

The main tension we need to resolve here is between two needs:

  • N0: the need to remain flexible and enable a composite set of generic types
  • N1: the need to be efficient, knowing that low-level TIR codegen generally does not manage memory or reference counting.

The solution aims to split the program by recognizing that some of the high-level code needs the generality of N0, but most of the code operates on simpler types, so we can afford to perform allocation outside in the Relax VM and then call into a fused function that simply operates on the already-allocated memory.

If done correctly, this could bring the benefit of efficiency while still enabling generality for the code that needs it.


Thank you @tqchen and @Lunderberg for your ideas!

Currently, I am engaged in some high-priority tasks requiring immediate attention, so I will try this approach once I have more bandwidth available.
