[RFC] Unified Static Memory Planning

Background

Currently, given an ML model, TVM will primarily generate two main artifacts :

  • A1 : a description of the sequential execution of operators :
    1. If the “executor” is “graph”, this would be a JSON
    2. If the “executor” is “aot”, this would be a main function describing the call graph of operators
  • A2 : library of operators (in the form of runtime.Module)

A1 is generally created by lowering the “main” Relay function, and A2 is created by lowering fused Relay primitive functions → TIR PrimFuncs → C or LLVM artifacts of the operator library.

Is there some sort of memory planning already being performed?

Yes, there is.

For A1, the inter-(fused)-operator tensors are visible in the “main” Relay function. Thus, there currently exists a Relay-level pass known as “GraphPlanMemory” that works on the Relay IR to share the space used by tensors that are visible between (fused) operators and are not live simultaneously. Currently, the said pass uses the Shared Memory Buffer Object memory planning scheme (see Optimizing TensorFlow Lite Runtime Memory — The TensorFlow Blog) to perform the planning.
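To make the sharing idea concrete, here is a minimal Python sketch (purely illustrative, not TVM's implementation) of a shared-object style planner: tensors whose live ranges do not overlap are mapped onto the same storage object, so the total footprint is the sum of the shared objects rather than of all tensors.

```python
# Illustrative shared-object planner sketch; names and structure are hypothetical.
def plan_shared_storage(tensors):
    """tensors: list of (name, size_bytes, first_use, last_use)."""
    storage = []      # each record is [max_size_seen, last_use_of_occupant]
    assignment = {}   # tensor name -> storage object id
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        # Reuse the first storage object whose previous occupant is dead.
        for sid, rec in enumerate(storage):
            if rec[1] < first:
                rec[0] = max(rec[0], size)
                rec[1] = last
                assignment[name] = sid
                break
        else:
            storage.append([size, last])
            assignment[name] = len(storage) - 1
    total = sum(rec[0] for rec in storage)
    return assignment, total

# Three tensors: A and C are never live together, so they can share one object.
tensors = [("A", 100, 0, 1), ("B", 50, 1, 2), ("C", 80, 2, 3)]
assignment, total = plan_shared_storage(tensors)
```

Here A and C end up in the same storage object, and the workspace totals 150 bytes instead of 230.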

For A2, the operators are lowered to TIR PrimFuncs. There exists a pass called StorageRewrite that does more or less the same thing as “GraphPlanMemory”, but on TIR, for tensors that are visible within (fused) operators and are not live simultaneously.

Motivation

For embedded use-cases, it is widely accepted that aggressive memory optimizations are vital. Initially, we are looking at enabling memory planning for embedded use-cases using the AoT executor.

There are two main shortcomings of the current approach :

  • The memory used by intermediate tensors within operators is not shared with the memory used by inter-operator tensors.

Example TIR :

primfn(placeholder_3: handle, placeholder_4: handle, placeholder_5: handle, T_cast_1: handle) -> ()
  attr = {"global_symbol": "fused_nn_conv2d_add_fixed_point_multiply_clip_cast_cast_21", "tir.noalias": True}
  buffers = {T_cast: Buffer(T_cast_2: Pointer(int16), int16, [1, 56, 56, 128], []),
             placeholder_2: Buffer(placeholder_6: Pointer(int32), int32, [1, 1, 1, 128], []),
             placeholder: Buffer(placeholder_7: Pointer(int16), int16, [1, 56, 56, 128], []),
             placeholder_1: Buffer(placeholder_8: Pointer(int16), int16, [3, 3, 128, 1], [])}
  buffer_map = {placeholder_3: placeholder, placeholder_4: placeholder_1, placeholder_5: placeholder_2, T_cast_1: T_cast} {
  attr [PaddedInput: Pointer(int16)] "storage_scope" = "global";
  allocate(PaddedInput, int16, [430592]);
  attr [DepthwiseConv2d: Pointer(int32)] "storage_scope" = "global";
  allocate(DepthwiseConv2d, int32, [401408]) {
    for (i1: int32, 0, 58) {
      for (i2: int32, 0, 58) {
        for (i3: int32, 0, 128) {
          PaddedInput[(((i1*7424) + (i2*128)) + i3)] = @tir.if_then_else(((((1 <= i1) && (i1 < 57)) && (1 <= i2)) && (i2 < 57)), (int16*)placeholder_7[((((i1*7168) + (i2*128)) + i3) - 7296)], 0i16, dtype=int16)
        }
        ...

The above TIR snippet shows two intra-operator buffers, PaddedInput and DepthwiseConv2d, that are not visible to the Relay GraphPlanMemory pass and therefore cannot be shared.

  • Assumption of local optimization : performing sharing inside the operator first, and subsequently sharing that workspace with inter-operator tensors, would be sub-optimal.

Thus, for embedded use-cases, we’d need a unified static memory planner that performs memory planning of all tensors holistically to achieve the best memory utilization.

Goals

G1. There would be no TVMBackendAlloc(/Free)Workspace calls generated for tir.allocates that could be evaluated at compile time.

Currently, the TVM codegen and the AoT executor rely on TVMB(A/F)W calls to increment/decrement a pointer into a user-provided workspace buffer. By the end of this set of work, if the backend uses Unified Static Memory Planning, there should be no TVMB(A/F)W calls; rather, correct offsets into the user-provided buffer should be codegen’d for allocates that can be evaluated at compile time. Dynamically sized allocates will remain untouched and thus will be lowered as usual.

G2. The static memory planning algorithm should be changeable.

There are a variety of memory planning algorithms in discussion with different tradeoffs (see [Discussion/Alignment] Memory Planning and Optimizing TensorFlow Lite Runtime Memory — The TensorFlow Blog). Depending on the topology and schedules of intermediary buffers, the memory planning algorithm should be easily changeable. However, the current design ties the algorithm intimately to the IR constructs, making it harder to modularize or change the algorithm without inventing a whole new pass. In reality, the outcome of USMP’s algorithm is offsets within a given workspace buffer; to produce them, the algorithm should only need to know the size of each tensor and their relative liveness. Therefore, the algorithm interface to USMP should be kept simple, so that more algorithms can be added easily.
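To illustrate how small that interface can be kept, below is a hedged Python sketch (names and structure are hypothetical, not TVM APIs) of one possible algorithm: a greedy-by-size planner that sees only buffer sizes and pairwise liveness conflicts, and returns offsets into a single workspace.

```python
# Hypothetical greedy-by-size planner; only sizes and conflicts are needed.
def greedy_by_size(sizes, conflicts):
    """sizes: {buffer: bytes}; conflicts: {buffer: buffers live at the same time}."""
    offsets = {}
    # Place the largest buffers first; each goes to the lowest offset that does
    # not overlap any already-placed conflicting buffer.
    for buf in sorted(sizes, key=sizes.get, reverse=True):
        placed = sorted((offsets[c], offsets[c] + sizes[c])
                        for c in conflicts.get(buf, ()) if c in offsets)
        offset = 0
        for start, end in placed:
            if offset + sizes[buf] <= start:
                break
            offset = max(offset, end)
        offsets[buf] = offset
    workspace_size = max((offsets[b] + sizes[b] for b in sizes), default=0)
    return offsets, workspace_size

sizes = {"a": 400, "b": 400, "c": 200}
conflicts = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}  # "a" and "c" never co-live
offsets, workspace_size = greedy_by_size(sizes, conflicts)
```

Because "a" and "c" are never live together, they overlap in the workspace, and swapping in a different algorithm only means replacing this one function.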

G3. Multiple pool support (including constants)

Ideally, the user would expect to provide these buffers at the granularity of the memories they’d want to pin them to. E.g., if there are two RW memories, DRAM and SRAM, the buffers need to be identified and pooled by the compiler. Similarly, for constant data, we need a mechanism to allow the user to pin constants to appropriate memories; addresses in the IR would simply be offsets into the constant buffer(s) provided by the user.

Application Usecases

U1: Most simple use case

TVMC

tvmc compile my_model.tflite --executor=aot --output-format=mlf --target=c

Codegen’d artifacts

//Codegen'd artifacts in metadata.c (lib0.c)
const TVMModel my_model = {
   ...
   .entrypoint = &entrypoint,
};

static uint8_t workspace_buffer[WORKSPACE_BUFFER_SIZE];
static const uint8_t parameters_buffer[PARAMETERS_BUFFER_SIZE] = <compiler_generated_constant_data>;

static int32_t entrypoint(TVMInputs_my_model* inputs,
                          TVMOutputs_my_model* outputs,
                          TVMContext* context){
     return my_model_main(inputs->input0,
                          outputs->output0,
                          workspace_buffer,
                          parameters_buffer,
                          context->resource_handle);
}

// metadata.h

typedef struct {
   uint8_t* input0;
}  TVMInputs_my_model;

typedef struct {
   uint8_t* output0;
}  TVMOutputs_my_model;

User Application

// The User Application
extern const TVMModel my_model;

int main(...) {
    ...
    TVMInputs_my_model inputs = {my_data};
    TVMOutputs_my_model outputs = {output_space};
    TVMExecute(&my_model,
               &inputs,
               &outputs,
               NULL);
}

U2: User wants to share workspaces

TVMC

tvmc compile my_model_1.tflite \
  --executor=aot \
  --output-format=mlf \
  --target=accel,c \
  --with-workspace-buffer="name=sram;target=c,accel"

tvmc compile my_model_2.tflite \
  --executor=aot \
  --output-format=mlf \
  --target=accel,c \
  --with-workspace-buffer="name=sram;target=c,accel"

Codegen’d Artifacts

//Codegen'd artifacts in metadata.c (lib0.c)
const TVMModel my_model_1 = {
   ...
   .entrypoint = &entrypoint,
};

static const uint8_t parameters_buffer[PARAMETERS_BUFFER_SIZE] = <compiler_generated_constant_data>;

static int32_t entrypoint(TVMInputs_my_model_1* inputs,
                          TVMOutputs_my_model_1* outputs,
                          TVMContext* context){

   return my_model_1_main(inputs->input0,
                          outputs->output0,
                          parameters_buffer,
                          context->workspaces.sram,
                          context->resource_handle);
}

// metadata.h

#define TVM_MY_MODEL_1_SRAM_WORKSPACE_BUFFER_SIZE xxxx

typedef struct {
   uint8_t* sram;
}  TVMWorkspaces_my_model_1;

typedef struct {
   uint8_t* input0;
}  TVMInputs_my_model_1;

typedef struct {
   uint8_t* output0;
}  TVMOutputs_my_model_1;

//Codegen'd artifacts in metadata.c (lib0.c)

const TVMModel my_model_2 = {
   ...
   .entrypoint = &entrypoint,
};

static const uint8_t parameters_buffer[PARAMETERS_BUFFER_SIZE] = <compiler_generated_constant_data>;

static int32_t entrypoint(TVMInputs_my_model_2* inputs,
                          TVMOutputs_my_model_2* outputs,
                          TVMContext* context){

   return my_model_2_main(inputs->input0,
                          outputs->output0,
                          parameters_buffer,
                          context->workspaces.sram,
                          context->resource_handle);
}

// metadata.h

#define TVM_MY_MODEL_2_SRAM_WORKSPACE_BUFFER_SIZE xxxx

typedef struct {
   uint8_t* sram;
}  TVMWorkspaces_my_model_2;

typedef struct {
   uint8_t* input0;
}  TVMInputs_my_model_2;

typedef struct {
   uint8_t* output0;
}  TVMOutputs_my_model_2;

User Application

// The User Application
extern const TVMModel my_model_1;
extern const TVMModel my_model_2;

// Take the maximum of TVM_MY_MODEL_1_SRAM_WORKSPACE_BUFFER_SIZE and
// TVM_MY_MODEL_2_SRAM_WORKSPACE_BUFFER_SIZE and define it as
// TVM_MY_MODELS_COMMON_WORKSPACE_BUFFER_SIZE.
// Alternatively, the user could use malloc (if permitted and desired) to
// compute the maximum at runtime.
static uint8_t workspace_buffer[TVM_MY_MODELS_COMMON_WORKSPACE_BUFFER_SIZE];

int main(...) {
    ...
    TVMContext context;
    TVMInputs_my_model_1 inputs_1 = {my_data_1};
    TVMOutputs_my_model_1 outputs_1 = {output_space_1};
    TVMWorkspaces_my_model_1 workspaces_1 = {
        .sram = workspace_buffer,
    };
    TVMSetWorkspaces(&context, &workspaces_1);
    TVMExecute(&my_model_1, &inputs_1, &outputs_1, &context);
    ...
    TVMInputs_my_model_2 inputs_2 = {my_data_2};
    TVMOutputs_my_model_2 outputs_2 = {output_space_2};
    TVMWorkspaces_my_model_2 workspaces_2 = {
        .sram = workspace_buffer,
    };
    TVMSetWorkspaces(&context, &workspaces_2);
    TVMExecute(&my_model_2, &inputs_2, &outputs_2, &context);
    ...
}

U3: User wants to pin buffers to different memories

TVMC

# For dtcm and itcm, the size is more of a hint/guide provided to USMP
tvmc compile my_model.tflite \
  --executor=aot \
  --target=accel,c \
  --with-workspace-buffer="name=dtcm;target=c;size=1000" \
  --with-workspace-buffer="name=sram;target=c,accel" \
  --with-parameter-buffer="name=itcm;target=c;size=5000" \
  --with-parameter-buffer="name=flash;target=c,accel"

Codegen’d Artifacts

//Codegen'd artifacts in metadata.c (lib0.c)
const TVMModel my_model = {
   ...
   .entrypoint = &entrypoint,
};

static int32_t entrypoint(TVMInputs_my_model* inputs,
                          TVMOutputs_my_model* outputs,
                          TVMContext* context){

     return my_model_main(inputs->input0,
                          outputs->output0,
                          context->workspaces.dtcm,
                          context->workspaces.sram,
                          context->parameters.itcm,
                          context->parameters.flash,
                          context->resource_handle);
}

// metadata.h

#define TVM_MY_MODEL_DTCM_WORKSPACE_BUFFER_SIZE xxxx
#define TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE xxxx
#define TVM_MY_MODEL_ITCM_PARAMETER_BUFFER_SIZE xxxx
#define TVM_MY_MODEL_FLASH_PARAMETER_BUFFER_SIZE xxxx

typedef struct {
   uint8_t* dtcm;
   uint8_t* sram;
}  TVMWorkspaces_my_model;

typedef struct {
   uint8_t* itcm;
   uint8_t* flash;
}  TVMParameters_my_model;

typedef struct {
   uint8_t* input0;
}  TVMInputs_my_model;

typedef struct {
   uint8_t* output0;
}  TVMOutputs_my_model;

User Application

// The User Application
extern const TVMModel my_model;
__attribute__((section("ITCM"))) const uint8_t my_model_params_1[TVM_MY_MODEL_ITCM_PARAMETER_BUFFER_SIZE] = <param_1_data>;
__attribute__((section("FLASH"), aligned(16))) const uint8_t my_model_params_2[TVM_MY_MODEL_FLASH_PARAMETER_BUFFER_SIZE] = <param_2_data>;
__attribute__((section("DTCM"))) static uint8_t workspace_buffer_1[TVM_MY_MODEL_DTCM_WORKSPACE_BUFFER_SIZE];
__attribute__((section("SRAM"), aligned(16))) static uint8_t workspace_buffer_2[TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE];

int main(...) {
     ...
     TVMContext context;
     TVMInputs_my_model inputs = {input};
     TVMOutputs_my_model outputs = {output};
     TVMWorkspaces_my_model workspaces = {
         .dtcm = workspace_buffer_1,
         .sram = workspace_buffer_2,
     };
     TVMParameters_my_model parameters = {
         .itcm = my_model_params_1,
         .flash = my_model_params_2,
     };
     TVMSetWorkspaces(&context, &workspaces);
     TVMSetParameters(&context, &parameters);
     TVMExecute(&my_model, &inputs, &outputs, &context);
}

Proposed Implementation of the Design

Overview

This should be an IRModule (TIR) → IRModule (TIR) pass.

Inputs :

  • AoT TIR PrimFunc ( the control function describing the call graph to operators)
  • All Operator Functions
  • the maximum size for each pool

We could use “pinned_memory” (see below) to tag buffers with a suggested priority order determined by the scheduler. The idea is that USMP will try to pool them using the preferred “pinned_memory” and fall back whenever the size exceeds the user-provided max size for each pool (if any).

Outputs :

  • AoT TIR PrimFunc accepting pool buffers from the user.
  • All Operator functions accepting pool buffers.
    • Each operator function should address tensors using the correct offset in the correct pool buffer

Special Parametric Inputs :

  • function : the algorithm to be used for planning

From a component PoV, the algorithm is a special input with a defined interface.

The current proposal for the interface is as follows :

struct BufferInfo {
    Integer uid;
    Integer size_bytes;
    Integer alignment;
    Array<Integer> conflicts; // the conflicting uids of buffers
    Array<Integer> pool_candidates;
    String pool_name;
    Integer pool_offset;
}

void (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)
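As an illustration of this interface, here is a hypothetical Python stand-in for such an algorithm (the C++ types above are mirrored loosely; none of this is existing TVM code). It fills in pool_name and pool_offset for each buffer, trying pool_candidates in order and falling back when a pool's user-provided maximum size would be exceeded.

```python
# Hypothetical planner conforming to the proposed interface.
class BufferInfo:
    def __init__(self, uid, size_bytes, alignment, conflicts, pool_candidates):
        self.uid = uid
        self.size_bytes = size_bytes
        self.alignment = alignment
        self.conflicts = conflicts              # uids of simultaneously-live buffers
        self.pool_candidates = pool_candidates  # preferred pool names, best first
        self.pool_name = None
        self.pool_offset = None

def plan(buffers, pool_sizes):
    by_uid = {b.uid: b for b in buffers}
    for b in sorted(buffers, key=lambda x: x.size_bytes, reverse=True):
        for pool in b.pool_candidates:
            # Occupied ranges of conflicting buffers already placed in this pool.
            taken = sorted((by_uid[c].pool_offset,
                            by_uid[c].pool_offset + by_uid[c].size_bytes)
                           for c in b.conflicts
                           if by_uid[c].pool_name == pool)
            offset = 0
            for start, end in taken:
                if offset + b.size_bytes <= start:
                    break
                offset = -(-end // b.alignment) * b.alignment  # align up past `end`
            limit = pool_sizes.get(pool)
            if limit is None or offset + b.size_bytes <= limit:
                b.pool_name, b.pool_offset = pool, offset
                break

buffers = [
    BufferInfo(0, 800, 16, [1], ["dtcm", "sram"]),
    BufferInfo(1, 800, 16, [0], ["dtcm", "sram"]),
]
plan(buffers, {"dtcm": 1000})  # both prefer dtcm, but only one fits under 1000
```

In this example both buffers prefer the size-capped "dtcm" pool; the first one fits, and the second falls back to "sram", matching the fallback behaviour described above.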

Special Considerations :

  • tir.constants : TIR does not have the ability to represent constants, which is limiting and often leads to side-channels that carry constants between TIR compiler passes, including this one. Therefore, in this work, as a pre-requisite, we should aim to fix this by supporting tir.constants (similar to relay.constants).
    • Why do we need constants expressed in TIR ?
      • If not, they would have to be represented as inputs to the TIR main function (logic : anything that is not expressible in TIR becomes an input). In that case, we would need to associate each such Var with a special tag to indicate it is constant, along with its metadata (e.g., desired pools, alignment requirements, etc.)
  • Currently, “with” or “let” scopes are tree-structured and carry a transitive property. E.g., if tensor A is live with tensor B and tensor B is live with tensor C, then tensor A is live with tensor C, which may not always be true. Thus, the current “let” or “with” scopes are unable to express liveness information, and we’d need a side-channel to express it.
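The non-transitivity is easy to see with concrete live ranges. The following sketch (illustrative only) derives pairwise conflicts from interval overlap: A conflicts with B and B with C, yet A and C never conflict, so a planner fed this side-channel information may overlap A and C in memory, which a tree-structured scope could not express.

```python
# Illustrative derivation of pairwise liveness conflicts from live intervals.
def conflicts(intervals):
    """intervals: {name: (first_use, last_use)}; returns conflicting pairs."""
    pairs = set()
    names = sorted(intervals)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            a0, a1 = intervals[a]
            b0, b1 = intervals[b]
            if a0 <= b1 and b0 <= a1:  # the two live ranges overlap
                pairs.add((a, b))
    return pairs

live = {"A": (0, 2), "B": (2, 4), "C": (4, 6)}
pairs = conflicts(live)
# ("A", "B") and ("B", "C") conflict, but ("A", "C") does not.
```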

How should the input TIR to USMP be lowered?

Step 1 : The bound relay.const in the Relay IR should be lowered via TE → TIR as tir.constants

After Step 1 (introducing tir.constants to hold constant data), the TIR code should look as follows :

# This snippet shows the format of pre-USMP pseudo TIR code.

def main(input1: ty.handle, output1: ty.handle):
   my_model_fused_op1 = tir.allocate(..., pinned_memory=[1, 2])
   my_model_fused_op2 = tir.allocate(..., pinned_memory=[2])
   tir.call("my_model_fused_op1", input1, my_model_fused_op1, fused_op1_weights, fused_op1_biases)
   tir.call("my_model_fused_op2", my_model_fused_op1, my_model_fused_op2, fused_op2_weights, fused_op2_biases)

def my_model_fused_op1(input : ty.handle, output : ty.handle):
   tir.func_attr({"global_symbol":"my_model_fused_op1","tir.noalias": True})
   intermediate_tensor_1 = tir.allocate(..., pinned_memory=[1, 2]) # By  default they will have all possible memories
   intermediate_tensor_2 = tir.allocate(..., pinned_memory=[1, 2]) # unless scheduler removes them
   weights = tir.constant(..., pinned_memory=[0, 3])
   biases = tir.constant(..., pinned_memory=[0, 3])
   ...
   <compute>
   ...

def my_model_fused_op2(input : ty.handle, output : ty.handle):
   tir.func_attr({"global_symbol":"my_model_fused_op2", "tir.noalias": True})
   intermediate_tensor_1 = tir.allocate(..., pinned_memory=[1, 2])
   intermediate_tensor_2 = tir.allocate(..., pinned_memory=[1, 2])
   weights = tir.constant(..., pinned_memory=[0, 3])
   biases = tir.constant(..., pinned_memory=[0, 3])
   ...
   <compute>
   ...
Step 2 : Run an analysis pass to populate a Map<tir::Var, BufferInfo> that contains buffer information as defined above (see the struct BufferInfo).
Step 3 : Use the populated Map<tir::Var, BufferInfo> to generate the Array<BufferInfo> buffers and the Map<String, Integer> pool_sizes.
Step 4 : Call the provided/default algorithm (void (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)) to populate pool_name and pool_offset.
Step 5 : Use the updated Map<tir::Var, BufferInfo> (with pool_name and pool_offset) to mutate the IR, which would result in the following :

# This snippet shows the format of post-USMP pseudo TIR code.

def main(input1: ty.handle, output1: ty.handle, params_1 : ty.handle, params_2 : ty.handle, workspace_1 : ty.handle, workspace_2 : ty.handle):
   tir.call("my_model_fused_op1", input1, params_1, params_2, workspace_1, workspace_2)
   tir.call("my_model_fused_op2", params_1, params_2, workspace_1, workspace_2)

def my_model_fused_op1(input, params_1, params_2, workspace_1, workspace_2):
   tir.func_attr({"global_symbol":"my_model_fused_op1","tir.noalias":True})
   intermediate_tensor_1=tir.load("uint8", workspace_1.data, <offset>)
   intermediate_tensor_2=tir.load("uint8", workspace_1.data, <offset>)
   output=tir.load("uint8", workspace_1.data, <offset>)
   weights=tir.load("uint8", params_1.data, <offset>)
   biases=tir.load("uint8", params_1.data, <offset>)
   ...
   <compute>
   ...

def my_model_fused_op2(params_1, params_2, workspace_1, workspace_2):
   tir.func_attr({"global_symbol":"my_model_fused_op2","tir.noalias":True})
   input=tir.load("uint8", workspace_1.data, <offset>)
   intermediate_tensor_1=tir.load("uint8", workspace_1.data, <offset>)
   intermediate_tensor_2=tir.load("uint8", workspace_2.data, <offset>)
   output=tir.load("uint8", workspace_2.data, <offset>)
   weights=tir.load("uint8", params_1.data, <offset>)
   biases=tir.load("uint8", params_2.data, <offset>)
   ...
   <compute>
   ...

Discussion

We would like to hear general feedback on the proposed design, in addition to anticipated discussions as follows :

TVMC command line options

T1. --with-parameter-buffer & --with-workspace-buffer

T2. --with-constant-buffer & --with-workspace-buffer

T3. --parameter-buffer & --workspace-buffer

T4. --constant-buffer & --workspace-buffer

We would like to hear the community’s opinion on how to present constant buffers (from TVM’s PoV) or model parameter buffers (from a user’s PoV).

cc : @areusch @Mousius @giuseros @matt-arm @r.stahl @stoa @tgall_foo @leandron @ramana-arm


hi @manupa-arm, thanks for posting this! there’s a lot to unpack here.

I think we can break the work here into two parts:

P1. Implementing the unified memory planner based on information in TIR

P2. Modifying the codegen/output to implement various compiler optimizations based on P1.

I think that the debate around P1 is likely to center around the “how,” whereas the debate around P2 is likely to center around the “what.”

Modeling the whole program in TIR

So far the AOT effort has made some initial effort here by creating a top-level TIR function which describes the top-level model. One open question related to this RFC is: how should we structure the compiler around this top-level program? In general, we have a couple of options:

S1. Place everything in TIR, and implement post-scheduling transforms as compiler passes. In the S1 world, any computed information e.g. memory placement for buffers would need to live in TIR. In this world, we should strive to avoid side-channel information carried outside of TIR.

S2. Keep with the piecewise representation, and build separate data structures to encapsulate compiler outputs from post-schedule passes e.g. memory planning.

I think currently @jroesch and @csullivan support S1 (see PR 7518, which my understanding says is still being worked on but which is often merge-conflicted). I also support this if it’s feasible to do so under all executors. I think the drawback is that non-AOT executors will need to run these passes, but the advantage is that it provides a clear framework under which we can consolidate post-scheduling whole-program modeling for both AOT and non-AOT use cases. Should we consider superseding VM executor with AOT in the future, it also provides a more natural pathway. I’m curious as to your opinions on this?

I bring this up because I think a lot of questions raised here and elsewhere in the proposal can likely be decided based on how we decide this general design pattern.

Inline questions

A couple other questions:

static int32_t entrypoint(TVMInputs_my_model* inputs, 
                          TVMOutputs_my_model* outputs,
                          TVMContext* context){

Just to confirm: would TVMContext also be generated, e.g. TVMContext_my_model?

Inputs :

  • AoT TIR PrimFunc ( the control function describing the call graph to operators)
  • All Operator Functions
  • the maximum size for each pool

We could use “pinned_memory” (see below) to tag buffers with a suggested priority order determined by the scheduler. The idea is that USMP will try to pool them using the preferred “pinned_memory” and fall back whenever the size exceeds the user-provided max size for each pool (if any).

Outputs :

  • AoT TIR PrimFunc accepting pool buffers from the user.
  • All Operator functions accepting pool buffers.
    • Each operator function should address tensors using the correct offset in the correct pool buffer

I’m not certain the memory planner should necessarily encode all vars as buffer offsets–doing so could limit e.g. dynamic use cases, which may either a) need to express offsets as runtime-evaluated expressions or b) need to entirely defer such allocations to runtime, should it be impossible to pre-define such expressions.

This gets at my separation of concerns above–it would be nice to either

  1. use the TIR-agnostic I/O format as a way to store the memory planner output and then inform further TIR modifications (e.g. either making everything buffer offsets when possible, passing those offsets in as positional arguments, or keeping TVMBAW for dynamic allocs)
  2. represent that abstract output as e.g. TIR attributes and perform any of the aforementioned optimizations by examining TIR attributes

The current proposal for the interface is as follows :

struct BufferInfo {
    Integer uid;
    Integer size_bytes;
    Integer alignment;
    Array<Integer> conflicts; // the conflicting uids of buffers
    Array<Integer> pool_candidates;
    Integer pool_id;
    Integer pool_offset;
}

void (*foo)(Array<BufferInfo> buffers, Map<Integer, Integer> pool_sizes)

In the tvmc command above, memory pools were identified by name. Any reason to translate to integers here?

Special Considerations :

Let’s discuss these after resolving S1/S2 debate above.

cc @tqchen @junrushao1994 f you have comments on representing this in TIR

Hi @areusch

Thanks for taking time to read this!.

Yes, I do generally support the idea of having the whole program lowered to TIR. I’m not sure about the VM and how important “static” memory planning is for the VM. I think going forward the graph executor might be able to load a packed function of the tvm_main instead of the JSON; it’ll be less confusing than how the graph executor runtime is positioned today, which is more of a (very thin, as it’s supposed to be :slight_smile: ) middleware that connects the graph JSON and the compiled operator library.

Having said that, I can see this work enabling a path (to extend) towards that, though we only plan to create the USMP component as a TIR IRModule → TIR IRModule pass, which we will initially test and support for the AoT executor.

Here I am referencing what is being discussed here : [RFC] [uTVM] Embedded C Runtime Interface - #6 by Mousius. I think it’s better to reach an agreement there. Here I’m trying to illustrate and motivate the design using the APIs.

Yes, USMP will only touch tir.allocates that can be evaluated at compile time; those will be translated to offsets. We could just leave the rest untouched as TVMBAWs that are ultimately handled in the runtime (using malloc or the stack allocator). I think that’s the only thing we require. Do we miss anything here? I’ll adjust the original text to reflect this.

No reason :slight_smile: . Yes we could use the names – so its more clear.

hi @manupa-arm,

I think going forward the graph executor might be able to load a packed function of the tvm_main instead of the JSON; it’ll be less confusing than how the graph executor runtime is positioned today, which is more of a (very thin, as it’s supposed to be :slight_smile: ) middleware that connects the graph JSON and the compiled operator library.

Could you say more about how this proposal relates to GraphPlanMemory? I’m wondering if this proposal aims to modify GraphPlanMemory (e.g. to generate the same BufferInfo and then use a similarly pluggable planning function)?

Here I am referencing what is being discussed here : [RFC] [uTVM] Embedded C Runtime Interface - #6 by Mousius. I think its better to reach an agreement there. Here Im trying to illustrate and motivate the design using the APIs.

Ack, will follow-up there.

Yes, USMP will only touch tir.allocates that can be evaluated at compile time; those will be translated to offsets. We could just leave the rest untouched as TVMBAWs that are ultimately handled in the runtime (using malloc or the stack allocator). I think that’s the only thing we require. Do we miss anything here? I’ll adjust the original text to reflect this.

I think this makes sense to me.

Step 3 : Use the populated Map<tir::Var, BufferInfo> to generate the Array<BufferInfo> buffers and the Map<String, Integer> pool_sizes.
Step 4 : Call the provided/default algorithm (void (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)) to populate pool_name and pool_offset.

Could you clarify Step 3 here? How do we pass BufferInfo to the provided planning algorithm? Can it be a PackedFunc?

T1. --with-parameter-buffer & --with-workspace-buffer

T2. --with-constant-buffer & --with-workspace-buffer

T3. --parameter-buffer & --workspace-buffer

T4. --constant-buffer & --workspace-buffer

Could you give some more information about the arguments to these parameters?

Hi @areusch

This proposal aims to introduce a TIR → TIR pass, as illustrated above, which translates pre-USMP TIR to post-USMP TIR. Therefore, we are not planning to modify GraphPlanMemory. I think once we express the tvm_main in TIR, the memory planning can be done in TIR, simply because tvm_main can be expressed in TIR without needing to carry additional artifacts such as storage ids.

Our current thinking is that storage_ids could be captured as tir.allocates. Those that are statically computable will be folded into a workspace buffer, while the rest will be left to runtime allocation.

Yes, this is something defined a bit loosely in this proposal. Yes, it could be a PackedFunc; however, I’d imagine the memory planning algorithm to be compute-heavy, and it would need to be performant. Therefore, we are leaning towards having something like TVM_REGISTER_PASS_CONFIG_OPTION accept a String to choose the algorithm, while providing a default. In the pass, we could maintain a String-to-C++-function-pointer map. WDYT?

Sure, actually there are two orthogonal main choices here (it’s just that the combinations make them four :slight_smile: ). Moreover, feel free to suggest additional options as well.

First, the pooled constants could be seen as a constant buffer from the perspective of TVM, as they are not changed throughout the inference. Therefore, we could call them constant buffers. However, from a user’s perspective, they are somewhat the parameters of the model (though the user cannot change them without re-running compilation). Thus, that is the argument for them to be “parameter” buffers.

Second, the usage of “--with” : this is something that came up in our internal discussions of the API design, simply because it felt natural to say that the inference runs “with” those buffers.

Therefore, we would like to hear what the community thinks.

cc : @Mousius @leandron

hi @manupa-arm,

This proposal aims to introduce TIR → TIR pass as illustrated above which translates pre-USMP TIR to post-USMP TIR – eventually. Therefore, we are not planning to modify GraphPlanMemory.

Ok—when you say “load a packed function of the tvm_main instead of json,” do you mean simply that GraphExecutor#run could just call tvm_main? If we make GraphExecutor effectively consume the results of this interface, seems like that would effectively change SetupStorage to issue basically 3 (or maybe a few more) allocate calls:

  1. for the input data (optional)
  2. for the CPU workspace pool
  3. for the output data

there could be additional calls if there are additional e.g. accelerator buffers

I think such a proposal might work to unify the memory planning around this AoT-based approach, but there are some cases which might mean we need to relax this proposal a bit–for instance, the part about passing only the memory pools to operators. it may be that in order to support overriding parameters at runtime (which GraphExecutor currently allows), we need to keep with passing individual function arguments, but these can be arranged (by AOT or GraphExecutor) to merely be offsets into the memory pools (or then be overridden to user-supplied tensors).

Yes, this is something a bit loosely defined in this proposal. Yes, it could be a PackedFunc – however, I’d imagine we would assume the memory planning algorithm to be compute heavy and would require to be performant. Therefore, we are inclining towards having something like TVM_REGISTER_PASS_CONFIG_OPTION to accept a String to choose the algorithm while providing a default. In the pass, we could maintain a String to C++ function ptr map. WDYT ?

I think that we should use PackedFunc where we would like to provide pluggable infrastructure. It should still be possible to provide compute-optimized versions in c++. And it’s still possible to implement registries with prefixes to function names e.g. relay.memory.usmp.

Sure, actually there are two orthogonal main choices here (its just combinations made them to be 4 :slight_smile: ). Moreover, feel free to suggest additional options as well.

I’m more wondering what the arguments to such operators might be–name:key=value type of thing to support attributes on memory pools, or ?

As we get closer to consensus on this, can you write it up as a pull request for the TVM RFCs repository @manupa-arm.


Hi @areusch ,

That broadly aligns with our thinking.

The proposed USMP’s actual “component” interface will be quite similar to the TVMC CLI additions. Therefore, the graph executor flow could use “--with-parameter-buffer” to make USMP expose the parameter buffer to the actual executor runtime, so that the executor could update constants at known offsets.

Regarding specific parameter updates : since the Relay pipeline runs passes such as FoldConstants, would it be safe to do specific parameter updates? Anyway, if that’s the case, we could use the same mechanism it uses to know which parameters to update, using offsets instead.

Ack, yes maybe we should not limit this in the design.

I see, I think you are querying about attributes of the pools itself.

Initially, we are starting with “name” and “target” to identify the pool uniquely and which targets could access them, respectively.

However, going forward we are going to provide a guide “size” for the buffer to be used, which we could use to distribute tensors (when there are options) based on memory pressure; hence the guide.

Going a bit further out, we are planning to append more metadata such as “bandwidth” for the buffers, to be used by the scheduler to redact pools based on where it wants tensors placed (in some cases we might use double-buffering or rolling buffers via scheduling primitives); that goes hand-in-hand with the performance required.

The proposed USMP’s actual “component” interface will be quite similar to the TVMC CLI additions. Therefore, the graph executor flow could use “--with-parameter-buffer” to make USMP expose the parameter buffer to the actual executor runtime, so that the executor could update constants at known offsets.

I think this makes sense to me–e.g. codegen a mapping table from input id to (pool, offset) and supply this to the executor?

Regarding specific parameter updates : since the Relay pipeline runs passes such as FoldConstants, would it be safe to do specific parameter updates?

Certainly any parameter updates would need to be done in coordination with the compiler passes. We probably need to define additional API functions to allow people to “re-lower” parameters.

I see, I think you are querying about attributes of the pools itself.

Moreover, I’m wondering what those attributes would be, and how they might be encoded. The attributes you mentioned mostly make sense. For size, I was expecting the user (or SDK) to specify a max size for each memory pool, and then TVM would output the number of bytes actually used for each pool.

It would be interesting to explore whether bandwidth could be determined automatically (e.g. by measuring elapsed copy time) and then used as an input to a cost function which may inform TVM’s decision to offload a particular computation. It’s possible this could be then tied to a measured runtime quantity (e.g. time blocked on memory copy) and then used to validate TVM’s choice of schedule.