[RFC][uTVM] Query intermediary workspace requirement

Motivation

Currently, the TVM codegen’d artifact has no way to express the intermediary memory it requires to perform inference. This RFC proposes an initial step toward fixing that by enhancing compilation artifacts so they can be queried for the intermediary workspace/scratch size required to execute (fused) operators / sub-regions of the ML model.

This aligns with the goals of the uTVM roadmap ([µTVM] microTVM M2 Roadmap) to query the memory required by the intermediates.

It is important that the flow also works for BYOC codegens, so that they can specify the memory required by their intermediates.

Proposal

The plan is to add the following new functions and data structures so that we can query the workspace size required by each fused-operator subgraph (a.k.a. fused operators – these live under a single runtime.Module) and each BYOC subgraph (there would be a runtime.Module per subgraph). It is important to note that we are currently looking to implement this functionality for the “c” and “llvm” targets that are supported by uTVM. The plan is as follows:

  1. Introduce calculate_workspace_size() function to tir.analysis

We plan to introduce an analysis that traverses a tir.PrimFunc’s body, calculates the maximum workspace in bytes it requires, and returns it.
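
As a rough sketch of what that analysis could look like (the names below are illustrative only; the exact API will be settled in the PR), a TIR StmtVisitor can walk the body and add up the constant-sized tir.allocate nodes. A conservative first version can simply sum all allocations; a more precise version would account for liveness/scoping to report the true maximum.

#include <tvm/tir/function.h>
#include <tvm/tir/stmt.h>
#include <tvm/tir/stmt_functor.h>

namespace tvm {
namespace tir {

// Illustrative sketch only: sums the bytes of constant-shaped tir.allocate
// nodes in a PrimFunc body as a conservative workspace upper bound.
class WorkspaceSizeCalculator : public StmtVisitor {
 public:
  size_t workspace_bytes = 0;

  void VisitStmt_(const AllocateNode* op) final {
    // Only constant extents can be sized at compile time; a real pass would
    // need to diagnose or reject dynamic extents.
    int64_t num_elements = 1;
    for (const PrimExpr& extent : op->extents) {
      if (const auto* imm = extent.as<IntImmNode>()) {
        num_elements *= imm->value;
      }
    }
    workspace_bytes += num_elements * op->dtype.bytes() * op->dtype.lanes();
    StmtVisitor::VisitStmt_(op);
  }
};

// Hypothetical entry point of the proposed analysis.
size_t CalculateWorkspaceSize(const PrimFunc& func) {
  WorkspaceSizeCalculator calc;
  calc(func->body);
  return calc.workspace_bytes;
}

}  // namespace tir
}  // namespace tvm

Exposing something along these lines through tir.analysis would let both the default lowering flow and the MLF export query a PrimFunc’s requirement without mutating the IR.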

  2. Modify LLVM and C codegen to provide workspace requirements.

The goal here is to query every PrimFunc for the intermediary workspace each (fused) operator requires and make that information available in the compiled artifacts.

How would this feature be used?

  • We can use this to generate an entry in the memory map of Model Library Format to indicate the intermediary activation pool size. This could be calculated as the maximum of the workspace sizes required by the AoT TIR PrimFunc, fused_operator_1, … , fused_operator_n, external_func_1, …, external_func_n if they are run sequentially.
  • We will add an option to use this functionality to generate a static memory buffer (placed in .data) per model, to be used by the stack allocator present in the AoT executor, once the AoT PR ([AOT] Introducing AOT in TVM by giuseros · Pull Request #7785 · apache/tvm · GitHub) lands.

Discussion:

Where should the calculated workspace live?

B1) runtime.Module: it could have a member containing the workspace it requires to execute the fused operators inside it.

This allows BYOC codegens to specify the size as well, as part of the codegen, because the workspace size is a compilation artifact. Therefore this option requires only a change to runtime.Module to carry a member indicating the workspace requirement.

However, this option assumes the functions inside the runtime.Module will run sequentially, and it does not align well with the fact that the module is more of a “library” of (subgraphs of) operators.

B2) runtime.Module: it could maintain a map of (function → workspace_size) that is accessed via module.get_workspace_size(func_name).

This allows BYOC codegens to specify the size for each function as well, as part of the codegen, because the workspace sizes for each function are compilation artifacts. Therefore this option requires only a change to runtime.Module to carry a map indicating the workspace requirement of each function.

This option opens the door for each Executor (Graph/AoT) to consume per-function workspace information to calculate the overall workspace requirement, allowing parallel execution of operators if required. Our preference is for this option, but we would like to hear what the community thinks.
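
To make B2 a bit more concrete, here is a minimal sketch of the kind of per-function map and accessor we have in mind; the type and member names are purely illustrative and are not the actual runtime::Module interface.

#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical sketch of option B2: a per-function workspace map carried
// alongside the functions of a module.
struct FunctionWorkspaceMap {
  std::unordered_map<std::string, size_t> workspace_sizes;

  // Returns the workspace requirement in bytes of a (fused) function, or 0 if
  // the function is unknown / needs no intermediary memory.
  size_t GetWorkspaceSize(const std::string& func_name) const {
    auto it = workspace_sizes.find(func_name);
    return it == workspace_sizes.end() ? 0 : it->second;
  }
};

In practice this map would be populated at codegen time (by the TVM codegens or a BYOC codegen) and carried by the module it describes.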

B3) Maintain a map in LoweredOutput: relay primitive Function expr → workspace_size (to be passed along BuildOutput → (Graph/AoT)ExecutorFactoryModule → export_model_library).

The current BYOC compilation is relay IRModule → runtime.Module, and we might need to change it to give two outputs: relay IRModule → runtime.Module, workspace_size (optional). This option then requires plumbing this map through to the export of Model Library Format. We think that route will look like LoweredOutput → BuildOutput → (Graph/AoT)ExecutorFactoryModule → export_model_library.

Therefore, this option does not change runtime.Module (to carry an integer indicating the workspace size or a func → workspace map), but we would need to change all the interfaces above.

cc: @areusch @tqchen @giuseros @zhiics @mbaret


Hi @manupa-arm,

Thanks for bringing this topic up! I agree adding a way to express workspace_size requirements would help us with global memory planning.

It seems like we can group the possible ways a relay backend could create TIR into these options:

G1. A PrimFunc that contains TIR that performs the computation → the workspace_size can be directly calculated by traversing the PrimFunc to look for tir.allocate nodes.

G2. (e.g. BYOC) A PrimFunc which calls an extern function; the extern function uses a workspace buffer.

G3. A PrimFunc executed on CPU but which launches compute async-style (i.e. through call_extern) on another DLDevice. Another PrimFunc serves as the synchronization barrier (i.e. awaits completion of that operation).

I don’t think we’re concerned with case G3 right now, but want to consider it to make the point that the workspace buffer’s lifecycle is a property of the computation being done, which right now happens to be tied to the single PrimFunc implementing that computation.

I think right now, using the PrimFunc name as a way to lookup workspace_size makes sense.

Now the question of where this data should be exposed. I’m wondering whether this information would be used outside an overhauled, global version of GraphPlanMemory. Could you speak to how you want to consume this at runtime? It seems to me that the Executor (graph or AoT) needs to ingest a map whose keys are e.g. tensor_id for both workspace tensors and activations and whose values are some addressing information that tells the Executor where to store that data. That addressing information may need to be somewhat specific to the memory model in use at runtime (e.g. purely dynamic vs “pinned”).

If we do need to provide the data to the Executor, I’m concerned with the following situation: theoretically it’s possible to return multiple runtime.Module from a BYOC by importing the others into the root Module. When looking up the PrimFunc, GetFunction handles searching through this module tree. When implementing this metadata lookup, we’d probably have to reimplement this logic to handle cases when multiple Module are returned. I think that both for this reason and since runtime.Module is supposed to be a sort of high-level, runtime-focused interface, it might make more sense to consider consolidating this information into a single map inside the compiler and placing it into e.g. MetadataModule or similar. On the other hand, I do see that workspace_size is effectively a runtime concept, so there’s an argument it belongs in runtime.Module.

I guess this has me favoring some form of B3 if we go this route.

Another consideration to make here is whether extern functions can call TVMBackendAllocWorkspace. If so, we have to support this interface to provide “dynamic” memory allocation (even if we add parameters–a tensor_id–to that API to allow the underlying implementation to be a lookup table). If not, then we can decouple the memory-pool allocation strategy from the runtime interface by making more aggressive optimizations in the TIR (sketched after the list below):

  • Presuming that each memory pool supplied by the user has a global symbol.
  • Calculating the address with pointer arithmetic relative to that global symbol.
  • Replacing tir.allocate nodes with direct references to tensors based on this computation.
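
For illustration, the generated code under this strategy could end up looking roughly like the sketch below; the pool symbol, function name and offsets are hypothetical and would come from the memory planner.

#include <cstdint>

// Memory pool supplied by the application / linker script; its size would be
// decided by the memory planner at compile time (symbol name illustrative).
extern uint8_t tvmgen_workspace_pool[];

extern "C" int32_t fused_conv2d_relu(const float* input, float* output) {
  // Workspace tensors become fixed offsets into the pool (chosen by the
  // planner) instead of TVMBackendAllocWorkspace calls.
  float* pad_temp = reinterpret_cast<float*>(tvmgen_workspace_pool + 0);
  float* conv_out = reinterpret_cast<float*>(tvmgen_workspace_pool + 4096);
  // ... computation using pad_temp and conv_out ...
  (void)input; (void)output; (void)pad_temp; (void)conv_out;
  return 0;
}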

This strategy may have a few rough edges to work out wrt concurrent model execution, however. If we do adopt it, it implies a few things:

  • it removes the need to support memory-pool style allocation in TVMBackendAllocWorkspace and thereby avoids an API change
  • it should be faster
  • it confines problems like this to the compiler and means the executor doesn’t need to consume any sort of data structure downstream
  • it means that extern functions can only leak memory in the tensors they’re provided (e.g. TVMBackendAllocWorkspace could be unimplemented when using memory pool-style allocations).

Hi @areusch,

Sorry for the delay in getting back, a lot is going on these days :slight_smile:

I think in the long run we’d need to expose the memory required at a “pool” level rather than at the tensor or storage_id level to the user. For starters, we can aim to generate the memory required by the pool of intermediary tensors in MLF. For AoT, I feel we need an interface to set the intermediary workspace pool, possibly using a set_workspace() API in the runtime shim, and in the AoT TIR main_func we need to pass it in as an argument – @giuseros. This provides the ability to dynamically or statically allocate the data in the main application. I’m not sure how that plays out with the graph executor, nor should it matter there, given that in the long run it will be used for tuning. I might be missing some scenarios here, so feel free to enlighten me.

Yes, this is the hard bit of the problem :slight_smile:. The crux of it is that runtime.Module is multi-faceted: it starts as a compile-time concept and gets exported into a runtime concept. I might not fully understand how “pure” we need to keep the runtime.Module and the motivations there, i.e., is it acceptable to shed some information when we export the compile-time runtime.Module? However, the fact remains that the workspace size is closely coupled to the “library” of operators, which is the runtime.Module.

Depending on the answer and reasoning, we might need one of the following:

C1) Yes – we can have the metadata stored in the runtime.Module and accessible until exportation.

C2) No – then I’d say we need a wrapper class to hold the runtime.Module in the compiler to carry the metadata, rather than expanding the current APIs in the build pipeline. Thereafter, the executor_codegens will shed the metadata. WDYT?

I think we need to hoist the allocations to the main TIR func and perform the unified global memory planning there (I think we should not run StorageRewrite inside PrimFuncs; rather, the allocations should be hoisted to main to avoid getting stuck in local optima). So we were keen on using StorageRewrite here vs GraphPlanMemory – though they seem more or less the same, the difference being whether they act on Relay vs TIR. The unification ability should be defined per backend: all the internal tensors provided separately vs grouping them into pools based on the backend on which they are run. I think we need to discuss how to do unified memory planning in a separate RFC.

I think the scope of this RFC is to compute the workspace requirement on a per-PrimFunc basis and enable the AoT executor to function without the user having to “guess” the memory buffer size. I think the current stack allocation mechanism used in the AoT executor will work fine with the calculated numbers.

I think in the long run we’d need to expose the memory required at a “pool” level rather than at the tensor or storage_id level to the user.

I agree with this. I think that the memory planner should ingest e.g. a list of available named or id’d memory pools and their sizes, and output the same list with the sizes adjusted to match the required memory. I would also advocate for id’ing each tensor used in the computation and outputting an ID → (pool_id, offset) map, even if it is not required at runtime. Such a map could be useful for debugging.

GraphExecutor currently ingests a set of storage_id, which describe shared buffers occupied at disjoint times by various tensors. The tensors are identified by node_id, which is an index into the “nodes” key in Graph JSON of the tensor’s producer. It seems like the changes would be:

  1. broadening storage_id to (pool_id, offset)
  2. defining an API to provide GraphExecutor with the base addresses of the memory pools

so I think it is quite compatible.
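
As a purely hypothetical illustration of those two changes (the real GraphExecutor types and entry points would differ), the addressing information and the pool-registration API could be shaped like:

#include <cstddef>
#include <cstdint>

// Change 1: broaden a plain storage_id into a (pool_id, offset) pair.
struct StorageRef {
  int32_t pool_id;  // index into the list of planner/user-provided memory pools
  size_t offset;    // byte offset of the tensor within that pool
};

// Change 2: an API for the application to hand the executor the base address
// of each memory pool (hypothetical signature).
void GraphExecutorSetPoolBaseAddress(int32_t pool_id, uint8_t* base);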

The crux of it is that runtime.Module is multi-faceted: it starts as a compile-time concept and gets exported into a runtime concept

This is a huge problem and one I want to solve, but it’s very tricky. I’d like to eventually move the tvm.relay.build output to a collection of “ready-to-save” POD types e.g. named Artifact. And then define an explicit load process which is easy to replicate in firmware. Given runtime.Module mostly defines runtime-facing functions, I think we should avoid adding compiler-facing stuff to it. But, there are a lot of pieces touched by that so I don’t think we can block memory planning on waiting for Artifact.

I might not fully understand how “pure” we need to keep the runtime.Module and the motivations there, i.e., is it acceptable to shed some information when we export the compile-time runtime.Module?

Do you mean: could we place workspace information on a Module and then not write to disk in save()? I think I’d prefer to keep the workspace information outside of Module and instead modify the BYOC interface to return compiler-facing data structures. Then, BYOC could return a mapping of function name -> [(tensor_id, bytes_needed)].
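
For illustration, the compiler-facing data returned by a BYOC codegen could then be shaped roughly as below; the type names are hypothetical and the exact structure would be agreed in the PR.

#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One workspace tensor required by an externally-compiled function.
struct WorkspaceTensorInfo {
  int64_t tensor_id;    // id assigned by the codegen / planner
  size_t bytes_needed;  // size of the workspace tensor in bytes
};

// BYOC output: function name -> the workspace tensors that function needs.
using ByocWorkspaceMap = std::map<std::string, std::vector<WorkspaceTensorInfo>>;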

So I guess that’s C2 from your suggestions.

A wrinkle is whether or not all memory pools will be treated equally–likely not, if some pools are intended to hold inputs to e.g. accelerators. In that case, we may need to assign a pool_type string to each memory pool, and allow accelerators to specify that for each tensor_id as part of this mapping.

I think we need to hoist the allocations to the main TIR func and perform the unified global memory planning there (I think we should not run StorageRewrite inside PrimFuncs; rather, the allocations should be hoisted to main to avoid getting stuck in local optima).

Okay, given that, the other option to what I stated above is to simply require BYOC-generated code to accept workspace tensors as arguments.

I think we need to discuss how to do unified memory planning in a separate RFC.

Ack. The main question I have though is: if we are just going to hoist tensors out of operator implementations, why do we need a way to look up PrimFunc workspace size? Can’t we just get that by looking at the arguments?

We think they will get hoisted to the main TIR PrimFunc, which is also a PrimFunc – then there is an argument about whether the unified planner reduces that into pools later. So this piece of work is an interim solution, until we have the comprehensive planner (which we think is a bit bigger amount of work), to enhance the user experience by not having to guess the workspace buffer size that will be used by TVMBAW (TVMBackendAllocWorkspace) calls.

I’ll answer the other questions tomorrow.

Okay that makes sense. I think if we’re able to hoist the workspace allocations as part of this interim solution, then it seems like they should be visible already to GraphPlanMemory, and then easy to include in a size estimate. So just curious whether you view hoisting workspace allocations as part of the comprehensive planner work, or as something we’d do more immediately?

-Andrew

We think the immediate concern is getting a workspace size that works with the current TVMBAW calls (with the stack allocator) – and not touching the codegen of operators immediately. I think we can discuss and plan that work (hoisting of internal tir.allocates) after we land some AoT functionality in the codebase.

We are currently looking at supporting two use cases in the short term (to work with TVMBAW calls). To illustrate the proposal, see the following (the actual APIs might change based on what we agree upon in the AoT PR/RFC, but this is the intention):

  1. The user sets the workspace in the main.c application manually, by looking at a special entry in the memory map of MLF indicating the workspace size required.

metadata.c

    ...
    const tvm_model_t network = { .run_func = ... , ... };

main.c

    static uint8_t workspace_buffer[WORKSPACE_SIZE];
    extern tvm_model_t network;
    tvm_workspace_t app_workspace;

    int main(...) {
        ...
        StackMemoryManager_Init(&app_workspace, workspace_buffer, WORKSPACE_SIZE);
        tvm_runtime_run(&network, inputs, outputs);
    }
  2. Codegening (optionally) a workspace buffer into .data along with tvm_model_t.

metadata.c

    ....
    static uint8_t workspace_buffer[WORKSPACE_SIZE];
    const tvm_model_t network = { .run_func = ... , ... , .workspace_buffer = &workspace_buffer, .workspace_size = WORKSPACE_SIZE };

main.c

    extern tvm_model_t network;
    tvm_workspace_t app_workspace;

    int main(...) {
        ...
        StackMemoryManager_Init(&app_workspace, network.workspace_buffer, network.workspace_size);
        tvm_runtime_run(&network, inputs, outputs);
    }

Okay makes sense, I understand that prioritization. I agree modifying the signature of all codegen’d code is going to be invasive and will probably take time.

In that case, then, maybe this would be a possible plan?

  1. build a compiler pass that extracts all tir.allocate nodes in a PrimFunc. I believe @jwfromm has this on a branch somewhere.
  2. modify the return value (or otherwise provide a new output getter) of all codegens to return a new top-level data structure describing the workspace tensors used and the DLContext which uses them.
  3. For non-BYOC codegen, the implementation may just be invoking the compiler pass on the supplied PrimFuncs, or if needed each codegen could tweak this.
  4. In AOT codegen, consume the top-level data structure as an input, and produce the global workspace as you describe.

Then moving forwards, two longer projects:

  1. hoist workspace allocations out of operator implementations and (probably) remove the ability for non-BYOC codegen to explicitly provide workspace information.
  2. use the compiler pass in the top-level TIR function to do workspace size planning.

Thoughts?

cc @csullivan @jroesch

Hi @areusch ,

Yes, that sounds similar to what we were planning to do.

Immediate plan:

1.) We can use this: [TIR] An analysis pass to calculate workspace size for primfuncs by manupa-arm · Pull Request #7859 · apache/tvm · GitHub – we don’t need to “extract” (or mutate the IR); we just need an analysis pass rather than a transformation pass. We can think about the right mutation in longer project 1.

2.) I am currently thinking of the following:

struct FunctionInfo {
    ....
    size_t workspace_size;
    DLDevice dev;
};

class LoweredOutput {
  Map<String, IRModule> lowered_funcs;
  Array<tvm::runtime::Module> external_mods;
  Map<relay::Function, FunctionInfo> function_metadata;  // (new)
  std::unordered_map<std::string, std::pair<int, const tvm::runtime::NDArray>> params;
};

We can use FunctionInfo and the ExtractFuncInfo function, OR we can invent a new data structure to hold the info. At the moment, we think we only need to hold information related to a PrimFunc.

3.) So the compile_engine (or whatever name it gets after the refactor) could update function_metadata by running the compiler pass for non-BYOC targets. Similarly, we could modify the BYOC codegens to return the workspace size required to execute the relay::Function as compiled via the BYOC codegen.

4.) That is right. We just need to consume the function_metadata in AoT codegen to generate the stub in the metadata module.

Longer projects :

1.) Yes, we should do this at a higher level via hoisting.

2.) Agreed


We can use FunctionInfo and the ExtractFuncInfo function, OR we can invent a new data structure to hold the info. At the moment, we think we only need to hold information related to a PrimFunc.

one question with this: if we just report aggregate statistics per-function, then functions which use multiple workspaces may need to perform a sort of “mini-memory-planning” (e.g. mapping each internally-used workspace to an offset within the passed-in block) until we hoist those tensors into the top-level TIR function. How do you plan to handle this?

struct FunctionInfo {
    ....
    size_t workspace_size;
    DLDevice dev;
};

class LoweredOutput {
  Map<String, IRModule> lowered_funcs;
  Array<tvm::runtime::Module> external_mods;
  Map<relay::Function, FunctionInfo> function_metadata;  // (new)
  std::unordered_map<std::string, std::pair<int, const tvm::runtime::NDArray>> params;
};

this seems like a good first pass at this, though it might be wise to key function_metadata by a string and make FunctionInfo a Node subclass so it can be passed through the FFI.
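
A minimal sketch of an FFI-passable FunctionInfo, assuming the usual TVM object conventions (the type key and field set here are illustrative only):

#include <tvm/node/node.h>
#include <tvm/runtime/object.h>

namespace tvm {

class FunctionInfoNode : public Object {
 public:
  // Workspace requirement of the function in bytes.
  int64_t workspace_size;

  void VisitAttrs(AttrVisitor* v) { v->Visit("workspace_size", &workspace_size); }

  static constexpr const char* _type_key = "FunctionInfo";
  TVM_DECLARE_FINAL_OBJECT_INFO(FunctionInfoNode, Object);
};

class FunctionInfo : public ObjectRef {
 public:
  TVM_DEFINE_OBJECT_REF_METHODS(FunctionInfo, ObjectRef, FunctionInfoNode);
};

}  // namespace tvm

function_metadata could then be a Map<String, FunctionInfo> keyed by the generated function name.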

one question with this: if we just report aggregate statistics per-function, then functions which use multiple workspaces may need to perform a sort of “mini-memory-planning” (e.g. mapping each internally-used workspace to an offset within the passed-in block) until we hoist those tensors into the top-level TIR function. How do you plan to handle this?

Currently all the workspace tensors are serviced by TVMBackendAllocWorkspace calls, and each PrimFunc sort of does its own mini-memory planning when StorageRewrite runs on it. Therefore, until we hoist / mutate the IR, the total workspace size is sufficient to be serviced by TVMBackendAllocWorkspace calls.

Noted; we’ll make it FFI-passable. I see the name is generated by the compile_engine and not associated with the relay::Function. I’ll see to it and we can discuss in the PR.

The support for the first use case is added here:


@areusch @manupa-arm I have a question regarding the implementation of this feature:

In this thread the ability of BYOC codegens to specify their memory requirements was discussed thoroughly.

I was wondering if this specific feature is unimplemented or still on the Roadmap, because if it is already implemented, I was not able to find out how to use it.

Allowing BYOC kernels to use TVM-allocated memory would be a really nice feature, especially because of its advantages in terms of memory planning.

I would be grateful for a short reply.

Hi @PhilippvK ,

There has been a bit of development on other fronts that made us think we don’t need a separate way to express the intermediary workspace used by BYOC external functions.

Please have a look at the Target Hooks RFC (cc: @Mousius).

The issue with the current BYOC is that it completely bypasses the core compiler of TVM, which we are trying to alleviate with the mentioned target hooks. Therefore, once the target-dependent lowering exposes TIR PrimFuncs to the core compiler, we can hook it up to the current workspace calculation (which works on a per-TIR-PrimFunc basis). Moreover, it will allow using the memory planning that is being proposed in the USMP RFC.