[Discussion/Alignment] Memory Planning

There are quite a few mentions of new global memory planning, but with this post I would like to start a more comprehensive discussion, and I would be interested in your current plans. I’m working on model transformations that aim to reduce memory usage on memory-constrained devices. For this work to show its full potential, I would need to implement a few changes in TVM’s memory planning, so this is also meant to align our goals.

Current status as far as I understand:

  • GraphPlanMemory: Goes through the final Relay IR data-flow and allocates new buffers whenever required. Every buffer is a distinct storage, and buffers can only be placed at the beginning of a storage.
  • Intra-TIR function memory allocation: Temporary buffers for operations.

Proposed goals:

  • Unify these two methods of allocation into a global memory plan. The first important step towards that seems to already be underway in [RFC][uTVM] Query intermediary workspace requirement; I do not have anything to add in this regard.
  • Improve method of planning. In my opinion, there are two good approaches:
    • Use a heuristic, as in TFLM, that often finds an optimal solution with low computational complexity
    • Find the optimum with a mixed-integer programming optimizer. This is pretty simple and optimal, but of course has higher computational complexity. It could look something like this (see the code sketch below):
      • (1) sets the objective to minimize the largest ending offset
      • (2) sets up constraints for ending and starting offsets
      • (3) ensures positive offsets
      • (4) sets up constraints against overlaps for conflicting buffers; conflicting buffers are simply buffers with overlapping lifetimes

Both approaches require an input in the form of a list of buffers with the attributes “size” and “lifetime”; or, more generally than “lifetime”, just a “conflict” relation.
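As a toy sketch of what (1)–(4) could look like in code, here is a formulation using OR-Tools’ CP-SAT solver (OR-Tools comes up again later in this thread). The buffer list below is made up; a real implementation would consume the buffer/conflict list just described:

```python
# Toy sketch of the optimization above, using OR-Tools' CP-SAT solver.
# The buffer data is made up for illustration.
from ortools.sat.python import cp_model

# Each buffer: (size in bytes, indices of conflicting buffers)
buffers = [(1024, {1}), (2048, {0, 2}), (512, {1})]

model = cp_model.CpModel()
total = sum(size for size, _ in buffers)  # trivial bound: no sharing at all

# (2)+(3): non-negative start offsets, end = start + size
starts = [model.NewIntVar(0, total, f"start_{i}") for i in range(len(buffers))]
ends = [model.NewIntVar(0, total, f"end_{i}") for i in range(len(buffers))]
for i, (size, _) in enumerate(buffers):
    model.Add(ends[i] == starts[i] + size)

# (4): buffers with overlapping lifetimes must not overlap in memory
for i, (_, conflicts) in enumerate(buffers):
    for j in conflicts:
        if i < j:
            i_first = model.NewBoolVar(f"buf{i}_below_buf{j}")
            model.Add(ends[i] <= starts[j]).OnlyEnforceIf(i_first)
            model.Add(ends[j] <= starts[i]).OnlyEnforceIf(i_first.Not())

# (1): minimize the largest ending offset, i.e. the total arena size
peak = model.NewIntVar(0, total, "peak")
model.AddMaxEquality(peak, ends)
model.Minimize(peak)

solver = cp_model.CpSolver()
if solver.Solve(model) == cp_model.OPTIMAL:
    for i, s in enumerate(starts):
        print(f"buffer {i} -> offset {solver.Value(s)}")
    print(f"arena size: {solver.Value(peak)} bytes")
```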

My current plan is to change GraphPlanMemory into a function that analyzes the given Relay IR, but instead of returning the storage_ids, it would just return the buffers with their sizes and lifetimes. Memory planning could then work as an independent unit that transforms the given buffers (which can also include the intra-TIR-function ones) into a static memory plan consisting of just one “storage” for all storages that can be combined. Additional conflict constraints would come from buffers residing on different devices.
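For illustration, the analysis output could be records along these lines; nothing here is an actual TVM API, all names are placeholders:

```python
# Placeholder sketch (not a TVM API): what the analysis could hand to a
# planner instead of storage_ids.
from typing import NamedTuple

class BufferInfo(NamedTuple):
    buffer_id: int
    size_bytes: int
    first_use: int   # index of the producing node in execution order
    last_use: int    # index of the last consuming node
    device: str      # buffers on different devices cannot share storage

def conflict(a: BufferInfo, b: BufferInfo) -> bool:
    """Buffers conflict if their lifetimes overlap or they live on
    different devices, so they may never share the same storage."""
    lifetimes_overlap = a.first_use <= b.last_use and b.first_use <= a.last_use
    return lifetimes_overlap or a.device != b.device
```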

Open questions:

  • Was there a reason why the buffers are only placed at the start of storages?
  • Maybe related to above: How to handle exotic alignment requirements?
  • How to align this with non-static memory plans?

Linking GitHub - pulp-platform/dory: A tool to deploy Deep Neural Networks on PULP-based SoC's as a possible way of doing memory planning on microcontrollers via constraint programming.

hi @r.stahl @jossevandelm,

Thanks for your post! I think this is a very interesting topic and a natural next step for microTVM. It would be great to have some detailed discussion around the next steps, as I think there are quite a few interested parties and timelines may vary. I’d prefer if we could move in a direction that could accommodate as many as possible.

I think your status summary matches my understanding.

I’d like to approach the next steps of this in a manner similar to how you’ve split the problem:

  1. Add the ability to do global memory planning
  2. Add (some) interesting memory planner implementations

In particular, I’m not completely convinced that it’ll be possible to build a single global memory planner that handles all scenarios well. And, as @jossevandelm alluded to, it also seems possible to me that we may want to perform a graph-level search over possible memory configurations. I’d like to adopt a solution that allows the user to choose the memory planner implementation if they want; I think this also dovetails well with the goal of enabling different parties to do memory planner work in parallel.

To that end, I think the first step here is to post and implement an RFC describing a global memory planner interface. As part of this, we should describe the way that TVM will model memory. This hasn’t been settled yet, but I believe general sentiment has been gravitating towards defining a set of named memory pools (e.g. given by (name : str, size_bytes : int)). A strawman interface would probably look like:

PlanMemory(device_memory_pools, prim_func) -> [(tensor_id, pool_name, pool_offset)]

Where tensor_id is a unique integer identifying the TIR buffer and prim_func is the function generated by the AOT PR. It would be nice if it were possible to separate the concept of a lifetime from TIR itself, but it might be tricky to do this if we introduce control flow in that top-level TIR function (which I think is a very likely scenario). Another thing to consider is that making prim_func a part of this interface implies we generate the top-level TIR function and do analysis on this for both Graph and AOT codegen. I think that seems like a natural evolution (to join the two codegen paths), but it should be discussed.
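To make the strawman concrete, a minimal Python sketch of that interface could look as follows; every name is hypothetical and only illustrates the shape of the inputs and outputs:

```python
# Minimal Python sketch of the strawman interface; every name here is
# hypothetical and only illustrates the shape of inputs and outputs.
from typing import List, NamedTuple

class MemoryPool(NamedTuple):
    name: str
    size_bytes: int

class TensorPlacement(NamedTuple):
    tensor_id: int    # unique integer identifying the TIR buffer
    pool_name: str    # the named memory pool the tensor is placed in
    pool_offset: int  # byte offset of the tensor within that pool

def plan_memory(device_memory_pools: List[MemoryPool],
                prim_func) -> List[TensorPlacement]:
    """Assign each tensor in the top-level prim_func an offset in a pool."""
    raise NotImplementedError  # provided by a concrete planner implementation
```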

Other considerations here are things like dynamic shapes and whether people may implement graph-level optimizations which may affect memory planning. I think if you want to write up a PoC or proposal here, feel free (particularly curious about the question of describing lifetime in a way that’s disconnected from TIR). I think also some of the ARM folks may be interested in participating here or may have their own proposal to make.

Answers to your questions:

  • Was there a reason why the buffers are only placed at the start of storages?

I don’t know, but I suspect it was mainly done since storages are per-buffer.

  • Maybe related to above: How to handle exotic alignment requirements?

What do you mean by “exotic?” Currently my thinking is to support word alignment. For memory which needs to be placed in a specific region, my thinking is that the named pools are user-facing and the names should mean something to the user. It should be possible to use the memory map produced by Model Library Format to define those memory pools in the C project on terms acceptable to the project’s linker script.

  • How to align this with non-static memory plans?

This is a big question I’m not sure about. I imagine something like this (a toy sketch follows the list):

  1. split the graph into static and dynamic tensors
  2. dynamic tensors have size ranges; initially the max is expressed as sizeof(X) * n, n an expr in terms of input shapes
  3. the upper bounds of such size ranges become bounded by solving constraints once memory planning is complete
  4. offsets are computed at runtime using the expr defined above
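Purely as a toy illustration of steps 2–4 (nothing below is an existing TVM mechanism):

```python
# Toy illustration only; nothing here is an existing TVM mechanism.
ELEM_BYTES = 4  # sizeof(float32) for a dynamic tensor X of length n

def planned_upper_bound(n_max: int) -> int:
    # step 3: constraint solving after memory planning bounds n by n_max,
    # so the static plan reserves ELEM_BYTES * n_max bytes for X
    return ELEM_BYTES * n_max

def runtime_offset_after_x(x_offset: int, n: int) -> int:
    # step 4: a tensor placed right behind X gets its offset evaluated at
    # runtime from the same size expression, using the actual n
    return x_offset + ELEM_BYTES * n
```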

cc’ing a few others who may have input here.

@junrushao @tqchen @jroesch @manupa-arm @mbaret @ramana-arm @stoa @aca88

Andrew

Hi @areusch,

thank you for your detailed feedback!

I agree with you that it makes sense to have an interface for different possible memory planners. Your proposed interface would already fit all my requirements.

Memory Pools: These seem nice for handling different memory properties with their constraints (like GPU memory), but I’m not convinced that they are the best tool for the property of size. The optimization process would become pretty hands-on, because if my application size changes, I may have to adjust the “remaining” memory pool for the ML model and every time make sure that memory is not wasted or that performance does not suffer too much from too constrained memory. What we are used to from C compilers is a prioritization of performance or memory (-O3/-Os). In the non-prioritized category it still tries to do the best possible job while avoiding unreasonable trade-offs. In the micro-world I guess a higher amount of control over this trade-off could be desired, but not really for all TVM users, right?

Regarding lifetime and TIR: it was not entirely clear to me when the data-flow representation becomes the schedule where we can start analyzing lifetimes. I saw that at the end of Optimize, the AutoScheduler is invoked, so I assume this is the point where the schedule is regarded as final. Is this accurate? It is a bit unclear to me because the representation is still only dataflow.

Alignment: What I meant is that there could be alignment requirements that need more than word alignment, like on GPUs or for certain SIMD operations. But this could be a property of the “pool” you mentioned and then be accounted for by the planners when placing buffers at offsets.

hi @r.stahl,

Great! I’ve added a few follow-up thoughts below. I am happy to put together an RFC around a memory planning interface, but it might be next week before it’s posted. Some follow-up replies:

The optimization process would become pretty hands-on, because if my application size changes, I may have to adjust the “remaining” memory pool for the ML model and every time make sure that memory is not wasted or that performance does not suffer too much from too constrained memory.

For microTVM, this is where I think project generators could help. My thought is that we would place the memory planner output in metadata in Model Library Format, and then downstream Project API implementations could consume that and declare global buffers sized to match the application demands.

On non-microTVM (e.g. traditional OS), I think we could still dynamically allocate memory pools using e.g. malloc, but this would effectively handle all of the system memory allocation with one malloc call per memory pool. GPU-specific memory pools can be allocated by the underlying GPU driver (however, some more work needs to be done to link the memory pool with the TVMDevice that provides it).

I saw that at the end of Optimize, the AutoScheduler is invoked, so I assume this is the point where the schedule is regarded as final. Is this accurate?

Right now, we don’t finalize the schedule until we’re already in GraphExecutorCodegen, so I think this may be the source of the confusion. I would like to move to a world where GraphExecutorCodegen consumes the top-level AOT TIR (or some similar thing).

Then later on, after PR 7518 is merged, we would like to be able to define compiler passes over the entire scheduled program. This would allow for full dataflow analysis, which, until that point, will need to be confined to the single top-level TIR function generated from AOT.

My proposal is now implemented.

I ended up completely replacing the content of graph_plan_memory.cc with a Python implementation:

To work with the data-flow graph more conveniently, this adds networkx as a dependency. For the optimal solution, it would drag in the OR-Tools optimizer as well. The code is currently also fairly full of debug functionality that TVM might not want.
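For readers who want the flavor of the approach, here is a toy version of deriving buffer lifetimes from a data-flow DAG with networkx for one topological order; this is not the actual memplan.py code:

```python
# Toy version (not the actual memplan.py code): derive buffer lifetimes
# from a data-flow DAG with networkx, for one topological order.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])

order = {node: i for i, node in enumerate(nx.topological_sort(g))}

# The buffer produced by a node must live from the node's own position
# until its last consumer has executed.
lifetimes = {
    node: (order[node],
           max((order[s] for s in g.successors(node)), default=order[node]))
    for node in g.nodes
}
print(lifetimes)  # e.g. {'a': (0, 2), 'b': (1, 3), 'c': (2, 3), 'd': (3, 3)}
```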

How to contribute this back?

First results are shown below. This is evaluated on RISC-V using the code generator. There might still be some issue, because I would not expect the RAM usage to rise in any example. But it looks promising so far, with a 10% reduction for a cifar10 model and 18% for resnet!


Hi,

Very interesting work.

I was wondering about the need to get all topological sorts. How robust is this method, given that general DAGs can have very many topological sorts?

I guess most manually designed networks will have some simple structure that makes getting all topological sorts feasible, but NAS-designed networks might pose a problem.

Thanks


Hi @aca88

thanks for your interest!

For the evaluated models, we just used a single schedule as given by TVM: tvm/memplan.py at tumeda_memplan · tum-ei-eda/tvm · GitHub

You are right that for more complex graphs, we would have to evaluate more schedules to find the optimal solution.

However, getting all topological sorts is a bit of lazy overkill. I have not looked into this in detail yet, but some thoughts:

  • We only need to evaluate parallel paths independently
  • On nested parallel paths, work inside-out
  • The optimal order within parallel paths depends on the maximum and minimum memory usage of the individual nodes. I’m sure there is a good algorithm for this, better than the brute-force approach

hi @r.stahl ,

Thanks for posting your implementation!

First results are shown below. This is evaluated on RISC-V using the code generator. There might still be some issue, because I would not expect the RAM usage to rise in any example. But it looks promising so far, with a 10% reduction for a cifar10 model and 18% for resnet!

These are great results!

It would be awesome to find a way to contribute this to TVM. Here are some thoughts along those lines…

  • We’ve now landed initial work towards the AOT executor, and there’s been some parallel work to do memory planning with the AOT executor.
  • The AOT planner currently uses GraphPlanMemory but there is a similar proposal to replace GraphPlanMemory with something else.
  • It would be great if we could come to a single memory planning interface and use that for both Graph and AOT memory planning.
  • With AOT, memory planning happens at the TIR level, which I think is slightly better as it allows for planning scratchpad/workspace memory alongside intermediate/output tensor memory. However, I think the fundamental inputs to any memory planning algorithms are similar.

In [RFC] Unified Static Memory Planning, there is a proposal for a memory pool-based planner interface. Could you provide some input on the interface planned there, and see if it’s compatible with your work here? It seems like we could move forward by merging that interface and replacing GraphPlanMemory with something that uses the new interface. Then, if I’m understanding correctly, the planner work you’ve done here with networkx could become an implementation of that interface.

Previously you’d critiqued memory pools, so I also just wanted to follow-up:

What we are used to from C compilers is a prioritization of performance or memory (-O3/-Os). In the non-prioritized category it still tries to do the best possible job while avoiding unreasonable trade-offs. In the micro-world I guess a higher amount of control over this trade-off could be desired, but not really for all TVM users, right?

My thought is that memory pools still work as an abstraction here, and additional parameters can be provided to any memory planning algorithm to enable trade-offs such as these. If the user wishes to try for highest performance, they can offer as much memory as they can to the planner to see if it improves things.

I think in general it’s good to retain flexibility in TVM, as use cases tend to be quite varied. We may need to ensure there are sane defaults, though. Is there a use case you had in mind where the additional control is a drawback?