[RFC] Compute graph pipeline with new subgraph executor

This is a follow-up RFC to Compute Graph Pipeline.

PR [Runtime]Pipeline Compute Graph With New Subgraph Executor by huajsj · Pull Request #7892 · apache/tvm · GitHub

RFC: split the Relay graph into subgraphs, then pipeline the subgraphs.

In this RFC, we present a new framework for subgraph pipelining: it first splits a Relay graph into multiple subgraphs (groups of operators) and then schedules those subgraphs. The framework consists of 3 parts, and the logic is self-contained. The PR is submitted as [Runtime]Pipeline Compute Graph With New Subgraph Executor by huajsj · Pull Request #7892 · apache/tvm · GitHub

#1. Relay graph splitting.

All of the logic is in analysis.py:pipeline_graph. The inputs of this function are a Relay graph and an integer array that describes where to split;

the output is a list of Relay graphs (“subgraphs”) that can be used for the later pipeline scheduling.

#2. Runtime/Executor and Interface

subgraph_executor.cc and subgraph_executor.h (in src/runtime/subgraph/), together with subgraph_executor.py, contain the subgraph executor/runtime logic. They implement the interface the frontend uses to

“create subgraph executor”, “set input”, “run”, “get output” and “stop”.
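As a rough illustration, the caller-facing flow might look like the sketch below. This is a hedged sketch only: the exact Python entry points (such as a create helper in subgraph_executor.py) and argument names are assumptions for illustration, not the exact API in the PR.

# Hypothetical usage sketch of the subgraph executor interface described above.
# libs: per-subgraph compiled libraries, devs: the device each subgraph runs on.
runtime = create(libs, devs)            # "create subgraph executor"
runtime.set_input("data", input_data)   # "set input"
runtime.run()                           # "run" the pipeline (subgraphs execute in pipelined fashion)
output = runtime.get_output(0)          # "get output" (poll the pipeline output)
runtime.stop()                          # "stop" the pipeline threads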

#3. Scheduling and Memory movement

The remaining files (everything except the runtime/executor files mentioned in #2) implement the pipeline scheduling, memory movement and synchronization

logic.

Before this RFC, I posted an initial RFC that only discussed/implemented the Relay graph splitting. As we discussed there, that did not seem

self-contained, so I created this new RFC with subgraph runtime/executor support to make the whole logic self-contained and review friendly.

We also discussed integrating the subgraph logic into BYOC to avoid redundant code. Integrating into BYOC is convenient for backends

that use the BYOC mechanism, but the pipeline logic would then be unavailable to backends that do not go through BYOC, for example VTA. For that reason we keep the subgraph split logic as an independent function, so that the pipeline feature is available to both BYOC and non-BYOC backends.

Motivation for subgraph pipeline

Many edge devices used for inference today are SoCs. The ideal scenario is to use multiple chipsets to compute together in parallel. One existing solution is batch processing on different chipsets, for example processing image1 on the CPU, image2 on the GPU and image3 on the FPGA.

One issue is that different chipsets have different latency: a slow chipset like the CPU may be 10x or more slower than a fast chipset like the FPGA, so using the whole compute graph as the scheduling unit introduces a large per-image latency even if throughput improves. Hence we need a smaller scheduling unit, such as a subgraph, to use the heterogeneous hardware resources of an SoC more efficiently and reach better performance.

Another benefit of the subgraph pipeline is the ability to use more backends in parallel for inference; with the help of RPC, a backend

that joins the subgraph pipeline can even be located remotely in the cloud.

In our own use case, we started thinking about the subgraph pipeline feature while cooperating with Autoware to use VTA as an open-source vision control unit solution on an Ultra96 board. At that time we hit a performance issue with image-processing throughput: VTA uses the CPU to process the first conv layer and FPGA+CPU for the other layers, so while the CPU is processing the first layer the FPGA is actually idle. We then tried splitting the graph into 2 parts and pipelining them, and got roughly a 30%+ performance improvement without any hardware change.

Background and Prior Work

This subgraph runtime/executor relies on graph_executor for operator execution and storage setup; it is one more wrapper around graph_executor.

Goals&Scope

The goal of this new framework and runtime is to meet that requirement and get a performance improvement by doing operator/subgraph level scheduling. In this initial version, what we developed includes

#1. a function to split the compute graph

#2. a tiny runtime/executor

Some features that may be useful but are not available in this version:

#1. the subgraph split configuration is user-defined; automatic generation or optimization of the split logic is not there yet

#2. there is no CPU affinity configuration for the threads that run the pipeline control logic in this version

Architecture of solution

In this RFC and PR, we split our work into 3 parts; from top to bottom they are as follows.

Module 1: Relay graph split logic (analysis.py:pipeline_graph)

The function of module 1 is to split a Relay graph into a group of Relay subgraphs; a hedged sketch of a simple network split is given below.

There are 2 inputs to this function: (1) the Relay graph and (2) the split array. The output is an array of modules. The function does 2 main pieces of work: first it splits one Relay expression into multiple independent

expressions, and second it rebuilds the metadata references so the new expressions can still resolve their metadata correctly. The detailed logic can be found in the pipeline_graph function.
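The sketch below builds a toy 4-operator network and splits it. Only the pipeline_graph(expr, split_array) call itself comes from this RFC/PR; the network and the split point are made up for illustration.

# Hedged sketch: build a tiny Relay function and split it with pipeline_graph.
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 3, 224, 224))
weight = relay.var("weight", shape=(16, 3, 3, 3))
net = relay.nn.conv2d(data, weight, padding=(1, 1))  # op 0
net = relay.nn.relu(net)                             # op 1
net = relay.nn.global_avg_pool2d(net)                # op 2
net = relay.nn.softmax(net)                          # op 3
mod = tvm.IRModule.from_expr(relay.Function(relay.analysis.free_vars(net), net))

# Split after op 1: subgraph 0 holds ops {0, 1}, subgraph 1 holds ops {2, 3}.
split_array = [1]
subgraph_mods = pipeline_graph(mod["main"], split_array)  # returns a list of Relay modules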

Module 2: Subgraph runtime/executor[Interface]

After splitting a Relay graph into a group of subgraphs, the user can build each subgraph with any backend they prefer; the subgraph runtime then does the job of scheduling these modules/libraries in a

pipelined fashion.

The subgraph runtime can be split into 2 modules. One is the interface part, which is responsible for interacting with the caller to set the data/parameters/modules that need to be scheduled and for doing the initialization/

run/stop work. The second part is the scheduling logic, which is responsible for pipelining the subgraphs and transferring data between them. Here we only talk about the interface part.

The subgraph runtime [Interface] has 3 files: subgraph_executor.py, subgraph_executor.cc and subgraph_executor.h.

subgraph_executor.py is the first interface, used to receive the data and parameters for subgraph runtime creation. Internally it calls the Cython functions exposed by subgraph_executor.cc/.h to create a “SubGraphRuntime” instance on the C++ side, and then goes through functions such as “SetInput” and “GetOutput” provided by “SubGraphRuntime” to set inputs and get outputs from the C++ part (subgraph_executor.cc/.h).

Another piece of work done in subgraph_executor.py is that it creates a “graph_executor” module for every subgraph module; the “module” information of each “graph_executor” is then sent into the C++ part as

the scheduling unit for the pipeline logic.
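Roughly, the per-subgraph executor creation could look like the following simplified sketch. The loop structure and the variable names (libs, devices) are assumptions for illustration; only graph_executor.GraphModule is the standard TVM API.

# Hedged sketch of the per-subgraph work done in subgraph_executor.py:
# wrap each built subgraph library in a graph_executor module, then hand the
# module list to the C++ SubGraphRuntime as the scheduling units of the pipeline.
from tvm.contrib import graph_executor

graph_modules = []
for lib, dev in zip(libs, devices):  # libs/devices are assumed inputs
    graph_modules.append(graph_executor.GraphModule(lib["default"](dev)))
# graph_modules is then passed to the C++ side (subgraph_executor.cc/.h)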

subgraph_executor.cc/.h contain the C++ logic that interacts with both the caller and the lower-level scheduling logic. A class named “SubGraphRuntime” performs the control-level scheduling work,

such as run/stop, and it also provides the storage service for the variables used by the scheduling work.

Module 3: Subgraph runtime/executor[Scheduling logic]

The scheduling logic consists of four files, “subgraph_function.cc/.h”, “subgraph_data.h” and “subgraph_struct.h”. These modules do the following work (a simplified sketch follows the list below).

#1. Maintain a pipeline thread pool

#2. Forward data between different subgraphs

#3. Copy data across devices

#4. Set input data for each subgraph

#5. Poll output data

#6. Synchronize on data-arrival notifications
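A very simplified sketch of that pipeline loop is shown below, in Python purely for illustration; the real implementation is the C++ code in subgraph_function.cc/.h, and the per-subgraph thread/queue structure shown here is an assumption, not the actual design.

# Hedged sketch: one worker thread per subgraph, connected by queues.
# Illustrates only the forwarding / notification / stop ideas from the list above.
import threading
import queue

def subgraph_worker(run_subgraph, in_q, out_q):
    while True:
        data = in_q.get()               # block until input arrives (data-arrival notification)
        if data is None:                # stop signal: forward it downstream and exit
            out_q.put(None)
            break
        out_q.put(run_subgraph(data))   # run this subgraph and forward its output
                                        # (a cross-device copy would happen here if needed)

# Toy per-subgraph callables standing in for graph_executor-backed subgraph runs
runners = [lambda x, i=i: x + i for i in range(3)]

queues = [queue.Queue() for _ in range(len(runners) + 1)]
threads = [threading.Thread(target=subgraph_worker, args=(r, queues[i], queues[i + 1]))
           for i, r in enumerate(runners)]
for t in threads:
    t.start()
# feed inputs into queues[0], poll final outputs from queues[-1],
# and put None into queues[0] to stop the whole pipeline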

Diagrams in the original post illustrate how these files cooperate: subgraph create, subgraph run, and a high-level view of the pipeline process.


@tqchen @comaniac @areusch @tmoreau89 @zhiics @giuseros.

Here are my two cents before diving into the detailed code review.

  1. At first glance, most implementations, including the Relay passes, were done in Python. It would be better to implement them in C++ for better performance.
  2. The term and namespace “subgraph” are improper and confusing.
  3. It would be better to break the PR down into 3 smaller PRs (one PR per component) for review. Each component should have its own unit tests with hand-crafted inputs instead of a single set of unit tests covering all 3 components.

Some questions for each component:

Component 1: What’s the current split logic? From the PR it seems to me that you simply split every op into a subgraph? Ideally we should perform dependency analysis and minimize the number of subgraphs.

Component 2: It does not seem ideal to let users build each subgraph manually, and from your test case it seems like users have to make sure the built runtimes are in the correct order? The order should be guaranteed by some artifacts.

Meanwhile, it would be better to keep the components independent as well. Specifically, the APIs could be

mods, artifacts = split_graph(mod)
libs = build(mods, artifacts)
runtime = create(libs)
runtime.set_input(...)
runtime.run()

One important point is that the build API does not have a strong dependency on split_graph. If a user prepares a list of modules as well as the artifacts by herself, the build API should still work. That’s why I feel the first and the second/third components should be in separate PRs.

@comaniac, thanks for the comments; answers to the questions follow.

#1 At the first glance most implementations, including the Relay passes, were done in Python. It would be better to implement them in C++ for better performance

About the “pipeline_graph” logic: because this part only does a single enumeration of the Relay graph, the performance difference between C++ and Python should be very small. We also found that some existing logic, like VTA:graphpatch.py, that does similar graph work is also implemented at the Python level, so we thought we could ignore this tiny performance difference and leave it at the Python level.

The term and namespace “subgraph” is improper and confusing

The split logic splits a Relay graph into a group of small Relay graphs, and every new small Relay graph is part of the original Relay graph, so we named them subgraphs; the subgraphs then get pipelined for compute. Could I get more information about why this is improper and confusing? Do you have a recommended name?

It would be better to break the PR down to 3 smaller PRs (one PR per component) for reviewing. Each component should have its unit tests with hand crafted inputs instead of a single set of unit tests to test all 3 component.

This makes sense; actually that is what we planned to do at the beginning, as mentioned in the former RFC (Compute Graph Pipeline - #13 by hjiang). At that time, as we discussed, a PR without the runtime seemed not self-contained and difficult to review, so I created this new PR with the runtime logic to make all the logic self-contained and review friendly.

But I also understand that a series of small PRs should be easier to review. I will put the related PR information in each PR to make sure reviewers know how the PRs relate and can get the whole picture of the feature.

About the PR split: the first and second modules have no strong dependency, so for sure we can split them, but the second and third modules are tightly coupled, so we may need to keep them as one PR. Please let me know what you think.

Component 1: What’s the current split logic? From the PR it seems to me that you simply split every op into a subgraph? Ideally we should perform dependency analysis and minimize the number of subgraphs.

The split configuration comes from manual configuration; we do not split every op into a subgraph. For example, for a Relay graph that has 5 operators, if the split array is [1], then we get 2 subgraphs: the first subgraph includes operators {0, 1} and the second includes operators {2, 3, 4}. The internal dependencies do not change within either subgraph, and externally the input data of subgraph 2 is the output of subgraph 1.
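In code form (a hedged restatement of the example above; the 5-operator module mod is hypothetical):

# 5-operator Relay graph in mod, split array [1]:
subgraphs = pipeline_graph(mod["main"], [1])
# len(subgraphs) == 2
# subgraphs[0] contains operators {0, 1}
# subgraphs[1] contains operators {2, 3, 4}; its input is subgraphs[0]'s output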

Component 2: It seems not ideal to let users build each subgraph manually, and from your test case, it seems like users have to make sure the built runtimes are in the correct order? The order should be guaranteed in some artifacts

Yes, currently the user has to make sure the order is correct, but using a single build may have 2 problems. First, the user loses control of the build configuration, such as the optimization level or which operators to bypass; putting that information into the artifacts may make the artifacts too complex and difficult to use. Second, even with artifacts and a single build, the user still needs to provide the order information.

To fix this issue, how about adding a device list parameter to subgraph create, like the following, to guarantee the order is correct?

dev_list = ["cuda", "cpu", "vta"]

runtime = create(libs, dev_list)

This is not an encouraged approach, even if its performance impact may not be high. Another obvious concern is that some users only use the TVM C++ APIs; if the pass is implemented in Python, it is hard for them to make use of it.

The term “subgraph” has been widely used in TVM, so this term is definitely confusing, whether or not it is reasonable for this particular component. It would be better to have a more specific name such as “pipelinable graph”, but if the graph splitting is just a pass that splits a Relay graph into multiple small graphs, we don’t even have to introduce a new term for that, because these small graphs are also Relay graphs.

Separating the PR won’t make it hard to review as long as you provide sufficient information, hand-crafted test cases, and still make it self-contained. IMHO, if I were you, I’d probably send the executor PR first, because then you can demonstrate that this component works well even with hand-crafted modules, and people can easily know what inputs are required to run a list of modules in a pipelined fashion.

Ok I found your use case:

pl = [0, 2, 3]
mods = pipeline_graph(mod["main"], pl)

To be honest, this is totally unacceptable. It means users have to dig into the Relay graph and specify the splitting points precisely. And what about a Relay graph with branches (e.g., Inception) or multiple functions? My feeling is that the approach to splitting a graph needs much more discussion, so if you don’t want the upstreaming process to be blocked, I would suggest upstreaming the executor and runtime first, as I mentioned above.

Order and single build are two separate topics. IMHO, I don’t see a strong motivation for building each graph separately. To me, asking users to build each module is more complex in most cases.

In addition, asking users to provide a device list is also not a good idea. We should leverage the TVM target system to keep a unified target interface.

@comaniac, thanks for the comments. I think the suggestion is to send the subgraph executor as the first PR for review and then have more discussion later about the graph split logic. That makes sense to me; I will do the related work to split the PR while addressing your comments about the name, hand-crafted modules, etc.

Hi @comaniac, the subgraph executor has been renamed to pipeline executor, and the PR now only includes the pipeline executor and the scheduling logic. Could you help with a review?

@comaniac, the line count of the “compute graph pipeline” patch ([Runtime]Pipeline Executor For Compute graph pipeline by huajsj · Pull Request #7892 · apache/tvm · GitHub) is close to 2k, which makes it a big patch. Thanks for the kind review that helps move the upstreaming process forward. I am planning to split patch 7892 into a series of small patches to make the code more readable and review friendly. Please advise whether such a plan would help.

Thanks for kindly doing so; it would definitely be helpful. Also, since we now have a new RFC process, it would be great if this feature could also have an official RFC there. Please refer to GitHub - apache/tvm-rfcs: A home for the final text of all TVM RFCs for details.

@comaniac, thanks for the prompt reply. Sure, I will do the patch split and the RFC upstream.