@comaniac, thanks for the comments; answers to each of your questions follow.
#1 At the first glance most implementations, including the Relay passes, were done in Python. It would be better to implement them in C++ for better performance
Regarding the “pipeline_graph” logic: this part only does a one-time enumeration of the Relay graph, so the performance difference between C++ and Python should be tiny. We also found that some existing logic, such as VTA's graphpack.py, does similar graph work at the Python level, so we think we can ignore this small performance difference and keep the logic in Python; see the traversal sketch below.
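For illustration only (this is not the PR code), a minimal sketch of the kind of one-pass enumeration involved, using TVM's post-order visitor; the whole cost is a single O(N) walk over the graph:

```python
from tvm import relay

def enumerate_ops(expr):
    """Collect every operator in one post-order pass over a Relay expression."""
    ops = []

    def fvisit(node):
        if isinstance(node, relay.expr.Call):
            ops.append(node.op)

    relay.analysis.post_order_visit(expr, fvisit)
    return ops
```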
The term and namespace “subgraph” is improper and confusing
The split logic splits a Relay graph into a group of smaller Relay graphs, and every new small Relay graph is a part of the original one, so we named them subgraphs; the subgraphs then get pipelined for compute. Could I get more information about why this is improper and confusing? Do you have a recommended name?
It would be better to break the PR down to 3 smaller PRs (one PR per component) for reviewing. Each component should have its unit tests with hand crafted inputs instead of a single set of unit tests to test all 3 component.
This makes sense; it is actually what we planned to do at the beginning, as mentioned in the former RFC (Compute Graph Pipeline - #13 by hjiang). At that time, as we discussed, a PR without the runtime did not seem self-contained and would be difficult to review, so I created this new PR with the runtime logic to make everything self-contained and review-friendly.
But I also understand that a series of small PRs would be easier to review. I will put cross-references in each PR so reviewers can see how the PRs relate and get the whole picture of the feature.
About the PR split: the first and second modules have no strong dependency, so we can certainly split them, but the second and third modules are tightly coupled, so we may need to keep them as one PR. Please let me know what you think.
Component 1: What’s the current split logic? From the PR it seems to me that you simply split every op to a subgraph? Ideally we should perform dependency analysis and minimize the number of subgraphs.
The split logic comes from manual configuration; we do not split every op into its own subgraph. For example, for a Relay graph with 5 operators, if the split configuration is [1], we get 2 subgraphs: the first contains operators {0, 1} and the second contains operators {2, 3, 4}. The internal dependencies inside each subgraph are unchanged, and externally, subgraph 2's input is subgraph 1's output; see the sketch below.
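A hypothetical sketch of these split semantics (the actual configuration format in the PR may differ): each split point is an operator index that closes the current subgraph.

```python
def split_by_indices(num_ops, split_points):
    """Partition operator indices [0, num_ops) at the given split points.

    A split point p means operator p is the last operator of the
    current subgraph.
    """
    bounds = [0] + [p + 1 for p in split_points] + [num_ops]
    return [list(range(bounds[i], bounds[i + 1]))
            for i in range(len(bounds) - 1)]

# Split point [1] on a 5-operator graph yields [[0, 1], [2, 3, 4]].
print(split_by_indices(5, [1]))
```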
Component 2: It seems not ideal to let users build each subgraph manually, and from your test case, it seems like users have to make sure the built runtimes are in the correct order? The order should be guaranteed in some artifacts
Yes, currently the user must ensure the correct order, but a single build may have two problems. First, the user loses control over the per-subgraph build configuration, such as the optimization level or which operators to bypass, and putting that information into the artifacts seems likely to make them too complex and hard to use (see the sketch below). Second, even with artifacts and a single build, the user still needs to supply the order information.
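To make the first point concrete, here is a sketch of the per-subgraph control we want to keep, with two toy Relay modules standing in for the splitter's real output; each subgraph is built with its own target and optimization level:

```python
import tvm
from tvm import relay

# Two toy subgraphs standing in for the outputs of the split step.
x = relay.var("x", shape=(1, 16))
sub1 = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))
y = relay.var("y", shape=(1, 16))
sub2 = tvm.IRModule.from_expr(relay.Function([y], relay.tanh(y)))

# Each subgraph keeps its own build configuration.
configs = [(sub1, "llvm", 3), (sub2, "llvm", 2)]
libs = []
for mod, target, level in configs:
    with tvm.transform.PassContext(opt_level=level):
        libs.append(relay.build(mod, target=target))
```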
To fix this issue, how about adding a device-list parameter to subgraph_create, as follows, to guarantee the order is correct?
dev_list = ["cuda", "cpu", "vta"]
runtime = subgraph_create(libs, dev_list)
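As a hypothetical sketch of how the positional device list could enforce the order (subgraph_create and its return value here are assumptions, not the PR's actual implementation):

```python
def subgraph_create(libs, dev_list):
    """Bind module i to dev_list[i], fixing the pipeline order at creation."""
    if len(libs) != len(dev_list):
        raise ValueError("need exactly one device per subgraph module")
    # Stage i consumes the output of stage i - 1 and runs on dev_list[i],
    # so callers cannot accidentally reorder the stages afterwards.
    return list(zip(libs, dev_list))
```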