The interface could still handle things automatically. But we could divide the handling into two steps, the partition function that partitions the functions and set the attributes, and the compilation function that compiles the result module
I think we can just do the following:
mod = relay.build_extern(mod, target='xx')
mod = relay.build(mod, target='llvm')
build_extern
splits the graph into subgraphs and annotates them with external targets. The second one is the normal one for compilation. It can take whatever partitioned and annotated graph to produce the compiled artifacts.
Below are some areas that it wasnât entirely clear to me how theyâd be handled. This relates more-so to parallelism, but seems like this particular approach to bringing your own codegen would also impact how parallelism is represented, so Iâll ask it here, but let me know if I should be looking at some different proposal for that.
-
Some single ops will need to be executed concurrently and cooperatively across multiple devices, how do we represent that? This is typical for sea-of-nodes hardware and in general for model parallelism.
-
Just because an op can run on a device doesnât mean it should. For an extreme case, consider a TF graph that got split into several pieces and one of the pieces is just an addition by itself, or just a relu by itself. It doesnât make sense to transfer the inputs to the device and retrieve the output back just to do an addition. Also some ops might make sense to transfer back to the CPU because the CPU can do them faster and then back to the device, e.g. sparse lookups on non-sparsity-appropriate hardware - I think this proposal would require cutting the graph into pieces in this case, which has the usual problems that cutting of the graph entails. Automatically or manually determining which ops should run where to optimize performance and minimize memory usage (not just to do something that works) is going to be a big thing over time and specifying such a thing should be a good fit.
-
Two devices may need to communicate with each other and we do not want to force them to go through the host for a large transfer. This is again typical in sea-of-nodes hardware and comes up in other contexts like just plain model parallelism. How do we represent sending data between devices? Can two functions refer to each otherâs nodes?
-
(follows on from last point) How do we represent overlapping transfers of data with compute in a fine-grained way? This is an important optimization.
Thanks for the comments and they are valuable in my opinion. Here are my thoughts mainly about the scope of this RFC. We could refine it based on the discussion.
-
Could you make some examples so that we can see directly how it should work? It seems to me that this is an open question and we should narrow down to practical scenarios.
-
Correct. This proposal doesnât cover a mechanism to decide an op should go to which device. We simply do offloading based on the subgraph attribute. As the first step of supporting graph partitioning, we aim to make the offloading mechanism working so that we can follw-up with those advance issues easily.
-
We make the runtime mechanism straightforward by simply tansferring data at the subgraph boundaries. As a result, it is true that uncecessary data transfer happens with two consecutive subgraphs. We plan to address this issue in the subgraph merging problem, which we will file another RFC for discussion. In the subgraph merging problem, our goal is to minimize the number of subgraphs while preserving the correctness.
-
Similar as 3, we aim to merge offloadable ops and minimize the subgraph numbers in the follow-up RFC.
If you have a giant tensor coming into an op, you might want that op to be parallelized across multiple devices. This would apply to sea-of-nodes inference hardware and also comes up in training. TF has some documentation on it here:
I think minimizing subgraphs by merging them isnât going to be a full solution here. I think you want to be able to represent transfers between devices (including the host) in a way beyond just function calls and return values. Essentially you want some way of representing an edge between ops in different functions that are placed on different devices (counting the host as a device here) that can transfer asynchronously. The question is how to actually represent that, where thorny issues include things like safe memory allocation and dead locks (e.g. ensuring that both sides of the transfer arrive at the transfer at the right time).
A related question is how to pass in a buffer that is already on the device, e.g. passing in weights on the device that shouldnât be transferred for each inference. I think in this approach that would not be possible.
When tagging a function to run on a specific target, it seems not enough to say what the device type/backend is, itâs also necessary to say which of the devices of that type to run the op on, since a host can contain multiple devices.
I wonder if this proposal isnât mixing up two things that could be separate: A) the ability to compile and run ops with a custom codegen/runtime and mixing this within a graph, and then B) how to represent parallelism and cross-device communication (counting the host as a device), as in tying this to function calls. The solution to A has implied a lot of things about B, like how graph partition occurs and tying data transfer to function calls. I was responding just to the B part. I wonder if A and B could or should be separate proposals? (not sure - maybe they are just too intertwined to split apart)
In the interest to keep things concise, I think runtime parallelization is something that worths its own RFC.
If the decision of separation can be made at compile time(or JIT), Likely the same abstraction will stand for that purpose. e.g. havea DyanmicDispatcher module that schedules things into functions provided by each of the runtime module, while we can have runtime module that is not aware of parallelism.
I think we donât really bake the custom/codegen information into the subgraph/sub-function. We need some sort of this information at the high-level to help us partition the graph. Once the graph is partitioned, the new graph will contain super-nodes that will be dispatched to the target backends. The runtime of each subgraph is independent which is coordinated by the TVM runtime/executors.
We can extend this to support multiple backends by specifying multiple targets. But we need to have some mechanism to decide where an operator should be offloaded to when it could be supported by various devices. This is currently not supported. The short-term goal is to prepare the basic infra first, and more sophisticated coloring/annotation will be considered later.
I agree that the current design doesnât consider about the runtime parallelism (both op and model level), but as Tianqi said this is worthy a separate discussion.
I think weâve solved most of the problem in the early discussion. I will prepare a WIP PR and see if we missed any important thing.
Re op-level.
I tried to do this similar things internally. If we do op-level, which maybe break other inference engine existing optimization. for example, conv + relu, broken into conv (3rd party) + relu (tvm). But it is better conv(3rd party) + relu (3rd party).
Or your op-level is just wrapper, which will do the similar things like subgraph-level?
Since the TVM op-fusion pass will bypass external functions, ops in a subgraph will be fused. One concern is that the third-party vendor may not know how to deal with fused ops, and it results in a steep learning curve and long development time.
According to the concern, we preserve the ops in the subgraph and let the backend decide if they can be fused or optimized. In the WIP PR we have filed for this RFC, you could use the following approaches to address the issue you mentioned in the example.
- Use
extern_op
to annotate both conv2d and ReLU. - Implement an annotator to put both conv2d and ReLU in a subgraph.
As a result, both of them will be executed on the 3rd party device. Note that the first approach will result in two subgraphs in the current PR, so the second one is more desirable if performance is the main concern. We are planning to have an algorithm to merge small subgraphs to a big one to enable more optimization opportunities as well as minimize the data transfer and kernel launch overhead.
Great feature!
Some questions/thoughts on more complex partitioning and the relation to quantization:
-
What if you want to partition the graph using rules across multiple operations, e.g. an accelerator can handle conv2d->relu but not relu->conv2d? Can this be done by implementing a custom annotator?
-
Use cases might have a need for a similar custom annotator that groups multiple operations in the same subgraph, possibly using cross operation partitioning rules. Would it be an idea to work on a specification for partitioning rules and a corresponding out-of-the-box partitioning/annotator algorithm that can load and apply the provided rules (e.g. from dictionary/json)?
-
I agree with @broune that scheduling operations/subgraphs on a heterogeneous hardware platform will become very important for performance at some point so it might be good to keep this in mind already. For example, this is something that could be handled by AutoTVM? So how would partitioning integrate with that system? I guess this is probably a feature for partitioning v(X>1) but I am mentioning it because it seems at odds with implementing custom partitioners/annotators. In the future, I think you want a partitioner that is aware of all the possible targets with their particular partitioning rules (which might be across operations) so it can optimize for that?
-
How does the partitioning relate to quantization? Suppose I want to only quantize the subgraph (or use a different quantization algorithm for a subgraph and the rest of the graph)?
My responses to 1-3:
- You can customize the annotator to generate subgraph 1 while avoiding subgraph 2:
subgraph 1:
subgaph_begin -> conv2d -> ReLU -> subgraph_end
subgraph 2:
subgraph_begin -> ReLU -> conv2d -> subgraph_end
-
Thatâs our next step. We will develop an algorithm to group (or say partition/annotate) offloadable ops to one subgraph and the backend can decide weather to fuse them or not.
-
This is a valuable question and we are also investigating it. Whatâs lacking now is that we need an interface/API to let backend developers represent a tuning space and invoke AutoTVM. This will be a follow-up work, but whatever the interface/API we will come up, it will fall into the current design (specifically, this logic will be in the compile function in the customized codegen).
@jtuyls, thanks for the comments.
What if you want to partition the graph using rules across multiple operations, e.g. an accelerator can handle conv2d->relu but not relu->conv2d? Can this be done by implementing a custom annotator?
Yes, you can decide how you want to annotate based on the inputs.
Would it be an idea to work on a specification for partitioning rules and a corresponding out-of-the-box partitioning/annotator algorithm that can load and apply the provided rules (e.g. from dictionary/json)?
This is a good point. We actually allow this if users use the customized pass to annotate the graph. They can use whatever self-defined algorithms/rules to annotate the graph. But we do need to have some mechanism to make sure that the arbitrarily annotated graph is valid. One of the main reasons for us to provide the annotation template to users/vendors is to give them the flexibility to quickly offload operators in a transparent way.
For example, this is something that could be handled by AutoTVM? So how would partitioning integrate with that system? I guess this is probably a feature for partitioning v(X>1) but I am mentioning it because it seems at odds with implementing custom partitioners/annotators. In the future, I think you want a partitioner that is aware of all the possible targets with their particular partitioning rules (which might be across operations) so it can optimize for that?
Yeah, I think this should be compatible to AutoTVM, but I havenât really done much. This is very interesting. But I am not sure how useful it would be if the subgraph is handled by accelerators which AutoTVM doesnât have much context. Depending on the use cases, we might need to consider multiple targets in the future and partition the graph in the way that it can efficiently take advantage of each of them. But the current design is actually flexible to handle it.
How does the partitioning relate to quantization? Suppose I want to only quantize the subgraph (or use a different quantization algorithm for a subgraph and the rest of the graph)?
The partitioned subgraph is still a valid Relay function so we can still apply optimizations, such as quantization, on it.
Thanks for the responses.
Whatâs lacking now is that we need an interface/API to let backend developers represent a tuning space and invoke AutoTVM.
That would also be very cool but I was thinking in the direction of optimizing the subgraph scheduling. So optimization at the partitioning level, which subgraphs are possible given all targets and then choosing among them. The tuning space would be the partitioning rules which allow you to generate valid subgraph combinations to evaluate. Given a specific subgraph combination you then get a smaller low-level AutoTVM tuning subspace (similar to the current one, possibly even extended with tuning spaces for the accelerators like you are mentioning).
But I am not sure how useful it would be if the subgraph is handled by accelerators which AutoTVM doesnât have much context.
If you donât have context your accelerator is a black-box but you can still optimize for different subgraph combinations. So your tuning space consists of the high level partitioning rules and low level CPU/GPU optimizations. If you allow developers to add context, the tuning space will become even larger.
Depending on the use cases, we might need to consider multiple targets in the future and partition the graph in the way that it can efficiently take advantage of each of them. But the current design is actually flexible to handle it.
In the spirit of âchallenging the RFCâ, I am still wondering: suppose you have a graph: conv2d1 -> relu -> conv2d2 and two targets: NPU1 and NPU2.
NPU1 annotator gives you:
subgraph_begin -> conv2d1 -> relu -> conv2d2 -> subgraph_end
NPU2 annotator gives you:
subgraph_begin -> conv2d1 -> relu -> subgraph_end -> conv2d2
Now, there are multiple possible ways to partition the original graph using the annotations for the two NPUs but not all of them are necessarily valid. Suppose you opt for:
subgraph_begin -> conv2d1 -> subgraph_end -> subgraph_begin -> relu -> conv2d2 -> subgraph_end
where the first subgraph is for NPU2 and the second for NPU1. Now, because NPU1 can handle conv2d -> relu -> conv2d doesnât necessarily mean that relu -> conv2d is also valid.
So, if I understand the custom annotator approach correctly, the problem for this case is that annotations give you only one way to partition the graph but donât specify all possible combinations. This might become a problem once you want to optimize for multiple targets (or even CPU+NPU).
The scope of this RFC doesnât include automatic graph partitioning, so we require developers to specify the partition rules for their codegen. However, once we have a smart way to do partition in a general way, it is easy to to be added and let developers leverage directly.
For the multiple targets, I think the issue you raised is a more general scenario that considers more than one types of external backends. While I am not sure the practical use case of this scenario due to the complexity and data transfer overhead, an intuitive solution in my mind is simply assigning an order to their annotators. Supposing we assign NPU1 > NPU2, then you first have:
subgraph_begin -> conv2d1 -> relu -> conv2d2 -> subgraph_end
then NPU2 will not see partitionable ops so nothing would happen. As a result, the subgaphs are always valid. Of course, how to decide which op should go to which kind of external devices is an open-ended research problem and should be handled in another follow-up RFC.
In addition, multiple same kind devices is another issue. I would use the term âplacementâ instead of âpartitionâ to differentiate it. Supposing you have two same kind NPUs (and they have the same annotator of course), then placing which subgraph to which NPU should be determined in runtime to consider the real-time situations.
Does this schema use topi schedule for ops? or just by pass the schedule and call the extern APIs for ops? graph optimization (graph partitioning)â> backend 3rd codegen for subgraphs, So the subgraph is compiled by 3rd codegen directly, do not need to translate to halide IR at first? @zhiics
@wda Yes, we donât need to lower the functions/subgraphs that will be compiled by 3rd codegen tools into TVM IR and then Halide IR. They will be handled by external tools.
@zhiics first of all thanks for the great contribution!
Iâve alreaday built TVM with USE_DNNL_CODEGEN
enabled and now Iâd like to test this new build. What do I have to change in the standard scripts that I use to build and run my models using Relay to use dnnl
codegen?
While we are working on a tutorial for all of these merged PRs, you can first refer to our unit test at tests/python/relay/test_external_codegen.py.
However, the DNNL backend has to be polished further to achieve desired performance. Now it only supports limited number of ops, so there will be lots of pieces of subgraphs and result in data transfer overhead. You are still welcome to give it a shot and even help us make it better if possible
Here, at CEVA DSP, weâd like to write codegen(s) that uses CEVAâs DSP processors and AI accelerators. And later on upload to main branch as a backend. The question: Should I use https://github.com/zhiics/tvm/tree/partitioning as base or latest release V0.6?
Considerations: Partitioning fork seems to be exactly what I need. The issue is how likely is it to be merged into the main release (in one form or another).
Thank you in advance.
@etom42 Thanks for your interest. Weâve already split the code in my branched and committed to the upstream TVM code base. The only thing left is the annotation template which may need a bit more discussion. But we allow users to annotate their own graph and partition it based on the annotated graph. There are tutorials under review. You may want to take a look at them.