For graph partitioning, we partition the graph into subgraphs that are suitable for different devices. There are two options for graph level optimizations after we obtain the partitioned subgraphs.
a) Do graph level opts, such as fusion and precompute, on each subgraph, and then replace the original subgraph with its optimized counterpart.
b) Group the subgraphs together and perform graph level opts on a “single” graph, then split the subgraphs and replace the original ones.
It seems there is not much difference between these two methods. The second one might be cleaner.
Nodes can be annotated with context info during partitioning. Data copy operators can be inserted when the subgraphs are being replaced.
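To make the annotation step concrete, here is a rough sketch of what it could look like as an NNVM graph pass (the pass name, the op-to-device mapping, and the "device" graph attribute are only for illustration, nothing here is final):

#include <nnvm/graph.h>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch only: assign a device id to every node, defaulting to a fallback
// device (e.g. CPU), and stash the result as a graph attribute that later
// passes (fusion, precompute, data-copy insertion) and the runtime can read.
nnvm::Graph AnnotateDevice(nnvm::Graph g,
                           const std::unordered_map<std::string, int>& op2dev,
                           int fallback_dev) {
  const auto& idx = g.indexed_graph();
  std::vector<int> device(idx.num_nodes(), fallback_dev);
  for (uint32_t nid = 0; nid < idx.num_nodes(); ++nid) {
    const auto& inode = idx[nid];
    if (inode.source->is_variable()) continue;
    auto it = op2dev.find(inode.source->op()->name);
    if (it != op2dev.end()) device[nid] = it->second;
  }
  g.attrs["device"] = std::make_shared<nnvm::any>(std::move(device));
  return g;
}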
For compilation and runtime, we have agreed that compiling the subgraphs within a single graph would be more convenient, because it eases the work of reconstructing the subgraphs and keeps the runtime cleaner.
There are some concerns about how much modification of the runtime is required.
The current build API only takes one target. We need to modify this API or add a new one to accommodate multiple contexts.
We will also need to load modules built for different targets, so some work is needed on the runtime side to make this work.
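As a rough sketch of the module-loading side (the file names below are placeholders), the existing Module::LoadFromFile / Import machinery may already cover part of it:

#include <tvm/runtime/module.h>

// Load the host (CPU) library and the device (GPU) library, then merge them
// so a single module handle can resolve packed functions compiled for either
// target. LoadFromFile and Import are existing runtime APIs; handing the
// merged module to a multi-context graph runtime is the part still missing.
tvm::runtime::Module LoadHeterogeneousLib() {
  tvm::runtime::Module cpu_lib =
      tvm::runtime::Module::LoadFromFile("deploy_cpu.so");
  tvm::runtime::Module gpu_lib =
      tvm::runtime::Module::LoadFromFile("deploy_gpu.so");
  cpu_lib.Import(gpu_lib);
  return cpu_lib;
}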
Any comments and suggestions are greatly appreciated.
The most common use case so far is falling back to CPU; I would recommend we start from there, and this does not require modification to the current interface.
@tqchen Thanks for the suggestion. That essentially means all nodes are in a single graph and executed on the CPU, right? I am currently putting together a test to verify that the components on the critical path work, e.g. annotating nodes, inserting data-copy nodes, compiling a single graph, and executing in the runtime.
I am trying to register a cross-device data copy operator. Its input would be the output of a node on one device, and its output would feed a node on another device. I think it should be compiled by TVM later. What should the operator look like? Thanks.
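One shape I was considering, just as a sketch (the op name device_copy is made up, and I am not sure this is right): a 1-input/1-output NNVM op whose shape and dtype pass straight through, with the actual TVMArrayCopyFromTo handled elsewhere.

#include <nnvm/op.h>
#include <nnvm/op_attr_types.h>

// Sketch only: a 1-in/1-out op whose shape and dtype pass through unchanged,
// so graph-level passes can reason about it.
NNVM_REGISTER_OP(device_copy)
.describe("Copy data across devices; handled specially later in the pipeline.")
.set_num_inputs(1)
.set_num_outputs(1)
.set_attr<nnvm::FInferShape>(
    "FInferShape",
    [](const nnvm::NodeAttrs& attrs,
       std::vector<nnvm::TShape>* in_shapes,
       std::vector<nnvm::TShape>* out_shapes) {
      (*out_shapes)[0] = (*in_shapes)[0];
      return true;
    })
.set_attr<nnvm::FInferType>(
    "FInferType",
    [](const nnvm::NodeAttrs& attrs,
       std::vector<int>* in_types,
       std::vector<int>* out_types) {
      (*out_types)[0] = (*in_types)[0];
      return true;
    });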
@tqchen I am doing a quick test by modifying graph_runtime.cc to enable heterogeneous execution of an annotated graph (i.e. each node carries a context attribute). I created a new interface that takes a GPU module and a CPU module along with two contexts. I allocate memory for both CPU and GPU in SetupStorage() and then insert TVMArrayCopyFromTo in SetupOpExecs() after checking the context of the node being executed.
However, I got an error message when module.run() is called: TVMError: [16:50:18] /Users/chzhi/tvm/src/runtime/module_util.cc:52: Check failed: ret == 0 (-1 vs. 0) Assert fail: (1 == tvm_struct_get(arg4, 0, 10)), Argument arg4.device_type has an unsatisfied constraint
I think I probably misunderstood the storage_id field, because I used it to get a node. Could you please comment on that? For convenience, I attached a diff for the quick test (http://www.mergely.com/oITKlwdM/). Thanks :)
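To give an idea of the kind of entry point I mean, a rough sketch (the global name and argument order here are only for illustration):

#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

// Illustrative only: a creation entry point that accepts one module and one
// context per device.
TVM_REGISTER_GLOBAL("tvm.graph_runtime.create_heter")
.set_body([](tvm::runtime::TVMArgs args, tvm::runtime::TVMRetValue* rv) {
  std::string graph_json = args[0];
  tvm::runtime::Module gpu_mod = args[1];
  tvm::runtime::Module cpu_mod = args[2];
  TVMContext gpu_ctx, cpu_ctx;
  gpu_ctx.device_type = static_cast<DLDeviceType>(args[3].operator int());
  gpu_ctx.device_id = args[4];
  cpu_ctx.device_type = static_cast<DLDeviceType>(args[5].operator int());
  cpu_ctx.device_id = args[6];
  // ... construct a GraphRuntime that keeps both modules/contexts and picks
  // the right one per node based on its context annotation.
});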
I am trying to execute the graph with different context information. A copy node is inserted in graph_runtime.cc (https://github.com/dmlc/tvm/blob/master/src/runtime/graph/graph_runtime.cc#L487) when the context of an input is different from the context of the current node. All args are collected and then passed in to create a tvm_op. But I failed to copy the data across devices. I debugged and logged the output of a node (say on CPU) that serves as the input of another node (say on GPU) in the fexec lambda function. It turned out the data was not copied across devices. Any idea what's wrong here? Thank you very much for your help :)
std::vector<DLTensor> args;
for (const auto& e : inode.inputs) {
  DLTensor* tensor = nullptr;
  const auto& input = nodes_[e.node_id];
  uint32_t eid = this->entry_id(e);
  DLTensor& etensor = data_entry_[eid];
  if (input.context != inode.context) {
    // Allocate a buffer with the same shape/dtype as the input on this
    // node's device and insert a cross-device data copy.
    if (inode.context == "gpu") {
      TVM_CCALL(TVMArrayAlloc(etensor.shape, etensor.ndim,
                              etensor.dtype.code, etensor.dtype.bits,
                              etensor.dtype.lanes, client_ctx_.device_type,
                              client_ctx_.device_id, &tensor));
    } else {
      TVM_CCALL(TVMArrayAlloc(etensor.shape, etensor.ndim,
                              etensor.dtype.code, etensor.dtype.bits,
                              etensor.dtype.lanes, ctx_.device_type,
                              ctx_.device_id, &tensor));
    }
    TVM_CCALL(TVMArrayCopyFromTo(&etensor, tensor, nullptr));
  } else {
    // inode and its current input are on the same device.
    tensor = &etensor;
  }
  args.push_back(*tensor);
  // args.push_back(data_entry_[this->entry_id(e)]);
}
Alternatively, we could call TVMArrayCopyFromTo in the exec lambda function, but that pays the copy cost at execution time… Inserting the copy node before runtime should be a cleaner solution.
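For completeness, the exec-lambda variant I mean would look roughly like this (just a sketch; the DLTensor headers are captured by value, and how this closure gets chained with the node's own exec is left out):

// Run-time alternative: perform the copy each time the consuming node runs,
// instead of once at setup.
DLTensor from = data_entry_[this->entry_id(e)];
DLTensor to = *tensor;  // buffer allocated on the consuming node's device
std::function<void()> copy_exec = [from, to]() mutable {
  TVM_CCALL(TVMArrayCopyFromTo(&from, &to, nullptr));
};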
@tqchen I am trying to register a copy_op, but I am not quite sure how to register FTVMCompute and FTVMSchedule. I think FTVMCompute could be registered as a trivial placeholder; we can then call TVMArrayCopyFromTo in the runtime when the copy_op is detected. Another possible way is to call TVMArrayCopyFromTo during registration, but it seems we only have Tensors, not DLTensors, at that point.
For FTVMSchedule, I am not clear on how to register it. Could you please provide some suggestions or pointers? Thanks.
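For concreteness, something along these lines is what I was imagining, though I am not sure it is correct (device_copy is the placeholder op name from before; the compute is just an identity and the schedule a plain create_schedule, assuming the runtime intercepts this op and performs the real copy):

#include <nnvm/compiler/op_attr_types.h>
#include <nnvm/op.h>
#include <tvm/operation.h>
#include <tvm/schedule.h>

// Sketch: identity compute as a placeholder plus a trivial schedule; the
// actual TVMArrayCopyFromTo would be issued by the runtime.
NNVM_REGISTER_OP(device_copy)
.set_attr<nnvm::compiler::FTVMCompute>(
    "FTVMCompute",
    [](const nnvm::NodeAttrs& attrs,
       const tvm::Array<tvm::Tensor>& inputs,
       const tvm::Array<tvm::Tensor>& out_info) {
      tvm::Tensor out = tvm::compute(
          inputs[0]->shape,
          [&](const tvm::Array<tvm::Var>& i) { return inputs[0](i); },
          "device_copy");
      return tvm::Array<tvm::Tensor>{out};
    })
.set_attr<nnvm::compiler::FTVMSchedule>(
    "FTVMSchedule",
    [](const nnvm::NodeAttrs& attrs,
       const tvm::Array<tvm::Tensor>& outs,
       const std::string& target) {
      tvm::Array<tvm::Operation> out_ops;
      for (const auto& t : outs) out_ops.push_back(t->op);
      return tvm::create_schedule(out_ops);
    });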