Graph partitioning and Heterogeneous Execution

zhiics · July 25, 2018, 5:00pm

Graph partitioning can be divided into two parts:

partitioning, it partitions the graph into subgraphs that are suitable for different devices. There are two options for graph level optimizations after we obtain the partitioned subgraphs.
a) Do graph level opts, such as fusion and precompute, on each subgraph, and then replace the original subgraph with its optimized counterpart.
b) Group the subgraphs together and perform graph level opts on a “single” graph, then split the subgraphs and replace the original ones.
It seems there is not much difference between these two methods. The second one might be cleaner.
Nodes can be annotated with context info during partitioning. Data copy operators can be inserted when the subgraphs are being replaced.
compilation and runtime, we have agreed that compile the subgraphs in a single graph would be more convenient because it eases the work of reconstructing the subgraphs and keeps the runtime cleaner.

Some concerns about how much modification of runtime is required.

The current build API only takes one target. We need to modify this API or add a new one to adapt multiple contexts.
We will need to load modules with different libs. Some work should be done for runtime to make it work.

Any comments and suggestions are greatly appreciated.

tqchen · July 25, 2018, 4:59pm

The most common use cases so far is fallback to CPU, I would recommend we start from there, and this do not require modification to the current interface

zhiics · July 26, 2018, 4:09pm

@tqchen Thanks for the suggestion. That essentially means all nodes are in a single graph and executed on CPU, right? I am currently trying to have a test to verify that the components on the critical path would work, e.g. annotating nodes, inserting data copying nodes, compiling a single graph, and executing in runtime, etc.

tqchen · July 26, 2018, 4:10pm

What I mean is to allow nodes to be executed on cpu or a target device(most cases gpu)

zhiics · July 26, 2018, 4:11pm

I see, thanks. That’s what I am right now doing.

zhiics · July 26, 2018, 8:18pm

I am trying to register a cross device data copy operator. The inputs would be the output of one device and the output should be the input of another device. I think it should be compiled into TVM later. How should the operator look like? Thanks.

tqchen · July 26, 2018, 9:38pm

This can be treatly specially at runtime, via copy, and runtime directly calls into TVMArrayCopyFromTo

zhiics · July 31, 2018, 12:41am

@tqchen I am doing a quick test by modifying graph_runtime.cc to enable heterogeneous execution of an annotated graph (e.g. each node is attached with an attribute context). I created a new interface that takes a GPU module and a CPU module with two context information. I tried to allocated memory for both CPU and GPU in SetupStorage() and then inserted TVMArrayCopyFromTo in SetupOpExecs after checking the context of the node being executing.

However, I got an error message when module.run() is called: TVMError: [16:50:18] /Users/chzhi/tvm/src/runtime/module_util.cc:52: Check failed: ret == 0 (-1 vs. 0) Assert fail: (1 == tvm_struct_get(arg4, 0, 10)), Argument arg4.device_type has an unsatisfied constraint

I think I probably misunderstood the storage_id field because I used it to get a node. Could you please comment on it? To make it more convenient, I attached a diff for the quick test (http://www.mergely.com/oITKlwdM/). Thanks:)

zhiics · August 1, 2018, 8:51pm

I am trying to execute the graph with different context information. A copy node is inserted in graph_runtime.cc (https://github.com/dmlc/tvm/blob/master/src/runtime/graph/graph_runtime.cc#L487) when the context of a input is different from the context the current node. All args are collected and then sent to create a tvm_op. But I failed to copy the data across devices. I debugged and logged the output of a node (suppose on CPU) that severs as the input of another node (suppose on GPU) in the fexec lambda function. It turned out data is not copied through different devices. Any idea about what’s wrong here? Thank you very much for your help:)

  std::vector<DLTensor> args;
  for (const auto& e : inode.inputs) {
   DLTensor* tensor = nullptr;
   const auto& input = nodes_[e.node_id];
   uint32_t eid = this->entry_id(e);
   DLTensor& etensor = data_entry_[eid];

   if (input.context != inode.context) {
     // allocate the exact space as the input and insert a cross device data copy operation
     if (inode.context == "gpu") {
       TVM_CCALL(TVMArrayAlloc(etensor.shape, etensor.ndim,
                               etensor.dtype.code, etensor.dtype.bits,
                               etensor.dtype.lanes, client_ctx_.device_type,
                               client_ctx_.device_id, &tensor));
     } else {
       TVM_CCALL(TVMArrayAlloc(etensor.shape, etensor.ndim,
                               etensor.dtype.code, etensor.dtype.bits,
                               etensor.dtype.lanes, ctx_.device_type,
                               ctx_.device_id, &tensor));
     }
     TVM_CCALL(TVMArrayCopyFromTo(&etensor, tensor, nullptr));
   } else {
     // inode and its current input are on the same device
     tensor = &etensor;
   }
   args.push_back(*tensor);
   // args.push_back(data_entry_[this->entry_id(e)]);
  }

tqchen · August 1, 2018, 9:11pm

The static graph execution should be done as follows:

Statically allocate the tensor space for each node, be it gPU or cpu
Introduce a special __copy op, that is not tvm_op, and call TVMArrayCopyFromTo to execute the copy

zhiics · August 1, 2018, 9:18pm

Yes, I allocated space for each node on either CPU or GPU.

So the __copy op should be a node that connects two nodes that run on different devices, right? If so, then should it be done before runtime? Thanks.

tqchen · August 1, 2018, 9:46pm

Yes, I think copy node should be inserted before runtime

zhiics · August 1, 2018, 9:51pm

Thanks a lot:) That is exactly I was thinking about…

But I thought TVMArrayCopyFromTo could do the magic at runtime. It seems this is not gonna work. Thanks again.

zhiics · August 1, 2018, 9:54pm

Or we could call TVMArrayCopyFromTo in the exec lambda function. But it pays execution cost at runtime… Inserting copy node before runtime should be a cleaner solution.

zhiics · August 4, 2018, 12:24am

@tqchen I am trying to register a copy_op, but I am not quite sure how to register FTVMCompute and FTVMSchedule. I think FTVMCompute could be registered for doing anything. Then we can call TVMArrayCopyFromTo in runtime when the copy_op is detected. Another possible way is to call TVMArrayCopyFromTo during registration, but it seems we only have Tensors not DLTensors at this point.

For FTVMSchedule, I am not clear how to register it. Could you please provide some suggestions or pointers? Thanks.

tqchen · August 4, 2018, 1:06am

Yes. You don’t have to register tvm compute if we treat it specially in runtime

zhiics · August 4, 2018, 1:09am

Thanks. But we will fail to lower if we don’t register tvm compute and schedule, right?

zhiics · August 4, 2018, 1:14am

Or can we somehow skip them during lowering but just create another type of op, say tvm_copy_op?

tqchen · August 15, 2018, 10:24pm

we can skip the lowering of copy op completely in the pipeline

zhiics · August 16, 2018, 4:26am

Thanks. That’s what I did… It eases registration of the copy op and also lowering…