@tqchen @jroesch
Would you like to add comments on this topic?
Large models are an AI trend. If the TVM community does not yet have a plan to support model-parallel inference, can you comment on its technical feasibility?
This topic is very interesting. We currently have a pending RFC/PR ([RFC] Compute graph pipeline with new subgraph executor) related to model parallelism. It was not designed for model parallelism, but it does some of the work that model parallelism needs, such as horizontally splitting the model, pipeline execution, reducing memory requirements, and cross-device memory movement. With the help of TVM RPC, the devices/targets can also be distributed.
I think it could help with large model deployment, as long as the communication operators are bypassed.
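To make the split-and-offload idea concrete, here is a minimal sketch (not the RFC's actual pipeline executor API) that manually cuts a toy model into two Relay sub-graphs, builds them separately, and chains them so the intermediate tensor is copied between devices. Both stages use the local CPU here, but the same pattern works with different GPUs or RPC remote devices:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Stage 1: first half of the model (dense + relu), assumed to run on device 0.
x = relay.var("x", shape=(1, 64), dtype="float32")
w1 = relay.var("w1", shape=(128, 64), dtype="float32")
stage1 = relay.Function([x, w1], relay.nn.relu(relay.nn.dense(x, w1)))

# Stage 2: second half (another dense layer), assumed to run on device 1.
h = relay.var("h", shape=(1, 128), dtype="float32")
w2 = relay.var("w2", shape=(10, 128), dtype="float32")
stage2 = relay.Function([h, w2], relay.nn.dense(h, w2))

target = "llvm"
dev0, dev1 = tvm.cpu(0), tvm.cpu(0)  # stand-ins; could be two GPUs or RPC devices

lib1 = relay.build(tvm.IRModule.from_expr(stage1), target=target)
lib2 = relay.build(tvm.IRModule.from_expr(stage2), target=target)
m1 = graph_executor.GraphModule(lib1["default"](dev0))
m2 = graph_executor.GraphModule(lib2["default"](dev1))

# Run stage 1, move the intermediate tensor across devices, run stage 2.
m1.set_input("x", np.random.rand(1, 64).astype("float32"))
m1.set_input("w1", np.random.rand(128, 64).astype("float32"))
m1.run()
inter = m1.get_output(0).numpy()  # the cross-device copy happens here
m2.set_input("h", inter)
m2.set_input("w2", np.random.rand(10, 128).astype("float32"))
m2.run()
print(m2.get_output(0).shape)
```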
@hjiang
Thanks for your suggestion.
Yes, we did consider this method: split the computation graph and offload the sub-graphs to different devices.
The drawback of this method is that it is not scalable, and some large models like GPT-2 have the mechanism of model parallelism for inference built in.
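To illustrate what that built-in model parallelism looks like, here is a single-process numpy sketch of an intra-layer split (the shapes and two-way split are arbitrary): the weight of one linear layer is sharded column-wise across two hypothetical devices, each device computes a partial result, and a communication step combines them (an allgather here; a row-wise split would end with an allreduce):

```python
import numpy as np

x = np.random.rand(1, 64).astype("float32")    # activations, replicated on both devices
w = np.random.rand(64, 256).astype("float32")  # full weight of one linear layer

w0, w1 = np.split(w, 2, axis=1)                # shard columns across two devices
y0 = x @ w0                                    # computed on device 0
y1 = x @ w1                                    # computed on device 1
y = np.concatenate([y0, y1], axis=1)           # communication step (allgather)

assert np.allclose(y, x @ w, atol=1e-5)        # matches the unsharded layer
```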
@yezhouhai Could you share some information about how frameworks like PyTorch handle parallel inference? More specifically, who is responsible for assigning parts of the model to a device?
PyTorch uses its distributed (DDP) components to handle distributed training/inference. It provides communication primitives and the DDP optimizer. It is the user's responsibility to split the model.
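For reference, a minimal example of those communication primitives (assuming a torchrun-style launch with one process per device); the user decides which shard each rank computes, and the collective combines the results:

```python
import torch
import torch.distributed as dist

# Launch with e.g.: torchrun --nproc_per_node=2 this_script.py
dist.init_process_group(backend="gloo")   # "nccl" on GPUs
rank = dist.get_rank()

# Each rank holds its partial result; all_reduce sums them in place.
partial = torch.full((4,), float(rank))
dist.all_reduce(partial, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {partial}")
```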
We finally managed to enable model-parallel inference with TVM.
Steps:
1. Hook the PyTorch DDP primitives. allreduce/allgather in PyTorch are Python APIs, not operators, so they can't be captured by jit trace. Instead, I replaced them with dummy operators (allreduce, allgather) in PyTorch's ATen (very few lines, about 8).
2. Then torch.jit.trace/script can capture the allreduce/allgather operators in the model.
3. Add allreduce/allgather op support in TVM. This means integrating a communication library into TVM, and this step really takes a lot of engineering work (a rough sketch of the hook pattern follows this list).
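Regarding step 3: a real integration means new Relay operators plus lowering to a library such as NCCL, which is mostly C++ work. The sketch below only shows the contrib-style hook pattern of exposing a collective to the TVM runtime as a global packed function; the name tvm.contrib.dist.allreduce and the torch.distributed/gloo backend are stand-ins for illustration, not an existing TVM API:

```python
import numpy as np
import torch
import torch.distributed as dist
import tvm

# Single-process group so the example runs standalone; a real deployment
# would launch one process per device (e.g. via torchrun).
dist.init_process_group(backend="gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)

# Expose a collective to the TVM runtime as a global packed function,
# mirroring how contrib libraries (cblas, cudnn, ...) are hooked in.
# "tvm.contrib.dist.allreduce" is a made-up name for illustration.
@tvm.register_func("tvm.contrib.dist.allreduce")
def _allreduce(inp, out):
    t = torch.from_numpy(inp.numpy())
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # a real backend would be NCCL, in C++
    out.copyfrom(t.numpy())

# A lowered allreduce op (or a te.extern call) would invoke it like this:
f = tvm.get_global_func("tvm.contrib.dist.allreduce")
a = tvm.nd.array(np.ones((4,), dtype="float32"))
b = tvm.nd.empty((4,), dtype="float32")
f(a, b)
print(b.numpy())
```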
For steps 1 and 2, maybe there's an easier way to add the allreduce/allgather operators to the Relay graph.
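One candidate for such an easier way (a sketch assuming a recent PyTorch with torch.library; the tvm_dist namespace is made up) is to register the dummy collectives as custom operators from Python instead of patching ATen, so torch.jit.trace records them as graph nodes that a converter could later map to Relay ops:

```python
import torch
from torch import Tensor

# Register a custom namespace with a dummy allreduce op. Unlike
# torch.distributed.all_reduce (a Python API), a registered operator shows up
# as a node in the traced graph, which a TVM frontend could then convert.
lib = torch.library.Library("tvm_dist", "DEF")   # hypothetical namespace
lib.define("allreduce(Tensor x) -> Tensor")

def allreduce_cpu(x: Tensor) -> Tensor:
    # Dummy implementation: identity. The real collective would be executed
    # later by the TVM runtime after the op is lowered.
    return x.clone()

lib.impl("allreduce", allreduce_cpu, "CPU")

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(64, 64)

    def forward(self, x):
        y = self.fc(x)
        # Captured as a tvm_dist::allreduce node in the TorchScript graph.
        return torch.ops.tvm_dist.allreduce(y)

traced = torch.jit.trace(Block(), torch.randn(1, 64))
print(traced.graph)   # the allreduce node is visible here
```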