Does Unity support distributed model inference, such as across multiple GPUs?

With the popularity of LLMs, NLP models are getting larger and larger. Even after quantization, it is still hard to fit one LLM into a single GPU's memory, so running LLMs across multiple hosts and multiple GPUs may be the practical solution right now.

I’m wondering whether we can deploy a large model across multiple GPUs with tvm-unity.

Based on my previous understanding of TVM, if one wants to run a large model with TVM, he/she can:

  1. Split the model into multiple sub-graphs, compile them one by one, and run them with the TVM pipeline, so that different sub-graphs can be placed on different GPUs depending on the available hardware resources (see the first sketch below).
  2. Maybe use heterogeneous execution for Relax with VDevice (see the second sketch below): [RFC][Unity][Relax] Heterogeneous Execution for Relax - Development / unity - Apache TVM Discuss
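
For option 1, here is a minimal sketch of the idea, assuming the model has already been split by hand into two Relax IRModules `stage1_mod` and `stage2_mod` (both hypothetical names here) that have gone through the usual lowering passes for the CUDA target; the intermediate activation is moved between GPUs through a host copy just to keep the example simple:

```python
import numpy as np
import tvm
from tvm import relax

target = tvm.target.Target("cuda")

# Compile each sub-graph independently.
ex1 = relax.build(stage1_mod, target)  # stage1_mod: hypothetical first half of the model
ex2 = relax.build(stage2_mod, target)  # stage2_mod: hypothetical second half

# One VM per GPU.
dev0, dev1 = tvm.cuda(0), tvm.cuda(1)
vm1 = relax.VirtualMachine(ex1, dev0)
vm2 = relax.VirtualMachine(ex2, dev1)

# Run stage 1 on GPU 0.
x = tvm.nd.array(np.random.rand(1, 4096).astype("float32"), dev0)
hidden = vm1["main"](x)

# Hand the intermediate activation to GPU 1 (host round-trip here; a real
# deployment would want a device-to-device transfer) and run stage 2.
out = vm2["main"](tvm.nd.array(hidden.numpy(), dev1))
```

This only gives pipeline-style placement, not tensor parallelism, so each GPU still has to hold the full weights of its stage.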
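For option 2, if I read the RFC correctly, the module's global infos declare the available virtual devices and `R.to_vdevice` moves a tensor between them, roughly in the following style (the shapes and the add/multiply ops are just placeholders, not a real LLM graph, and the exact TVMScript syntax should be checked against the RFC):

```python
import tvm
from tvm.script import ir as I, relax as R

@I.ir_module
class HeteroExample:
    # Declare the virtual devices this module may place tensors on.
    I.module_global_infos(
        {"vdevice": [I.vdevice("llvm"), I.vdevice("cuda", 0)]}
    )

    @R.function
    def main(
        x: R.Tensor((2, 3), "float32", "llvm"),
        y: R.Tensor((2, 3), "float32", "llvm"),
        z: R.Tensor((2, 3), "float32", "cuda"),
    ) -> R.Tensor((2, 3), "float32", "cuda"):
        with R.dataflow():
            # Computed on the CPU vdevice.
            lv: R.Tensor((2, 3), "float32", "llvm") = R.add(x, y)
            # Explicit cross-device copy, then compute on the CUDA vdevice.
            lv1: R.Tensor((2, 3), "float32", "cuda") = R.to_vdevice(lv, "cuda")
            gv: R.Tensor((2, 3), "float32", "cuda") = R.multiply(lv1, z)
            R.output(gv)
        return gv
```

This covers placing different parts of one graph on different devices within a single process, which is one piece of what multi-GPU LLM serving needs, but not yet the multi-host case.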

Thanks for chiming in, this is indeed one area we should push for, and some of the items you listed are in the right direction.