Pre-RFC: Multiple device support in Relay VM

kazimuth · October 8, 2021, 6:38pm

Currently the Relay VM only supports a single device per device type; the DeviceCopy handler hard-codes the destination device ID to 0:

apache/tvm/blob/main/src/runtime/vm/vm.cc#L685


      goto main_loop;
    }
    case Opcode::DeviceCopy: {
      auto tensor_src = ReadRegister(instr.src);
      NDArray src_data = Downcast<NDArray>(tensor_src);
      Device src_dev = src_data->device;
      ICHECK_EQ(static_cast<Index>(src_dev.device_type), instr.src_device_type);


      Device dst_dev;
      dst_dev.device_type = static_cast<DLDeviceType>(instr.dst_device_type);
      dst_dev.device_id = 0;


      NDArray dst_data = src_data.CopyTo(dst_dev);
      WriteRegister(instr.dst, dst_data);
      pc_++;
      goto main_loop;
    }
    default:
      LOG(FATAL) << "Unknown instruction opcode: " << int(instr.op);
  }
}

It would be useful to support multiple devices, e.g. for heterogeneous splitting of networks for data center workloads. I’ve been messing with this change on a private branch but I don’t have anything presentable yet. It is (technically) possible to represent this at the Relay IR level; the attributes for device copy & storage allocations have slots for (static) device IDs. However, the device ID information is currently thrown away during compilation for the VM. I’d like to change that.

The work to support heterogeneous execution has laid some groundwork here:

github.com/apache/tvm

[RELAY][VM] Enable heterogeneous execution for Relay VM

apache:master ← zhiics:hetero_vm

opened 11:28PM - 25 Aug 20 UTC

zhiics

+1631 -338

Currently, dynamic models can only be executed on CPU. GPU execution is not allowed for these models because they have shape functions that do runtime type inference. These functions may contain various control logic to derive the shape of a tensor at runtime, and they are never compute intensive, so they are designed to be executed on CPU. That being said, we must use the CPU to execute these functions even when trying to run the whole model on other devices. This PR enables heterogeneous execution for the Relay VM to support dynamic models on devices other than CPU. More specifically, it includes the following changes:

- [x] makes the memory_alloc and memory plan passes context aware when inserting vm/memory dialects.
- [x] designs a union-find based context analysis pass to analyze the device context of each IR node in a Relay program [Thanks @jroesch and @icemelon9 for help]
- [x] implements a DeviceCopy instruction in the VM to copy data directly across different devices.
- [x] enables GPU tests for various unit tests involving dynamic inputs/shape functions, namely those in test_any.py, test_adt.py, and test_vm.py, and dynamic namespace tests.
- [x] tests heterogeneous execution for the static cases used for graph runtime (test_pass_annotation.py)
- [x] fixes several bugs in the VM that are manifested by heterogeneous execution

Followup PRs will fix/add schedules for some ops to enable GPU execution for BERT and TF object detection models.

cc @icemelon9 @jroesch @mbrookhart @wweic

However, more invasive changes are needed. In particular, the VM bytecode format will need to be modified to include device IDs on AllocStorage and DeviceCopy, and that data will need to be plumbed through various compilation passes.
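
To make the bytecode change concrete, here is a rough sketch of a DeviceCopy instruction that carries device IDs next to the device types. The `Device` and `DeviceCopyInstr` structs and the `DstDevice` / `SrcMatches` helpers are simplified, hypothetical stand-ins I made up for illustration; they are not the actual definitions in the TVM runtime.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the runtime's device descriptor.
struct Device {
  int device_type;
  int device_id;
};

inline bool operator==(const Device& a, const Device& b) {
  return a.device_type == b.device_type && a.device_id == b.device_id;
}

// Hypothetical DeviceCopy instruction payload. The *_device_id fields are
// the proposed addition; the excerpt above only has the device types.
struct DeviceCopyInstr {
  int64_t src;  // source register
  int64_t dst;  // destination register
  int src_device_type;
  int src_device_id;
  int dst_device_type;
  int dst_device_id;
};

// Build the destination device from the instruction instead of
// hard-coding device_id = 0.
inline Device DstDevice(const DeviceCopyInstr& instr) {
  return Device{instr.dst_device_type, instr.dst_device_id};
}

// Check that a tensor's actual device matches what the instruction
// expects, comparing the ID as well as the type.
inline bool SrcMatches(const DeviceCopyInstr& instr, const Device& actual) {
  return actual == Device{instr.src_device_type, instr.src_device_id};
}
```

AllocStorage would presumably need the same pair of fields, and the serialized bytecode layout would have to change to match.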

Key questions:

What should the API for annotating modules with device information look like? It would be nice to support both homogeneous splitting (i.e. across a batch dimension) and heterogeneous splitting (anything else).
Should device selection be static or dynamic? Static is simpler to implement; dynamic would be more flexible and could, e.g., test the number of devices available and adapt based on that. The analysis passes determining device associations currently assume static device assignments.
How should we deal with constants? A simple implementation would change the relation between constants and devices from one-to-one to one-to-many. Alternatively, constants could all be logically associated with the CPU, and could be dynamically loaded to particular devices as needed.
How should this API be tested? I don’t believe the CI machines have multiple GPUs. One solution would be to implement a new device type, virtual CPU, which is pretty much the same as the regular CPU but allows multiple contexts to be instantiated, and forbids using tensors associated with one context with another.
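
For the virtual CPU idea, the behavior I have in mind is roughly the following. This is only a toy model: `VTensor`, `CopyTo`, and `Add` are hypothetical names for illustration, and a real implementation would live behind the runtime's device API rather than in free functions like these.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical tensor tagged with the ID of the virtual CPU context that
// owns it. All contexts share host memory; the tag exists only for checks.
struct VTensor {
  int context_id;
  std::vector<float> data;
};

// An explicit copy is the only sanctioned way to move data between contexts.
inline VTensor CopyTo(const VTensor& src, int dst_context_id) {
  return VTensor{dst_context_id, src.data};
}

// Element-wise add that refuses operands owned by different contexts,
// mimicking the failure you would get when mixing tensors from two GPUs.
inline VTensor Add(const VTensor& a, const VTensor& b) {
  if (a.context_id != b.context_id) {
    throw std::runtime_error("tensors live on different virtual CPU contexts");
  }
  if (a.data.size() != b.data.size()) {
    throw std::invalid_argument("shape mismatch");
  }
  VTensor out{a.context_id, std::vector<float>(a.data.size())};
  for (std::size_t i = 0; i < a.data.size(); ++i) {
    out.data[i] = a.data[i] + b.data[i];
  }
  return out;
}
```

With something like this, a test that forgets a device copy fails on a single-GPU or CPU-only CI machine, which is the point of the proposal.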

mbs-octoml · October 8, 2021, 8:10pm

Hi, there’s some overlap with https://github.com/apache/tvm-rfcs/pull/38. It tries to at least ensure device_id, as an uninterpreted int, is plumbed through from annotation to device_copy, parameter metadata, etc.

Your key questions, however, are very good ones and well outside the scope of the RFC:

Given a few device annotations we heuristically default devices for the rest of the program. But that’s a whole optimization problem in itself.
Right, it would be amazing to be able to shard a tensor across devices on the N dimension.
We consider the ‘devices’ we plan with to be ‘virtual devices’, but there’s currently no way to control the mapping from virtual to actual. We may want to choose the actual at runtime based on load, capabilities, etc.
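
As a strawman for the virtual-to-actual mapping, the simplest runtime policy would be round-robin over whatever devices are present. `MapVirtualToPhysical` below is a hypothetical helper, not an existing TVM function, and a real policy would presumably look at load and capabilities too.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical policy: assign each virtual device ID (the index into the
// result) to a physical device ID, cycling through the physical devices.
inline std::vector<int> MapVirtualToPhysical(int num_virtual, int num_physical) {
  if (num_virtual < 0) {
    throw std::invalid_argument("num_virtual must be non-negative");
  }
  if (num_physical <= 0) {
    throw std::invalid_argument("need at least one physical device");
  }
  std::vector<int> mapping;
  mapping.reserve(static_cast<std::size_t>(num_virtual));
  for (int v = 0; v < num_virtual; ++v) {
    mapping.push_back(v % num_physical);
  }
  return mapping;
}
```

A program planned against four virtual devices would then still run on a machine with two, at the cost of some device copies becoming same-device copies.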

For constants I think we could rewrite:

   @global_const = ...constant implicitly on device A...
   .... device_copy(@global_const, A, B) ...
   .... device_copy(@global_const, A, C) ...

by partially evaluating the device_copy and hoisting the result into new constants on B and C. However, we currently don’t have a way to represent globally bound constants.
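
The bookkeeping for that hoisting could look something like the sketch below: collect every device_copy of a constant and create one hoisted constant per (constant, destination device) pair. `CopyUse`, `HoistCopies`, and the `_on_` naming scheme are all hypothetical; this models only the deduplication, not the actual IR rewrite.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one device_copy(@constant, src, dst) call site.
struct CopyUse {
  std::string constant;
  std::string src_device;
  std::string dst_device;
};

using HoistKey = std::pair<std::string, std::string>;  // (constant, dst device)

// Return the name of the hoisted constant for each distinct
// (constant, destination device) pair. Repeated copies to the same
// device share a single hoisted constant.
inline std::map<HoistKey, std::string> HoistCopies(const std::vector<CopyUse>& uses) {
  std::map<HoistKey, std::string> hoisted;
  for (const CopyUse& use : uses) {
    HoistKey key = std::make_pair(use.constant, use.dst_device);
    if (hoisted.count(key) == 0) {
      hoisted[key] = use.constant + "_on_" + use.dst_device;
    }
  }
  return hoisted;
}
```

Each call site would then be rewritten to reference the hoisted name directly, so the copy happens once at load time instead of on every execution.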