Hi @shiy10. I’m glad you enjoyed the talk, and thank you for the
questions.
To my understanding, the tvm runtime executes the OPs in a
sequential order (graph_executor.cc),
From the point of view of the relay executor, that is correct.
One key distinction is between TVM’s runtime for executing relay
graphs (GraphExecutor, vm::Executable, and AOT), and TVM’s runtime
for calling device-specific APIs (e.g. CUDAModuleNode and
VulkanModuleNode).
The relay runtimes are for executing full end-to-end models described
in a relay graph, while the latter are for executing individual
compute kernels described in a low-level TIR graph. The
GraphExecutor::Run method is where these two runtimes interact, as
the relay runtime makes calls into a device-specific runtime to
execute an operation. Each element of the op_execs_ vector was
originally populated either with a device-to-device memory copy or
with a function call to execute device-specific calls.
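As a rough illustration of that sequential loop (a simplified sketch,
not TVM's actual code), the executor can be thought of as a vector of
pre-bound callables that are invoked in order. ToyGraphExecutor and
AddOp are hypothetical names used only for this example; only the
op_execs_ name is taken from the description above.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a graph executor; not TVM's actual class.
class ToyGraphExecutor {
 public:
  // Each entry stands for either a device-to-device copy or a kernel
  // launch, already bound to its arguments when the graph was set up.
  void AddOp(std::function<void()> op) { op_execs_.push_back(std::move(op)); }

  // Invoke every populated entry in order; empty slots are skipped.
  void Run() {
    for (auto& op : op_execs_) {
      if (op) op();
    }
  }

 private:
  std::vector<std::function<void()>> op_execs_;
};
```

The relay-level loop knows nothing about devices here. Whatever device
handling is needed is hidden inside each callable.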
and we run these OPs in a single device (maybe I’m wrong about this)
Typically true today, but not necessarily the case. The exact
behavior depends on which level of abstraction you’re working at.
- Compile-time Graph executor/VM/AOT codegen - The function should be
executed on whichever device is chosen at this step. Inserts array
allocations on the appropriate devices, and any device-to-device
copies that are required.
(This step has some active development to hoist the device/memory
planning so the information may be a bit outdated, but the general
steps are there. I believe that one of the goals is to reduce code
duplication between the three relay runtimes by pulling
device/memory planning into earlier steps.)
- Run-time Graph executor/VM/AOT - The function should be executed on
whichever device is selected at compile-time, but no explicit
handling occurs at this level. Executes array allocations and
device-to-device copies as determined at compile-time. For each
operation, pass the arrays as arguments to an LLVM module that
handles the packed API.
- LLVM module handling API - The function should be executed on
whichever device owns the argument buffers. The generated LLVM code
checks that all argument buffers belong to the same device (part of
the binder.BindDLTensor call), and calls DeviceAPI::SetDevice (by
inserting a call to symbol::tvm_set_device) prior to any calls into
a device-specific module.
- Device-specific module - A kernel should be executed on whichever
device is currently active (from DeviceAPI::SetDevice), queued onto
whichever stream is currently active (from DeviceAPI::SetStream).
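To make the third level above more concrete, here is a minimal sketch
of the "check that all buffers share a device, then set the device,
then call the kernel" sequence. This is hand-written C++ for
illustration, whereas TVM emits the equivalent logic as generated
code. ToyDevice, ToyTensor, ToyDeviceAPI and CallWithDeviceCheck are
all hypothetical names, loosely modelled on DLDevice/DLTensor.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for DLDevice and DLTensor; only the fields
// needed for the device check are modelled.
struct ToyDevice {
  int device_type;
  int device_id;
};
struct ToyTensor {
  ToyDevice device;
};

// Records the device selected by the most recent SetDevice-style call.
struct ToyDeviceAPI {
  int active_device_id = -1;
  void SetDevice(const ToyDevice& dev) { active_device_id = dev.device_id; }
};

// Verify that every argument lives on one device, select that device,
// and only then invoke the kernel.
template <typename Kernel>
void CallWithDeviceCheck(ToyDeviceAPI& api, const std::vector<ToyTensor>& args,
                         Kernel kernel) {
  if (args.empty()) throw std::invalid_argument("no argument buffers");
  const ToyDevice& first = args.front().device;
  for (const ToyTensor& t : args) {
    if (t.device.device_type != first.device_type ||
        t.device.device_id != first.device_id) {
      throw std::runtime_error("argument buffers are on different devices");
    }
  }
  api.SetDevice(first);
  kernel();
}
```

In this sketch the caller never names a device. It is inferred from
the buffers, which matches the "whichever device owns the argument
buffers" rule at that level.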
There’s been a lot of recent work towards heterogeneous compute, with
a relay graph being executed
Q1, what is the role of the class CUDAThreadEntry ?
Stores per-CPU-thread information. Both DeviceAPI::SetDevice and
DeviceAPI::SetStream apply to the current CPU thread, similar to how
CUDA handles cudaSetDevice. Each CPU thread has its own copy of
CUDAThreadEntry, which is then read to determine which device should
execute a kernel. This is at the “Device-specific module” level.
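A minimal sketch of the per-thread idea, assuming a plain C++
thread_local object rather than TVM's own thread-local helper.
ToyThreadEntry, SetActiveDevice, GetActiveDevice and
ActiveDeviceOnWorker are hypothetical names for this example, and the
real CUDAThreadEntry holds more than a device id.

```cpp
#include <thread>

// Hypothetical analogue of a per-thread entry; fields are illustrative.
struct ToyThreadEntry {
  int device_id = 0;

  // Each CPU thread that calls this gets its own instance.
  static ToyThreadEntry* ThreadLocal() {
    thread_local ToyThreadEntry entry;
    return &entry;
  }
};

// Setting the device only affects the calling thread.
void SetActiveDevice(int id) { ToyThreadEntry::ThreadLocal()->device_id = id; }

// A kernel launch would read this to decide where to run.
int GetActiveDevice() { return ToyThreadEntry::ThreadLocal()->device_id; }

// Start a worker thread, select a device on it, and report what that
// worker observes as its active device.
int ActiveDeviceOnWorker(int worker_device) {
  int seen = -1;
  std::thread worker([&] {
    SetActiveDevice(worker_device);
    seen = GetActiveDevice();
  });
  worker.join();
  return seen;
}
```

Selecting a device on the worker leaves the calling thread's choice
untouched, which is the same behaviour cudaSetDevice has.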
Q2, when adding a new device, can I ignore this class and do not
implement it ? (just as the instructions given in the tutorial) ?
It isn’t strictly needed for initial implementations, but the
per-CPU-thread semantics for the active device should be implemented
in some manner to handle multiple devices being used from multiple
host threads. For example, TVM’s Vulkan runtime stores the
per-CPU-thread active device in a tvm::runtime::ThreadMap
(e.g. VulkanDeviceAPI::active_device_id_per_thread), rather than
having a globally accessible thread-local entry.
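For a new device, one way to get the same semantics without a
thread-local global is a small map keyed by thread id that lives
inside the device API object. The sketch below is a simplified,
hypothetical analogue and does not reproduce the interface of
tvm::runtime::ThreadMap. ToyThreadMap and its method names are
assumptions made for this example.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_map>

// Simplified per-thread map owned by an object instead of being a
// global thread_local. Hypothetical; not TVM's ThreadMap API.
template <typename T>
class ToyThreadMap {
 public:
  // Return the calling thread's value, value-initializing it on first
  // use. References into std::unordered_map stay valid after inserts.
  T& GetOrMake() {
    std::lock_guard<std::mutex> lock(mutex_);
    return values_[std::this_thread::get_id()];
  }

  // Number of threads that have stored a value so far.
  std::size_t Size() {
    std::lock_guard<std::mutex> lock(mutex_);
    return values_.size();
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::thread::id, T> values_;
};
```

A device API could hold a ToyThreadMap<int> member for the active
device id, so the state is scoped to that API instance while still
being separate for each host thread.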