Questions about ThreadEntry in runtime module

shiy10 · December 20, 2021, 11:56am

Hi, I watched the developer tutorial given by @Lunderberg at TVM Conf 21. The talk was a great help in understanding the outline of adding a new device.

However, after checking the source code of the CUDA runtime, I have the following questions about ThreadEntry (CUDAThreadEntry), which is not covered in the tutorial.

/*! \brief Thread local workspace */
class CUDAThreadEntry {
 public:
  /*! \brief The cuda stream */
  cudaStream_t stream{nullptr};
  /*! \brief thread local pool*/
  WorkspacePool pool;
  /*! \brief constructor */
  CUDAThreadEntry();
  // get the threadlocal workspace
  static CUDAThreadEntry* ThreadLocal();
};

To my understanding, the tvm runtime executes the OPs in a sequential order (graph_executor.cc), and we run these OPs in a single device (maybe I’m wrong about this); if so,

Q1: What is the role of the class CUDAThreadEntry?

Q2: When adding a new device, can I ignore this class and not implement it (just as in the instructions given in the tutorial)?

Looking forward to your help.

Lunderberg · December 29, 2021, 4:39pm

Hi @shiy10. I’m glad you enjoyed the talk, and thank you for the questions.

To my understanding, the tvm runtime executes the OPs in a sequential order (graph_executor.cc),

From the point of view of the relay executor, that is correct.

One key distinction is between TVM’s runtime for executing Relay graphs (GraphExecutor, vm::Executable, and AOT) and TVM’s runtime for calling device-specific APIs (e.g. CUDAModuleNode and VulkanModuleNode). The Relay runtimes are for executing full end-to-end models described in a Relay graph, while the latter are for executing individual compute kernels described in a low-level TIR graph. The GraphExecutor::Run method is where these two runtimes interact, as the Relay runtime makes calls into a device-specific runtime to execute an operation. Each element of the op_execs_ vector was originally populated either with a device-to-device memory copy or with a function call to execute device-specific calls.
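As a rough illustration of that interaction, here is a toy Python model (not TVM’s actual code) of a Relay-level executor stepping through a list of callables, in the spirit of op_execs_. The names make_op_execs and run, the logged operation strings, and the empty slot are all hypothetical.

```python
from typing import Callable, List, Optional


def make_op_execs(log: List[str]) -> List[Optional[Callable[[], None]]]:
    """Build a hypothetical op list: each entry is either a memory copy or a
    call into a device-specific runtime. Here both just append to a log."""
    return [
        lambda: log.append("copy: cpu -> gpu"),
        lambda: log.append("kernel: conv2d on gpu"),
        None,  # assumption for this sketch: a slot with nothing to execute
        lambda: log.append("kernel: relu on gpu"),
        lambda: log.append("copy: gpu -> cpu"),
    ]


def run(op_execs: List[Optional[Callable[[], None]]]) -> None:
    """Execute every populated entry strictly in order."""
    for op_exec in op_execs:
        if op_exec is not None:
            op_exec()


if __name__ == "__main__":
    log: List[str] = []
    run(make_op_execs(log))
    for line in log:
        print(line)
```

The executor itself knows nothing about devices in this model; each callable carries whatever device-specific behaviour it needs.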

and we run these OPs in a single device (maybe I’m wrong about this)

Typically true today, but not necessarily the case. The exact behavior depends on which level of abstraction you’re working at.

Compile-time Graph executor/VM/AOT codegen - The function should be executed on whichever device is chosen at this step. Inserts array allocations on the appropriate devices, and any device-to-device copies that are required. (This step has some active development to hoist the device/memory planning, so the information may be a bit outdated, but the general steps are there. I believe that one of the goals is to reduce code duplication between the three Relay runtimes by pulling device/memory planning into earlier steps.)

Run-time Graph executor/VM/AOT - The function should be executed on whichever device is selected at compile-time, but no explicit handling occurs at this level. Executes array allocations and device-to-device copies as determined at compile-time. For each operation, passes the arrays as arguments to an LLVM module that handles the packed API.

LLVM module handling API - The function should be executed on whichever device owns the argument buffers. The generated LLVM code contains checks to verify that all argument buffers belong to the same device (part of the binder.BindDLTensor call), and calls DeviceAPI::SetDevice (by inserting a call to symbol::tvm_set_device) prior to any calls into a device-specific module.

Device-specific module - A kernel should be executed on whichever device is currently active (from DeviceAPI::SetDevice), queued onto whichever stream is currently active (from DeviceAPI::SetStream).
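The last two levels can be sketched with a small Python model. This is only an illustration under stated assumptions, not the generated LLVM code: ToyDevice, ToyBuffer, ToyDeviceAPI and call_packed are hypothetical names, and the check stands in for what binder.BindDLTensor and the tvm_set_device call do.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ToyDevice:
    device_type: str
    device_id: int


@dataclass
class ToyBuffer:
    name: str
    device: ToyDevice


class ToyDeviceAPI:
    """Stand-in for a device-specific runtime: remembers the active device."""

    def __init__(self) -> None:
        self.active: Optional[ToyDevice] = None
        self.launched: List[Tuple[str, Optional[ToyDevice]]] = []

    def set_device(self, device: ToyDevice) -> None:
        self.active = device

    def launch(self, kernel_name: str) -> None:
        # The kernel runs on whichever device is currently active.
        self.launched.append((kernel_name, self.active))


def call_packed(api: ToyDeviceAPI, kernel_name: str, buffers: List[ToyBuffer]) -> None:
    """Verify all buffers share one device, activate it, then launch."""
    devices = {buf.device for buf in buffers}
    if len(devices) != 1:
        raise ValueError(f"arguments span {len(devices)} devices, expected exactly 1")
    api.set_device(devices.pop())
    api.launch(kernel_name)


if __name__ == "__main__":
    gpu1 = ToyDevice("cuda", 1)
    api = ToyDeviceAPI()
    call_packed(api, "add", [ToyBuffer("a", gpu1), ToyBuffer("b", gpu1)])
    print(api.launched)
```

Note that the device is inferred from the arguments here; the caller never names it explicitly.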

There’s been a lot of recent work towards heterogeneous compute, with a relay graph being executed

Q1: What is the role of the class CUDAThreadEntry?

Stores per-CPU-thread information. Both DeviceAPI::SetDevice and DeviceAPI::SetStream apply to the current CPU thread, similar to how CUDA handles cudaSetDevice. Each CPU thread has its own copy of CUDAThreadEntry, which is then read to determine which device should execute a kernel. This is at the “Device-specific module” level.
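To make the per-thread behaviour concrete, here is a minimal pure-Python sketch using threading.local. It only mimics the shape of CUDAThreadEntry (a stream field and a ThreadLocal-style accessor); ToyThreadEntry, set_stream, launch_kernel and demo are hypothetical names, and the streams are plain strings rather than real CUDA streams.

```python
import threading


class ToyThreadEntry:
    """Per-thread state, loosely modelled on the fields of CUDAThreadEntry."""

    _tls = threading.local()

    def __init__(self):
        self.stream = None  # stand-in for `cudaStream_t stream{nullptr}`

    @classmethod
    def thread_local(cls):
        """Return the calling thread's own entry, creating it on first use."""
        entry = getattr(cls._tls, "entry", None)
        if entry is None:
            entry = cls()
            cls._tls.entry = entry
        return entry


def set_stream(stream):
    # Affects only the calling thread, like DeviceAPI::SetStream.
    ToyThreadEntry.thread_local().stream = stream


def launch_kernel(name):
    # A launch reads the calling thread's entry to pick the stream.
    return (name, ToyThreadEntry.thread_local().stream)


def demo():
    results = {}
    barrier = threading.Barrier(2)

    def worker(stream):
        set_stream(stream)
        barrier.wait()  # both threads have set their stream before either launches
        results[stream] = launch_kernel("add")

    threads = [threading.Thread(target=worker, args=(s,)) for s in ("stream1", "stream2")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The main thread never called set_stream, so it still sees the default.
    return launch_kernel("add"), results


if __name__ == "__main__":
    print(demo())
```

Each worker sees only the stream it set itself, even though both go through the same accessor.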

Q2: When adding a new device, can I ignore this class and not implement it (just as in the instructions given in the tutorial)?

It isn’t strictly needed for initial implementations, but the per-CPU-thread semantics for the active device should be implemented in some manner to handle multiple devices being used from multiple host threads. For example, TVM’s Vulkan runtime stores the per-CPU-thread active device in a tvm::runtime::ThreadMap (e.g. VulkanDeviceAPI::active_device_id_per_thread), rather than having a globally accessible thread-local entry.
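The alternative of keeping per-thread values in a map owned by the device API object could look roughly like the sketch below. This is a Python approximation of the idea only, not the real tvm::runtime::ThreadMap interface; ToyThreadMap, SketchDeviceAPI and their methods are hypothetical, and the default device id of 0 is an assumption.

```python
import threading


class ToyThreadMap:
    """Maps the calling thread's id to a value, guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def get(self, default=None):
        with self._lock:
            return self._values.get(threading.get_ident(), default)

    def set(self, value):
        # Note: thread ids can be reused after a thread exits; a real
        # implementation would need to account for that.
        with self._lock:
            self._values[threading.get_ident()] = value


class SketchDeviceAPI:
    """Owns the per-thread state as a member instead of a global thread-local."""

    def __init__(self):
        self.active_device_id_per_thread = ToyThreadMap()

    def set_device(self, device_id):
        self.active_device_id_per_thread.set(device_id)

    def get_active_device_id(self):
        # Assumption: threads that never called set_device default to device 0.
        return self.active_device_id_per_thread.get(default=0)


if __name__ == "__main__":
    api = SketchDeviceAPI()
    api.set_device(3)
    t = threading.Thread(target=lambda: print("worker sees", api.get_active_device_id()))
    t.start()
    t.join()
    print("main sees", api.get_active_device_id())
```

Compared with a global thread-local entry, the state here lives and dies with the API object, at the cost of a lock on each lookup.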

puddingfjz · January 1, 2022, 3:05am

Hi @Lunderberg. If I want to run two GPU kernels on the same GPU at the same time, what should I do? Should I use the SetStream method? Since I am not familiar with CUDA, could you provide some example code? Thanks a lot!

Lunderberg · January 18, 2022, 3:15pm

Hi @puddingfjz. Yes, the SetStream method would allow two GPU kernels to be executed at the same time. It is exposed in the Python API as the Device.set_raw_stream method. Whichever stream has been most recently set as active will be used when launching a kernel.

dev = tvm.device('cuda')
stream1 = dev.create_raw_stream()
stream2 = dev.create_raw_stream()

dev.set_raw_stream(stream1)
func(input1, filter1, output1)

dev.set_raw_stream(stream2)
func(input2, filter2, output2)

That said, there are a few caveats:

This applies to a kernel that is defined in TE and launched explicitly. To my knowledge, the stream API isn’t widely used in Relay model executors. It probably works in that case, but I haven’t tested it, and that may change in the future if Relay models internally change the execution stream to fully utilize the GPU.
On backends that do not yet support multiple execution streams, calling set_raw_stream has no effect.
Kernel execution is performed assuming that all GPU resources can be used for a single kernel, and so this may result in inefficient execution.