How CUDA kernel is launched in TVM stack

sunlex0717 · April 1, 2020, 10:20am

Hi all:

I am learning the TVM CUDA backend. I have a question about how CUDA kernel is launched.

Below is my simple test program:

import tvm
from tvm import te
import numpy as np

dtype = "float32"
# GEMM size
M=16;K=8;N=16
# declear algorithm 
k = te.reduce_axis((0, K), 'k') # loop over dimension K
A = te.placeholder((M, K), name='A')
B = te.placeholder((K, N), name='B')
C = te.compute(
           (M, N),
           lambda x, y: te.sum(A[x, k] * B[k, y], axis=k),
           name='C')
# defualt schedule 
s = te.create_schedule(C.op)
#print(tvm.lower(s, [A, B, C], simple_mode=True))
# optimized schedule : tiling
bn = 4 # Tiling size: 4, over M, and N
# outer -> inner
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
#print(tvm.lower(s, [A, B, C], simple_mode=True))
AS =  s.cache_read(A, 'shared',[C])
BS =  s.cache_read(B, 'shared',[C])
s[AS].compute_at(s[C], xo)
s[BS].compute_at(s[C], yo)
s[C].bind(xo, te.thread_axis("blockIdx.x"))
s[C].bind(yo, te.thread_axis("blockIdx.y"))
s[C].bind(xi, te.thread_axis("threadIdx.x"))
s[C].bind(yi, te.thread_axis("threadIdx.y"))
target = 'cuda'
ctx = tvm.context(target, 0)
a = tvm.nd.array(np.random.rand(M, K).astype(dtype), ctx)
b = tvm.nd.array(np.random.rand(K, N).astype(dtype), ctx)
# comput C through numpy lib
answer = np.dot(a.asnumpy(), b.asnumpy())

func = tvm.build(s, [A, B, C], target=target, name='mmult')
c = tvm.nd.array(np.zeros((M, N), dtype=dtype), ctx)
# a, b : input matrix, c : resul
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)
#print(func.get_source())
dev_module = func.imported_modules[0]
print(dev_module)
print("-----GPU code-----")
print(dev_module.get_source())

The generated CUDA code:

extern "C" __global__ void mmult_kernel0( float* __restrict__ A,  float* __restrict__ B,  float* __restrict__ C) {
  __shared__ float A_shared[32];
  __shared__ float B_shared[32];
  for (int ax0 = 0; ax0 < 4; ++ax0) {
    for (int ax1 = 0; ax1 < 8; ++ax1) {
      A_shared[(((ax0 * 8) + ax1))] = A[((((((int)blockIdx.x) * 32) + (ax0 * 8)) + ax1))];
    }
  }
  for (int ax01 = 0; ax01 < 8; ++ax01) {
    for (int ax11 = 0; ax11 < 4; ++ax11) {
      B_shared[(((ax01 * 4) + ax11))] = B[((((ax01 * 16) + (((int)blockIdx.y) * 4)) + ax11))];
    }
  }
  C[(((((((int)blockIdx.x) * 64) + (((int)threadIdx.x) * 16)) + (((int)blockIdx.y) * 4)) + ((int)threadIdx.y)))] = 0.000000e+00f;
  __syncthreads();
  for (int k = 0; k < 8; ++k) {
    C[(((((((int)blockIdx.x) * 64) + (((int)threadIdx.x) * 16)) + (((int)blockIdx.y) * 4)) + ((int)threadIdx.y)))] = (C[(((((((int)blockIdx.x) * 64) + (((int)threadIdx.x) * 16)) + (((int)blockIdx.y) * 4)) + ((int)threadIdx.y)))] + (A_shared[(((((int)threadIdx.x) * 8) + k))] * B_shared[(((k * 4) + ((int)threadIdx.y)))]));
  }
}

Which is straightforward. But what confused me is that, how this kernel mmult_kernel0 is launched by host(CPU, LLVM backend). I did not see how blockdim and griddim is configured. We know normally we launch a CUDA kernel from CPU by:

kernel<<<griddim,blockdim>>>(a,b,c)

How TVM manage this settings? Could anyone give me some tips? @tqchen @FrozenGene

sunlex0717 · April 1, 2020, 11:30am

BTW, I am also wondering if TVM stack supports CUDA streaming features like (https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/)

masahi · April 1, 2020, 4:43pm

The answer is we use CUDA driver API to launch kernels from C++ code. kernel<<<griddim,blockdim>>>(a,b,c) is not the only way to launch kernel and it requires compiling with NVCC.

See

github.com

apache/incubator-tvm/blob/e0122c0ea68043372220e4e02b81692c34832227/src/runtime/cuda/cuda_module.cc#L189


void operator()(TVMArgs args,
                TVMRetValue* rv,
                void** void_args) const {
  int device_id;
  CUDA_CALL(cudaGetDevice(&device_id));
  if (fcache_[device_id] == nullptr) {
    fcache_[device_id] = m_->GetFunc(device_id, func_name_);
  }
  CUstream strm = static_cast<CUstream>(CUDAThreadEntry::ThreadLocal()->stream);
  ThreadWorkLoad wl = thread_axis_cfg_.Extract(args);
  CUresult result = cuLaunchKernel(
      fcache_[device_id],
      wl.grid_dim(0),
      wl.grid_dim(1),
      wl.grid_dim(2),
      wl.block_dim(0),
      wl.block_dim(1),
      wl.block_dim(2),
      0, strm, void_args, 0);
  if (result != CUDA_SUCCESS && result != CUDA_ERROR_DEINITIALIZED) {
    const char *msg;

There is a longer explanation of “a life of vector add” from python definition to cuda kernel launch here

https://docs.tvm.ai/dev/codebase_walkthrough.html#vector-add-example

sunlex0717 · April 1, 2020, 7:47pm

Hi: Thank you for your help! So, based on my understanding for these codes. in python

func(a,b,c)

will call this

void operator() (TVMArgs args,
                TVMRetValue* rv,
                void** void_args) const

And grid_dim, block_dim are inferred from TVMArgs args( a and b )in this case by

ThreadWorkLoad wl = thread_axis_cfg_.Extract(args);

And we can not manually set grid_dim, block_dim. Am I correct?

Thank you very much!

masahi · April 1, 2020, 10:17pm

Correct. You can tweak the schedule to change the launch config, but as a user you shouldn’t care about the exact size of grid/block.

If you really want the best perf, use autotvm to tune your schedule, and the resulting grid/block size is optimal based on real measurament.

sunlex0717 · April 2, 2020, 8:28am

Hi:

Thanks for you answer. I will check autotvm to see how it tunes grid/block. Because based on experience, grid/block dims will affect performance.

And another question is that, I see there is arg for cuda stream

CUstream strm = static_cast<CUstream>(CUDAThreadEntry::ThreadLocal()->stream);

I didn’t find any documents about cuda streaming supports in TVM, could you give me a hints about how we could use streaming?

Thank you very much!

masahi · April 2, 2020, 9:00am

I don’t know or think if we are exposing CUDA stream abstraction to python frontend. We typically don’t care about cuda stream (we don’t support any concurrency at runtime).

What is your use case?

sunlex0717 · April 2, 2020, 9:18am

Hi:

I am investigating the capability of TVM primitives (CUDA backend). I take CUTLASS as a baseline of highly-optimized CUDA library.

I think most of optimization techniques used in CUTLASS like tiling, shared_mem management are supported by TVM primitives.

Streaming is also an important optimization technique I think, but I did not find this property in TVM (python frond-end ). So I am wondering how can we use streaming in TVM stack. I think streaming is an important property for CUDA backend.

minisparrow · January 28, 2022, 12:59pm

from the code we can see griddim and blockdim are extracted from “TVMArgs args”, but how the args are passed from c++ side or python side, I can see the it directly passed from f(args, ret, addr) in pack_args.h file, but when up to upper stack, it will go through the python side. I can’t find where the args originally come from? Does it exist in the host code ir or other code?

masahi · January 28, 2022, 7:47pm

Thread sizes are extracted at https://github.com/apache/tvm/blob/bc505fcf80380bb39f76288c8651d22f873ecea1/src/tir/transforms/split_host_device.cc#L46-L54

split_host_device.cc is the key bridge between compile time and runtime world, so I suggest studying it carefully if you want to know the details.

And you can grep attr::thread_extent to find where AttrStmt with attr::thread_extent attributes is created in python or c++.

malixian · November 9, 2022, 9:40am

Hi, Have you found the source of TVMArgs? I’m also confused where did the grid size and block size come from.

ujay · December 5, 2022, 9:08am

Hi, I’m also confued by this question. Have you found any solution so far?