How CUDA kernel is launched in TVM stack

Hi all:

I am learning the TVM CUDA backend. I have a question about how CUDA kernel is launched.

Below is my simple test program:

import tvm
from tvm import te
import numpy as np

dtype = "float32"
# GEMM size
# declear algorithm 
k = te.reduce_axis((0, K), 'k') # loop over dimension K
A = te.placeholder((M, K), name='A')
B = te.placeholder((K, N), name='B')
C = te.compute(
           (M, N),
           lambda x, y: te.sum(A[x, k] * B[k, y], axis=k),
# defualt schedule 
s = te.create_schedule(C.op)
#print(tvm.lower(s, [A, B, C], simple_mode=True))
# optimized schedule : tiling
bn = 4 # Tiling size: 4, over M, and N
# outer -> inner
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
#print(tvm.lower(s, [A, B, C], simple_mode=True))
AS =  s.cache_read(A, 'shared',[C])
BS =  s.cache_read(B, 'shared',[C])
s[AS].compute_at(s[C], xo)
s[BS].compute_at(s[C], yo)
s[C].bind(xo, te.thread_axis("blockIdx.x"))
s[C].bind(yo, te.thread_axis("blockIdx.y"))
s[C].bind(xi, te.thread_axis("threadIdx.x"))
s[C].bind(yi, te.thread_axis("threadIdx.y"))
target = 'cuda'
ctx = tvm.context(target, 0)
a = tvm.nd.array(np.random.rand(M, K).astype(dtype), ctx)
b = tvm.nd.array(np.random.rand(K, N).astype(dtype), ctx)
# comput C through numpy lib
answer =, b.asnumpy())

func =, [A, B, C], target=target, name='mmult')
c = tvm.nd.array(np.zeros((M, N), dtype=dtype), ctx)
# a, b : input matrix, c : resul
func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), answer, rtol=1e-5)
dev_module = func.imported_modules[0]
print("-----GPU code-----")

The generated CUDA code:

extern "C" __global__ void mmult_kernel0( float* __restrict__ A,  float* __restrict__ B,  float* __restrict__ C) {
  __shared__ float A_shared[32];
  __shared__ float B_shared[32];
  for (int ax0 = 0; ax0 < 4; ++ax0) {
    for (int ax1 = 0; ax1 < 8; ++ax1) {
      A_shared[(((ax0 * 8) + ax1))] = A[((((((int)blockIdx.x) * 32) + (ax0 * 8)) + ax1))];
  for (int ax01 = 0; ax01 < 8; ++ax01) {
    for (int ax11 = 0; ax11 < 4; ++ax11) {
      B_shared[(((ax01 * 4) + ax11))] = B[((((ax01 * 16) + (((int)blockIdx.y) * 4)) + ax11))];
  C[(((((((int)blockIdx.x) * 64) + (((int)threadIdx.x) * 16)) + (((int)blockIdx.y) * 4)) + ((int)threadIdx.y)))] = 0.000000e+00f;
  for (int k = 0; k < 8; ++k) {
    C[(((((((int)blockIdx.x) * 64) + (((int)threadIdx.x) * 16)) + (((int)blockIdx.y) * 4)) + ((int)threadIdx.y)))] = (C[(((((((int)blockIdx.x) * 64) + (((int)threadIdx.x) * 16)) + (((int)blockIdx.y) * 4)) + ((int)threadIdx.y)))] + (A_shared[(((((int)threadIdx.x) * 8) + k))] * B_shared[(((k * 4) + ((int)threadIdx.y)))]));

Which is straightforward. But what confused me is that, how this kernel mmult_kernel0 is launched by host(CPU, LLVM backend). I did not see how blockdim and griddim is configured. We know normally we launch a CUDA kernel from CPU by:


How TVM manage this settings? Could anyone give me some tips? @tqchen @FrozenGene


BTW, I am also wondering if TVM stack supports CUDA streaming features like (

The answer is we use CUDA driver API to launch kernels from C++ code. kernel<<<griddim,blockdim>>>(a,b,c) is not the only way to launch kernel and it requires compiling with NVCC.


There is a longer explanation of “a life of vector add” from python definition to cuda kernel launch here

Hi: Thank you for your help! So, based on my understanding for these codes. in python


will call this

void operator() (TVMArgs args,
                TVMRetValue* rv,
                void** void_args) const

And grid_dim, block_dim are inferred from TVMArgs args( a and b )in this case by

ThreadWorkLoad wl = thread_axis_cfg_.Extract(args);

And we can not manually set grid_dim, block_dim. Am I correct?

Thank you very much!

1 Like

Correct. You can tweak the schedule to change the launch config, but as a user you shouldn’t care about the exact size of grid/block.

If you really want the best perf, use autotvm to tune your schedule, and the resulting grid/block size is optimal based on real measurament.


Thanks for you answer. I will check autotvm to see how it tunes grid/block. Because based on experience, grid/block dims will affect performance.

And another question is that, I see there is arg for cuda stream

CUstream strm = static_cast<CUstream>(CUDAThreadEntry::ThreadLocal()->stream);

I didn’t find any documents about cuda streaming supports in TVM, could you give me a hints about how we could use streaming?

Thank you very much!

I don’t know or think if we are exposing CUDA stream abstraction to python frontend. We typically don’t care about cuda stream (we don’t support any concurrency at runtime).

What is your use case?


I am investigating the capability of TVM primitives (CUDA backend). I take CUTLASS as a baseline of highly-optimized CUDA library.

I think most of optimization techniques used in CUTLASS like tiling, shared_mem management are supported by TVM primitives.

Streaming is also an important optimization technique I think, but I did not find this property in TVM (python frond-end ). So I am wondering how can we use streaming in TVM stack. I think streaming is an important property for CUDA backend.

from the code we can see griddim and blockdim are extracted from “TVMArgs args”, but how the args are passed from c++ side or python side, I can see the it directly passed from f(args, ret, addr) in pack_args.h file, but when up to upper stack, it will go through the python side. I can’t find where the args originally come from? Does it exist in the host code ir or other code?

Thread sizes are extracted at is the key bridge between compile time and runtime world, so I suggest studying it carefully if you want to know the details.

And you can grep attr::thread_extent to find where AttrStmt with attr::thread_extent attributes is created in python or c++.