Data allocation of GPU kernel fusion

I have two questions regarding data allocation related to operator fusion for GPU.

  1. Suppose that there are two CUDA kernels and the output of the first kernel is used as the input of the second. With operator fusion for GPU done, where is the output located after executing the first kernel? Is it in global memory?

  2. The paper describes the purpose of GPU kernel fusion as avoiding saving intermediate results back to global memory. What if several operators are fused into a single CUDA kernel, but the intermediate results from the fused kernel become too big for shared memory? What would happen in NNVM/TVM? Is there any code in NNVM/TVM that handles such a case, or is it left to the GPU's cache management?

Thanks!

  1. TVM uses a memory scope to indicate where a buffer is placed.

```python
# you can set it explicitly
s[B].set_scope('local')
s[B].set_scope('shared')
s[B].set_scope('global')

# sometimes tvm can compute the scope automatically
s[B].compute_at(s[C], i)
# if i is bound to blockIdx, then the scope of B is 'shared' by default
# if i is bound to threadIdx, then the scope of B is 'local' by default

# tvm will compute the correct size of B by bound inference according to its memory scope
```
  2. If B is very large and you want to put it in global memory, you should allocate that buffer explicitly and pass it to the function built by tvm.
```python
import tvm
import numpy as np

N = 1024

A = tvm.placeholder((N, N), name='A')
B = tvm.compute((N, N), lambda i, j: A[i, j], name='B')
C = tvm.compute((N, N), lambda i, j: B[i, j], name='C')

s = tvm.create_schedule([C.op])

i, j = s[C].op.axis
s[C].bind(i, tvm.thread_axis("blockIdx.x"))

s[B].compute_at(s[C], i)  # fuse B into C

# note that we add `B` to the argument list;
# if we pass B in the argument list, then the scope of B is 'global'
print(tvm.lower(s, [A, B, C], simple_mode=True))
func = tvm.build(s, [A, B, C], 'cuda')

# allocate three buffers and call `func`
a = tvm.nd.array(np.random.randn(N, N).astype(np.float32), ctx=tvm.gpu())
b = tvm.nd.empty((N, N), ctx=tvm.gpu())
c = tvm.nd.empty((N, N), ctx=tvm.gpu())
func(a, b, c)
```

On the other hand, a very large intermediate result often means there is no locality in your computation. You cannot benefit from the fusion, so you can consider unfusing the operators. If you do not use s[B].compute_at, you need not pass B to the argument list or allocate it explicitly.