- TVM uses the memory scope to indicate where a buffer should be placed.
```python
# You can set the scope explicitly (each of these is an alternative):
s[B].set_scope('local')
s[B].set_scope('shared')
s[B].set_scope('global')

# Sometimes TVM can infer the scope automatically:
s[B].compute_at(s[C], i)
# If i is bound to blockIdx, the scope of B defaults to 'shared'.
# If i is bound to threadIdx, the scope of B defaults to 'local'.
# TVM then computes the correct size of B by bound inference, according to its memory scope.
```
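For example, here is a minimal, self-contained sketch of the automatic case (the split factor 64 and the `+1`/`*2` pipeline are arbitrary choices, and the exact lowered IR depends on your TVM version): fusing B under a threadIdx-bound axis gives it `local` scope, and bound inference shrinks the allocation to the per-thread footprint.

```python
import tvm

n = 1024
A = tvm.placeholder((n,), name='A')
B = tvm.compute((n,), lambda i: A[i] + 1, name='B')
C = tvm.compute((n,), lambda i: B[i] * 2, name='C')

s = tvm.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, tvm.thread_axis("blockIdx.x"))
s[C].bind(tx, tvm.thread_axis("threadIdx.x"))

# B is fused under the threadIdx axis, so its scope defaults to 'local'
# and bound inference shrinks its allocation to one element per thread.
s[B].compute_at(s[C], tx)
# (You could also override the inferred scope with s[B].set_scope(...).)

print(tvm.lower(s, [A, C], simple_mode=True))
```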
- If B is very large and you want to put it in global memory, you should allocate that buffer explicitly and pass it to the function built by TVM.
```python
import tvm
import numpy as np

N = 1024
A = tvm.placeholder((N, N), name='A')
B = tvm.compute((N, N), lambda i, j: A[i, j], name='B')
C = tvm.compute((N, N), lambda i, j: B[i, j], name='C')

s = tvm.create_schedule([C.op])
i, j = s[C].op.axis
s[C].bind(i, tvm.thread_axis("blockIdx.x"))
s[B].compute_at(s[C], i)  # fuse the computation of B into C's outer loop

# Note that we add B to the argument list.
# Because B is passed as an argument, its scope is 'global'.
print(tvm.lower(s, [A, B, C], simple_mode=True))
func = tvm.build(s, [A, B, C], 'cuda')

# Allocate the three buffers and call `func`.
a = tvm.nd.array(np.random.randn(N, N).astype(np.float32), ctx=tvm.gpu())
b = tvm.nd.empty((N, N), ctx=tvm.gpu())
c = tvm.nd.empty((N, N), ctx=tvm.gpu())
func(a, b, c)
```
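If you want to sanity-check the result (in this example both B and C are plain copies of A), something like the following should work:

```python
np.testing.assert_allclose(b.asnumpy(), a.asnumpy())
np.testing.assert_allclose(c.asnumpy(), a.asnumpy())
```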
On the other hand, a very large intermediate result often means there is little locality in your computation. Fusion brings no benefit in that case, so you can consider leaving the stages unfused. If you do not use `s[B].compute_at`, you do not need to pass `B` in the argument list or allocate it explicitly.
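As a rough sketch of the unfused variant (reusing `A`, `B`, `C` and the arrays `a`, `c` from above; `func_unfused` is just an illustrative name, and I bind each stage to a thread axis since the CUDA target needs that): only A and C appear in the argument list, and TVM allocates the intermediate global buffer for B itself.

```python
s2 = tvm.create_schedule(C.op)

# B and C now become two separate kernels; bind each one to a thread axis.
bi, bj = s2[B].op.axis
s2[B].bind(bi, tvm.thread_axis("blockIdx.x"))
ci, cj = s2[C].op.axis
s2[C].bind(ci, tvm.thread_axis("blockIdx.x"))

# B is not in the argument list and we do not allocate it ourselves.
func_unfused = tvm.build(s2, [A, C], 'cuda')
func_unfused(a, c)
```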