I want to measure the latency of a conv2d operator with cuDNN. The input shape is (16, 128, 130, 130) (NCHW) and the kernel shape is (128, 128, 3, 3), with stride = 1, padding = 0, dilation = 1. I use the following code for the measurement:
import numpy as np
import tvm
from tvm import te
from tvm.topi.testing import conv2d_nchw_python

data_np = np.random.uniform(size=(16, 128, 130, 130)).astype(np.float32)
weight_np = np.random.uniform(size=(128, 128, 3, 3)).astype(np.float32)
# Reference result on CPU: stride (1, 1), padding (0, 0)
out_np = conv2d_nchw_python(data_np, weight_np, (1, 1), (0, 0))
dev = tvm.cuda()
data_tvm = tvm.nd.array(data_np, device=dev)
weight_tvm = tvm.nd.array(weight_np, device=dev)
out_tvm = tvm.nd.empty(out_np.shape, device=dev)
X = te.placeholder((16, 128, 130, 130), name='X')
W = te.placeholder((128, 128, 3, 3), name='W')
# args: pad, stride, dilation, conv_mode, tensor_format (0 = NCHW), algo (-1 = pick), conv_dtype
Y = tvm.contrib.cudnn.conv_forward(X, W, (0, 0), (1, 1), (1, 1), 0, 0, -1, None, groups=1)
tensor_args_cudnn = [X, W, Y]
sched_cudnn = te.create_schedule(Y.op)
cudnn_kernel = tvm.build(sched_cudnn, tensor_args_cudnn, target=tvm.target.Target("cuda"))
cudnn_kernel(data_tvm, weight_tvm, out_tvm)
# check results of cudnn
np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3)
# Warm up first so the measured run excludes one-time setup costs
warmup_evaluator = cudnn_kernel.time_evaluator(cudnn_kernel.entry_name, dev, number=3, repeat=1, min_repeat_ms=300)
warmup_evaluator(data_tvm, weight_tvm, out_tvm)
time_evaluator = cudnn_kernel.time_evaluator(cudnn_kernel.entry_name, dev, number=3, repeat=1, min_repeat_ms=300)
latency = time_evaluator(data_tvm, weight_tvm, out_tvm).mean
print(latency)
However, the measured time looks strange: it is 0.0008706393072164948 s on an A100 GPU, which works out to 88796.14 GFLOPS (about 88.8 TFLOPS), while the FP32 peak of an A100 is less than 20 TFLOPS.
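For context, the GFLOPS number above comes from the usual conv2d FLOP count (2 FLOPs per multiply-accumulate) divided by the measured latency. Here is a quick sanity check of that arithmetic (the variable names are mine, not from the script above):

```python
# Shapes from the post: data (N, C, H, W), kernel (K, C, kh, kw), stride 1, no padding
N, C, H, W = 16, 128, 130, 130
K, kh, kw = 128, 3, 3
H_out, W_out = H - kh + 1, W - kw + 1  # 128 x 128 output

# 2 * (output elements) * (MACs per output element)
flops = 2 * N * K * H_out * W_out * C * kh * kw  # 77,309,411,328

latency = 0.0008706393072164948  # measured seconds
gflops = flops / latency / 1e9
print(gflops)  # ~88796.14, matching the number in the post
```

So the FLOP count itself seems right, which is why the throughput looks impossibly high for FP32.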
Can anyone point out what’s wrong here? Thanks a lot!