Optimizing matrix multiplication for GPU

How would you do that with TOPI? I tried the following:

    import tvm
    from tvm import te
    import topi

    # Symbolic shapes: A is (bsz, d1, d3), B is (bsz, d2, d3).
    # topi.nn.batch_matmul contracts the last axis of both, so R is (bsz, d1, d2).
    bsz = te.var('bsz')
    d1 = te.var('d1')
    d2 = te.var('d2')
    d3 = te.var('d3')
    A = te.placeholder((bsz, d1, d3), name='A', dtype='float32')  # first tensor
    B = te.placeholder((bsz, d2, d3), name='B', dtype='float32')  # second tensor
    R = topi.nn.batch_matmul(A, B)
    with tvm.target.cuda():
        s = topi.cuda.schedule_reduce(R)

but got this traceback:

    Traceback (most recent call last):
      File "scripts/band_mm_minimal.py", line 94, in <module>
        mm_fun_topi = _compile_function_topi()
      File "scripts/band_mm_minimal.py", line 63, in _compile_function_topi
        s = topi.cuda.schedule_reduce(R)
      File "/usr/tvm/topi/python/topi/cuda/reduction.py", line 147, in schedule_reduce
        traverse_after_reduce(outs[0].op)
      File "/usr/tvm/topi/python/topi/cuda/reduction.py", line 143, in traverse_after_reduce
        raise RuntimeError("Unsupported operator: %s" % operator.tag)
    RuntimeError: Unsupported operator: batch_matmul
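
From the traceback, traverse_after_reduce dispatches on the operator's tag and raises for anything it does not recognize, and batch_matmul is evidently not on its list. A quick check confirms this (the .tag attribute is what the scheduler reads, judging from the operator.tag in the raise):

    print(R.op.tag)  # -> 'batch_matmul', a tag schedule_reduce does not handle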

Which schedule should I use for batch_matmul on CUDA?
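
For reference, here is a minimal sketch of what I would guess is the intended pairing, assuming topi.cuda.schedule_batch_matmul exists in my TVM version and accepts the output tensor (I have not verified either):

    # Sketch, assuming topi.cuda.schedule_batch_matmul is the schedule
    # that pairs with topi.nn.batch_matmul (unverified).
    with tvm.target.cuda():
        s = topi.cuda.schedule_batch_matmul(R)
    fun = tvm.build(s, [A, B, R], target='cuda', name='batch_mm')
    print(fun.imported_modules[0].get_source())  # inspect the generated CUDA kernel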