Help with TOPI Functions

I am trying to use the batch_matmul function for CUDA, but I keep getting the error `RuntimeError: Cannot find TOPI workload batch_matmul.cuda. Is it registered with register_topi_compute?`. This error is raised while executing

s = topi.cuda.schedule_batch_matmul(R)

The entire code is as follows.

import torch
import tvm
from tvm import te, topi
from tvm.contrib import dlpack

def _codegen_function(d1, d2, name):
    bsz = te.var('bsz')  # bsz and d3 can be variables without impact on performance
    d3 = te.var('d3')    # but d1 and d2 should be constants for `schedule_batch_matmul` to work
    A = te.placeholder((bsz, d1, d3), name='A', dtype='float32')
    B = te.placeholder((bsz, d2, d3), name='B', dtype='float32')
    R = topi.nn.batch_matmul(A, B)
    with tvm.target.Target("cuda"):
        s = topi.cuda.schedule_batch_matmul(R)  # raises: Cannot find TOPI workload batch_matmul.cuda
    return tvm.build(s, [A, B, R], name=name, target='cuda')

if __name__ == "__main__":
    bsz = 12
    d1 = 2048
    d2 = 512
    d3 = 64

    bmm = _codegen_function(d1, d2, 'bmm')
    bmm_pytorch = dlpack.to_pytorch_func(bmm)  # wrap it as a pytorch function

    A = torch.randn(bsz, d1, d3, device='cuda')
    B = torch.randn(bsz, d2, d3, device='cuda')
    R = B.new_empty(bsz, d1, d2)  # allocate memory for the result tensor
   
    bmm_pytorch(A, B, R)

Hi @gkolhe. Unless you want to apply a specific schedule from TOPI, the default schedule should work:

R = topi.nn.batch_matmul(A, B)
s = te.create_schedule(R.op)
return tvm.build(s, [A, B, R], name=name)
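
Spelled out as a complete function, that suggestion looks roughly like the sketch below (the name _codegen_default is just a placeholder); without an explicit target, tvm.build compiles for the default llvm (CPU) target, where the default TE schedule is valid.

import tvm
from tvm import te, topi

def _codegen_default(d1, d2, name):
    bsz = te.var('bsz')
    d3 = te.var('d3')
    A = te.placeholder((bsz, d1, d3), name='A', dtype='float32')
    B = te.placeholder((bsz, d2, d3), name='B', dtype='float32')
    R = topi.nn.batch_matmul(A, B)
    s = te.create_schedule(R.op)  # default schedule: plain nested loops, no GPU binding
    return tvm.build(s, [A, B, R], name=name)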

I followed your suggestion, but I get the following error. Please find my code below as well.

import torch
from tvm import topi
import tvm
from tvm import te
from tvm.contrib import dlpack

def _codegen_function(name):
    d1 = te.var('d1')   # D1 -> # of rows of first matrix
    d2 = te.var('d2')   # D2 -> # of columns of first matrix
    bsz = te.var('bsz') # bsz and d3 can be variables without impact on performance 
    d3 = te.var('d3')   
    A = te.placeholder((bsz, d1, d3), name='A', dtype='float32')
    B = te.placeholder((bsz, d2, d3), name='B', dtype='float32')
    R = topi.nn.batch_matmul(A, B)
    s = te.create_schedule(R.op)
    return tvm.build(s, [A, B, R], name=name, target='cuda')

if __name__ == "__main__":
    bsz = 12
    d1 = 2048
    d2 = 1024
    d3 = 64

    bmm1 = _codegen_function('bmm1') 
    bmm1_pytorch = dlpack.to_pytorch_func(bmm1)  # wrap it as a pytorch function

    A = torch.randn(bsz, d1, d3, device='cuda')
    B = torch.randn(bsz, d2, d3, device='cuda')
    R = B.new_empty(bsz, d1, d2)  # allocate memory for the result tensor
   
    bmm1_pytorch(A, B, R)

Error:

TVMError: Traceback (most recent call last):
  10: TVMFuncCall
  9: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::$_5>(tvm::$_5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  8: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
  7: tvm::SplitMixedModule(tvm::IRModule, tvm::Target const&, tvm::Target const&)
  6: tvm::ApplyPasses(tvm::IRModule, tvm::transform::Sequential)
  5: tvm::transform::Pass::operator()(tvm::IRModule) const
  4: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  3: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  2: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  1: tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  0: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::tir::transform::VerifyMemory()::$_0>(tvm::tir::transform::VerifyMemory()::$_0)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  Did you forget to bind?
    Variable `B` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `A` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `T_batch_matmul_NT` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `T_batch_matmul_NT` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `T_batch_matmul_NT` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
  File "/home/gakolhe/tvm/src/tir/analysis/verify_memory.cc", line 214
RuntimeError: Memory verification failed with the following errors:
PrimFunc([A, B, T_batch_matmul_NT]) attrs={"from_legacy_te_schedule": (bool)1, "global_symbol": "bmm1", "tir.noalias": (bool)1, "target": cuda -keys=cuda,gpu -arch=sm_60 -max_num_threads=1024 -thread_warp_size=32} {
  for (b, 0, {batch|batch>=0}) {
    for (i, 0, d1) {
      for (j, 0, d2) {
        T_batch_matmul_NT[(((b*stride) + (i*stride)) + (j*stride))] = 0f
        for (k, 0, d3) {
          T_batch_matmul_NT[(((b*stride) + (i*stride)) + (j*stride))] = (T_batch_matmul_NT[(((b*stride) + (i*stride)) + (j*stride))] + (A[(((b*stride) + (i*stride)) + (k*stride))]*B[(((b*stride) + (j*stride)) + (k*stride))]))
        }
      }
    }
  }
}

Oh, it seems the default schedule is not valid for CUDA. You can apply a tuned schedule or your own manual schedule to make it work, but I cannot think of a convenient interface at the TE level right now.
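
For illustration, a manual schedule only has to bind the output loops to GPU threads for the memory verification above to pass. This is a rough, untuned sketch with static shapes; the split factor of 64 and the function name are arbitrary choices, not what TOPI or a tuner would produce.

import tvm
from tvm import te, topi

def _codegen_manual(name):
    bsz, d1, d2, d3 = 12, 2048, 1024, 64  # static shapes, matching the example above
    A = te.placeholder((bsz, d1, d3), name="A", dtype="float32")
    B = te.placeholder((bsz, d2, d3), name="B", dtype="float32")
    R = topi.nn.batch_matmul(A, B)
    s = te.create_schedule(R.op)
    b, i, j = s[R].op.axis                        # the three spatial output loops
    fused = s[R].fuse(b, i, j)
    bx, tx = s[R].split(fused, factor=64)         # arbitrary split factor
    s[R].bind(bx, te.thread_axis("blockIdx.x"))   # bind outer loop to GPU blocks
    s[R].bind(tx, te.thread_axis("threadIdx.x"))  # bind inner loop to GPU threads
    return tvm.build(s, [A, B, R], name=name, target="cuda")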

Is there any specific reason for you to stay in TE? Otherwise, generating code from a Relay op by using relay.build is generally the recommended approach.

Edit: the schedules under topi (e.g., topi.cuda.schedule_xxx) are meant for AutoTVM. If you don’t need to stick with AutoTVM, MetaSchedule (the latest tuning technology) provides a tuning interface for TE, and it also works with static shapes.

import tempfile

import tvm
from tvm import te, topi
from tvm.meta_schedule import TuneConfig, tune_te
from tvm.tir import Schedule

def _codegen_function(d1, d2, name):
    bsz = 12   # static
    d3 = 20    # static
    A = te.placeholder((bsz, d1, d3), name="A", dtype="float32")
    B = te.placeholder((bsz, d2, d3), name="B", dtype="float32")
    R = topi.cuda.batch_matmul(A, B)
    with tempfile.TemporaryDirectory() as work_dir:
        sch: Schedule = tune_te(
            tensors=(A, B, R),
            target=tvm.target.Target("nvidia/geforce-rtx-3070"),
            config=TuneConfig(
                strategy="replay_trace",
                num_trials_per_iter=1,
                max_trials_per_task=1,
                max_trials_global=1,
            ),
            work_dir=work_dir,
        )
        if sch is None:
            print("No valid schedule found!")
        else:
            print(sch.mod.script())
            print(sch.trace)
        return tvm.build(sch.mod, [A, B, R], name=name, target="cuda")

Okay, after playing around a little, I’ve realized you were using dynamic shapes, and this is likely the cause, since the implementation and schedule you are applying do not support them. The following code works for me.

def _codegen_function(d1, d2, name):
    bsz = 12
    d3 = 20
    A = te.placeholder((bsz, d1, d3), name="A", dtype="float32")
    B = te.placeholder((bsz, d2, d3), name="B", dtype="float32")
    R = topi.cuda.batch_matmul(A, B)
    with tvm.target.Target("cuda"):
        s = topi.cuda.schedule_batch_matmul(R)
        return tvm.build(s, [A, B, R], name=name)

If you want to try dynamic shapes, I think it would be better to use the Relay op and relay.build, since they go through the mechanism that finds the right implementation and schedule for each context.


Hi @sunggg

Thanks for the help. Could you give me some insight into what you mean by the Relay op and relay.build?

Are there any examples of this?

Something like this:

import tvm
from tvm import relay

d3 = 20
bsz = 10
d1, d2 = 2048, 1024  # example sizes for the two matrix dims
x = relay.var("x", shape=(bsz, d1, d3), dtype="float32")
y = relay.var("y", shape=(bsz, d2, d3), dtype="float32")
out = relay.nn.batch_matmul(x, y)
f = relay.Function([x, y], out)
with tvm.target.Target("cuda") as target:
    lib = relay.build(f)  # picks up the "cuda" target from the enclosing context

You can try relay.Any() to express a dynamic shape dimension, but the performance may not be good, since tuning workloads with dynamic shapes is still a work in progress.
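
Roughly, a minimal sketch of that dynamic-shape route, assuming the batch dimension is the dynamic one and using example values for the other dims; dynamic shapes generally go through the Relay VM (relay.vm.compile) rather than the graph executor.

import numpy as np
import tvm
from tvm import relay
from tvm.runtime.vm import VirtualMachine

d1, d2, d3 = 2048, 1024, 64  # example static dims; batch is left dynamic
x = relay.var("x", shape=(relay.Any(), d1, d3), dtype="float32")
y = relay.var("y", shape=(relay.Any(), d2, d3), dtype="float32")
out = relay.nn.batch_matmul(x, y)
mod = tvm.IRModule.from_expr(relay.Function([x, y], out))

vm_exec = relay.vm.compile(mod, target="cuda")  # the VM executor handles dynamic shapes
dev = tvm.cuda()
vm = VirtualMachine(vm_exec, dev)
a = tvm.nd.array(np.random.randn(12, d1, d3).astype("float32"), dev)
b = tvm.nd.array(np.random.randn(12, d2, d3).astype("float32"), dev)
res = vm.invoke("main", a, b)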