[autoscheduler] GPU failure on compile for BERT-like models

I’m trying to auto-schedule BERT-like models on a GPU; the same flow works successfully on the CPU.

However, things fail when running on the GPU. All of the tuning tasks complete successfully, but when compiling the full model with the log file I always crash with this error:

File "/home/wheest/tools/tvm/python/tvm/relay/op/strategy/generic.py", line 767, in _compute_batch_matmul
  return [topi_compute(*args)]
File "/home/wheest/tools/tvm/python/tvm/autotvm/task/topi_integration.py", line 165, in wrapper
  node = topi_compute(cfg, *args)
File "/home/wheest/tools/tvm/python/tvm/topi/cuda/batch_matmul.py", line 32, in batch_matmul
  return nn.batch_matmul(x, y)
File "/home/wheest/tools/tvm/python/tvm/topi/nn/batch_matmul.py", line 57, in batch_matmul
    assert len(x_shape) == 3 and len(y_shape) == 3, "only support 3-dim batch_matmul"

Presumably Ansor is doing some sort of tiling or layout rewriting that violates this assumption.

Is there an auto-scheduling flag (or some other trick) that ensures this 3-dim batch_matmul rule isn’t broken?

Hmm, this is strange. I’ve done a lot of auto-scheduling on transformer-based models and didn’t encounter issues. cc @comaniac

Maybe something goes wrong in the AutoScheduler when rewriting the layout. You can try disabling the layout rewrite in the AutoScheduler by specifying disabled_pass={"AutoSchedulerLayoutRewrite"} in the PassContext.

Also cc @jcf94


Thanks for the help, though with this approach I got the classic error about computation not being bound to GPU threads:

 Did you forget to bind?
    Variable `placeholder` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `placeholder` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `T_dense` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `T_dense` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
    Variable `T_dense` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
  File "../src/tir/analysis/verify_memory.cc", line 202
RuntimeError: Memory verification failed with the following errors:
PrimFunc([placeholder, placeholder, T_dense]) attrs={"global_symbol": "fused_nn_dense_73", "tir.noalias": (bool)1, "target": opencl -keys=mali,opencl,gpu -device=mali -max_num_threads=256 -thread_warp_size=1} {
  T_dense[0] = 0f
  for (k, 0, 768) {
    T_dense[0] = (T_dense[0] + (placeholder[k]*placeholder[k]))
  }
}

Hi, I have encountered the same problem. Have you solved it yet?

Hi @hope51607 unfortunately I have not found a solution to this problem yet.

@Wheest I solved this problem by adding the following code to python/tvm/relay/op/strategy/mali.py:

@batch_matmul_strategy.register("mali")
def batch_matmul_strategy_mali(attrs, inputs, out_type, target):
    """batch_matmul mali strategy"""
    strategy = _op.OpStrategy()
    if not is_auto_scheduler_enabled():
        strategy.add_implementation(
            wrap_compute_batch_matmul(topi.nn.batch_matmul),
            # A schedule function belongs here, not the compute;
            # fall back to the generic batch_matmul schedule.
            wrap_topi_schedule(topi.generic.schedule_batch_matmul),
            name="batch_matmul.mali",
        )
    else:
        strategy.add_implementation(
            wrap_compute_batch_matmul(topi.nn.batch_matmul, need_auto_scheduler_layout=True),
            naive_schedule,
            name="batch_matmul.mali",
        )
    return strategy

I don’t know what target you used, but maybe it also works for you.