I’m trying to auto-schedule BERT-like models on a GPU. The same flow works on the CPU.
On the GPU, all of the tuning tasks complete successfully, but compiling the full model with the log file always crashes with the error:
  File "/home/wheest/tools/tvm/python/tvm/relay/op/strategy/generic.py", line 767, in _compute_batch_matmul
    return [topi_compute(*args)]
  File "/home/wheest/tools/tvm/python/tvm/autotvm/task/topi_integration.py", line 165, in wrapper
    node = topi_compute(cfg, *args)
  File "/home/wheest/tools/tvm/python/tvm/topi/cuda/batch_matmul.py", line 32, in batch_matmul
    return nn.batch_matmul(x, y)
  File "/home/wheest/tools/tvm/python/tvm/topi/nn/batch_matmul.py", line 57, in batch_matmul
    assert len(x_shape) == 3 and len(y_shape) == 3, "only support 3-dim batch_matmul"
Presumably Ansor is trying to do some sort of tiling and it is causing problems.
Is there an auto-scheduling flag or similar which can ensure we don’t break this batch_matmul rule?
Maybe something goes wrong in AutoScheduler when rewriting the layout. You can try disabling the layout rewrite in AutoScheduler by specifying disabled_pass={"AutoSchedulerLayoutRewrite"} in PassContext.
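For reference, here is a minimal sketch of what that could look like at compile time. It assumes you already have a Relay module, its params, a target, and the tuning log from your own script; the helper name build_with_tuning_log and the DISABLED_PASSES constant are made up for illustration and are not part of TVM.

```python
# Passes to switch off while building. The pass name comes from the
# suggestion above; the constant itself is a hypothetical convenience.
DISABLED_PASSES = ["AutoSchedulerLayoutRewrite"]


def build_with_tuning_log(mod, params, target, log_file):
    """Compile a Relay module with auto-scheduler records, layout rewrite disabled.

    mod, params, target and log_file are assumed to come from your existing
    tuning script. TVM is imported lazily so this file loads without it.
    """
    import tvm
    from tvm import auto_scheduler, relay

    # Apply the best schedules found during tuning.
    with auto_scheduler.ApplyHistoryBest(log_file):
        # Enable the auto-scheduler in the Relay backend, but skip the
        # layout rewrite pass that is suspected of causing the failure.
        with tvm.transform.PassContext(
            opt_level=3,
            config={"relay.backend.use_auto_scheduler": True},
            disabled_pass=DISABLED_PASSES,
        ):
            return relay.build(mod, target=target, params=params)
```

You would call it as lib = build_with_tuning_log(mod, params, target, log_file) in place of your current relay.build call. Whether this gets around the batch_matmul assertion depends on the layout rewrite really being the cause, which is only a guess at this point.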
Thanks for the help. With this approach, though, I hit the classic error about computation outside the GPU loop bounds:
Did you forget to bind?
Variable `placeholder` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable `placeholder` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable `T_dense` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable `T_dense` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
Variable `T_dense` is directly accessed by host memory (it is not contained in a thread environment or in the function arguments.
File "../src/tir/analysis/verify_memory.cc", line 202
RuntimeError: Memory verification failed with the following errors:
PrimFunc([placeholder, placeholder, T_dense]) attrs={"global_symbol": "fused_nn_dense_73", "tir.noalias": (bool)1, "target": opencl -keys=mali,opencl,gpu -device=mali -max_num_threads=256 -thread_warp_size=1} {
T_dense[0] = 0f
for (k, 0, 768) {
T_dense[0] = (T_dense[0] + (placeholder[k]*placeholder[k]))
}
}