Did you get the Relay module working without auto-tuning? Is your target a GPU?
error_no=4 is related to a runtime error, which may be caused by various reasons (e.g., out of device memory, etc.). If AutoTVM consistently fails, then it’s highly possible that the Relay module will fail even without AutoTVM.
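If you want to see which configs hit that error during tuning, something like the sketch below may help. It is only a rough sketch, assuming the tuner callback signature (tuner, inputs, results) and the MeasureErrorNo enum from tvm.autotvm.measure; you would pass it via the callbacks argument of tuner.tune().

from tvm.autotvm.measure import MeasureErrorNo

def report_runtime_errors(tuner, inputs, results):
    # Each MeasureResult carries an error_no; RUNTIME_DEVICE corresponds to error_no=4
    for inp, res in zip(inputs, results):
        if res.error_no == MeasureErrorNo.RUNTIME_DEVICE:
            print("Runtime error for config:", inp.config)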
Did you get the Relay module working without auto-tuning?
When I remove the tuning calls, it says:
Cannot find config for target=llvm, workload=('matmul', 2048, 2048, 2048, 'float32'). A fallback
configuration is used, which may bring great performance regression.
GFlops: 0.9504301626367904
Where does your matmul op come from? Did you implement it yourself following the tutorial? If you just want to get matrix multiplication working, you can directly use the TOPI builtin dense. For example:
import tvm
from tvm import relay, autotvm

dtype = "float32"
B = 2048   # batch size
I = 2048   # input features
O = 2048   # output features

x = relay.var("x", shape=(B, I), dtype=dtype)
w = relay.var("w", shape=(O, I), dtype=dtype)
net = relay.nn.dense(x, w)

module = relay.Module.from_expr(net)
module = relay.transform.InferType()(module)

target = 'llvm -mcpu=???'  # Your CPU model
tasks = autotvm.task.extract_from_program(module['main'], target=target,
                                          params={}, ops=(relay.op.nn.dense,))
tune_tasks(tasks, **tuning_option)  # tune_tasks/tuning_option as in the AutoTVM tutorial
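After tuning, a minimal sketch of how you could build and run the module with the tuned schedules, continuing from the snippet above (log_file is a hypothetical placeholder for whatever file your tuning_option logs to):

from tvm.contrib import graph_runtime

# Apply the best schedules recorded during tuning (log_file is a placeholder
# for the log file your tuning_option writes to)
with autotvm.apply_history_best(log_file):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(module, target=target, params={})

ctx = tvm.cpu()
runtime = graph_runtime.create(graph, lib, ctx)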
If you want to know more details about how the schedule template is implemented and how AutoTVM works, please post your implementation for further investigation.
Looking at your declarations of x and w, I started wondering why the shape is (O, I) and not (I, O). I am looking at topi.nn.dense, which matches your input declaration since it computes $XW^{T}$, whereas relay.dense seems to do $X \times W$.
So performance-wise, dense won't be the same as matmul?
I'll try to check the output of matmul vs. dense.
Just to clarify for my understanding: if I want to do a matmul of two tensors A and B using nn.dense, I should do an explicit transpose of B before feeding it to the primitive. And even though this trick gives correct results, the performance of doing this vs. doing the matmul directly could be different.
Also, I think there should be a tutorial for running a Relay function along these lines. Maybe such a tutorial already exists and I just didn't find it?
Thanks a bunch for answering!
The short answer is yes, you have to explicitly transpose B first. I believe this layout is better for DNN workloads and that’s why it is designed in this way.
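To illustrate that trick, here is a minimal sketch in Relay (A and B are just placeholder names, with shapes picked to match the earlier example):

import tvm
from tvm import relay

dtype = "float32"
A = relay.var("A", shape=(2048, 2048), dtype=dtype)
B = relay.var("B", shape=(2048, 2048), dtype=dtype)

# nn.dense computes A * B^T, so transposing B first gives the plain matmul A * B
out = relay.nn.dense(A, relay.transpose(B))
module = relay.Module.from_expr(out)
module = relay.transform.InferType()(module)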
For the tutorial, we have a similar one, but that's for TOPI. I don't recall whether we have one for Relay. You are welcome to contribute one if possible.