AutoScheduler prints "#Target has been reduced to 1 due to too many failures or duplications" and fails to tune

I am trying to tune the following operator graph with TVM AutoScheduler:

import tvm
from tvm import auto_scheduler, topi

@auto_scheduler.register_workload
def subgraph(B_1, B_2, I, J, K):
    I_1 = tvm.te.placeholder((B_1, B_2, K, I), name="I_1")
    I_2 = tvm.te.placeholder((B_1, B_2, K, I), name="I_2")
    A = topi.multiply(I_1, I_2)  # elementwise product of the two inputs
    B = tvm.te.placeholder((B_1, K, B_2, J), name="B")
    B = topi.transpose(B, [0, 2, 1, 3])  # (B_1, K, B_2, J) -> (B_1, B_2, K, J)
    k = tvm.te.reduce_axis((0, K), name="k")
    C = tvm.te.compute(
        (B_1, B_2, I, J),
        lambda b_1, b_2, i, j: tvm.te.sum(A[b_1, b_2, k, i] * B[b_1, b_2, k, j], axis=k),
        name="BatchMatMul"
    )
    return [I_1, I_2, B, C]

I start tuning with:

target = tvm.target.Target("cuda")
task = auto_scheduler.SearchTask(func=subgraph, args=(B_1, B_2, I, J, K), target=target)
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300, timeout=600)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=trials,
    measure_callbacks=[auto_scheduler.RecordToFile(logfile)],
    verbose=2,
    runner=measure_ctx.runner
)
task.tune(tune_option)

This produces the following output, but the tuning won’t terminate:

----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Generate Sketches               #s: 1
Sample Iter: 5  #Pop: 0 #Target: 50     fail_ct: 10240  Time elapsed: 10.68
#Target has been reduced to 25 due to too many failures or duplications
Sample Iter: 10 #Pop: 0 #Target: 25     fail_ct: 20480  Time elapsed: 24.15
#Target has been reduced to 12 due to too many failures or duplications
Sample Iter: 15 #Pop: 0 #Target: 12     fail_ct: 30720  Time elapsed: 37.99
#Target has been reduced to 6 due to too many failures or duplications
Sample Iter: 20 #Pop: 0 #Target: 6      fail_ct: 40960  Time elapsed: 52.01
#Target has been reduced to 3 due to too many failures or duplications
Sample Iter: 25 #Pop: 0 #Target: 3      fail_ct: 51200  Time elapsed: 65.69
#Target has been reduced to 1 due to too many failures or duplications
Sample Iter: 30 #Pop: 0 #Target: 1      fail_ct: 61440  Time elapsed: 80.85
Sample Iter: 35 #Pop: 0 #Target: 1      fail_ct: 71680  Time elapsed: 95.19
Sample Iter: 40 #Pop: 0 #Target: 1      fail_ct: 81920  Time elapsed: 106.89
...

Can you help me figure out what I am doing wrong? Many thanks in advance!

As the message indicates, AutoScheduler is having difficulty finding the first valid schedule. At this stage no on-device measurement has been performed yet, so a valid schedule is identified by static analysis: the lowered IR is analyzed to estimate the number of threads used and the amount of shared memory, and a candidate is rejected if these exceed the resources of your GPU. You may try a smaller shape, or a larger GPU, to see if that resolves the problem.
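
If the static analysis keeps rejecting every candidate, it can also help to check which resource limits it assumes for your device. Below is a minimal sketch of passing explicit limits to the search task via auto_scheduler.HardwareParams; the keyword arguments follow recent TVM and may differ between versions, and the numbers are illustrative placeholders, not recommendations:

hardware_params = auto_scheduler.HardwareParams(
    num_cores=-1,                           # mainly used for CPU targets
    vector_unit_bytes=16,
    cache_line_bytes=64,
    max_shared_memory_per_block=48 * 1024,  # bytes of shared memory per block
    max_local_memory_per_block=12 * 1024,
    max_threads_per_block=1024,
    max_vthread_extent=8,
    warp_size=32,
)
task = auto_scheduler.SearchTask(
    func=subgraph,
    args=(B_1, B_2, I, J, K),
    target=tvm.target.Target("cuda"),
    hardware_params=hardware_params,
)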

Many thanks for your quick reply. I tried reducing the input size to

B_1, B_2, I, J, K = 4, 4, 4, 4, 4

but the problem still persists. Do you have any other idea what the problem could be? Also, does "AutoScheduler has difficulty finding the first valid schedule" mean that it might be able to tune if I let it run long enough?

Usually this issue cannot be resolved by letting it tune for a longer time; it means the task is very tough for the target GPU device. At first glance I don't see an obvious issue with this compute, though. Maybe you can remove the compute piece by piece to see if the original compute is too complex. At the very least it should work with only a te.sum, as in the sketch below.
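
For example, a stripped-down variant that keeps only the reduction could look like the following (a debugging sketch; the function name and the placeholder names X and Y are arbitrary). If this tunes fine, add the other stages back one at a time until the failure reappears:

@auto_scheduler.register_workload
def batch_matmul_only(B_1, B_2, I, J, K):
    # Only the te.sum reduction, with fresh placeholders and no
    # multiply/transpose stages in front of it.
    X = tvm.te.placeholder((B_1, B_2, K, I), name="X")
    Y = tvm.te.placeholder((B_1, B_2, K, J), name="Y")
    k = tvm.te.reduce_axis((0, K), name="k")
    C = tvm.te.compute(
        (B_1, B_2, I, J),
        lambda b_1, b_2, i, j: tvm.te.sum(X[b_1, b_2, k, i] * Y[b_1, b_2, k, j], axis=k),
        name="BatchMatMul"
    )
    return [X, Y, C]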

Also cc @merrymercy @jcf94

I was able to solve this issue myself. It seems the problem comes from these two lines:

B = tvm.te.placeholder((B_1, K, B_2, J), name="B")
B = topi.transpose(B, [0, 2, 1, 3])

Renaming the placeholder to something other than "B" fixes the issue, for example:

I_3 = tvm.te.placeholder((B_1, K, B_2, J), name="I_3")
B = topi.transpose(I_3, [0, 2, 1, 3])

Thanks for helping me sort this out.

@jcf94 Maybe this is because we use the name hint in ComputeDAG? Could we add a checker to Auto-Scheduler to catch this issue, like we did for reduce_axis?
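
Until such a checker exists, a Python-level sketch along these lines could flag duplicated name hints before tuning. This is a hypothetical helper, not part of TVM, and depending on where the collision actually happens it may need to be extended; it only relies on the standard op.name and op.input_tensors attributes:

def warn_duplicate_name_hints(tensors):
    # Hypothetical helper (not part of TVM): walk the TE graph from the
    # given tensors and warn when two distinct ops share a name hint.
    seen = {}
    visited = set()
    stack = [t.op for t in tensors]
    while stack:
        op = stack.pop()
        if op in visited:
            continue
        visited.add(op)
        other = seen.get(op.name)
        if other is not None and not other.same_as(op):
            print(f"Warning: name hint '{op.name}' is used by more than one op")
        seen[op.name] = op
        stack.extend(t.op for t in op.input_tensors)

It could be called on the tensor list returned by the workload, e.g. warn_duplicate_name_hints(subgraph(4, 4, 4, 4, 4)).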