AutoScheduler prints "#Target has been reduced to 1 due to too many failures or duplications" and fails to tune

I am trying to tune the following operator graph with TVM AutoScheduler:

@auto_scheduler.register_workload
def subgraph(B_1, B_2, I, J, K):
    I_1 = tvm.te.placeholder((B_1, B_2, K, I), name="I_1")
    I_2 = tvm.te.placeholder((B_1, B_2, K, I), name="I_2")
    A = topi.multiply(I_1, I_2)
    B = tvm.te.placeholder((B_1, K, B_2, J), name="B")
    B = topi.transpose(B, [0, 2, 1, 3])
    k = tvm.te.reduce_axis((0, K), name="k")
    C = tvm.te.compute(
        (B_1, B_2, I, J),
        lambda b_1, b_2, i, j: tvm.te.sum(A[b_1, b_2, k, i] * B[b_1, b_2, k, j], axis=k),
        name="BatchMatMul"
    )
    return [I_1, I_2, B, C]

I start tuning with:

target = tvm.target.Target("cuda")
task = auto_scheduler.SearchTask(func=subgraph, args=(B_1, B_2, I, J, K), target=target)
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300, timeout=600)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=trials,
    measure_callbacks=[auto_scheduler.RecordToFile(logfile)],
    verbose=2,
    runner=measure_ctx.runner
)
task.tune(tune_option)

This produces the following output, but the tuning won’t terminate:

----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Generate Sketches               #s: 1
Sample Iter: 5  #Pop: 0 #Target: 50     fail_ct: 10240  Time elapsed: 10.68
#Target has been reduced to 25 due to too many failures or duplications
Sample Iter: 10 #Pop: 0 #Target: 25     fail_ct: 20480  Time elapsed: 24.15
#Target has been reduced to 12 due to too many failures or duplications
Sample Iter: 15 #Pop: 0 #Target: 12     fail_ct: 30720  Time elapsed: 37.99
#Target has been reduced to 6 due to too many failures or duplications
Sample Iter: 20 #Pop: 0 #Target: 6      fail_ct: 40960  Time elapsed: 52.01
#Target has been reduced to 3 due to too many failures or duplications
Sample Iter: 25 #Pop: 0 #Target: 3      fail_ct: 51200  Time elapsed: 65.69
#Target has been reduced to 1 due to too many failures or duplications
Sample Iter: 30 #Pop: 0 #Target: 1      fail_ct: 61440  Time elapsed: 80.85
Sample Iter: 35 #Pop: 0 #Target: 1      fail_ct: 71680  Time elapsed: 95.19
Sample Iter: 40 #Pop: 0 #Target: 1      fail_ct: 81920  Time elapsed: 106.89
...

Can you help me figure out what I am doing wrong? Many thanks in advance!

As the message indicates, AutoScheduler is having difficulty finding the first valid schedule. At this stage no on-device measurement has been performed yet, so a valid schedule is identified by static analysis: it analyzes the lowered IR to estimate the number of threads and the amount of shared memory used, and rejects the schedule if either exceeds the resources available on your GPU. You could try a smaller shape or a larger GPU to see whether that resolves the problem.
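To make the idea of that static check concrete, here is a toy sketch in plain Python. It is not TVM's actual verifier: the class names, the function names, and the limit values (1024 threads per block, 48 KiB of shared memory per block, which are common CUDA defaults) are all assumptions chosen for illustration. It only shows why every candidate can be rejected before anything runs on the device, which is what a growing fail_ct with #Pop: 0 looks like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLimits:
    # Hypothetical limits; typical CUDA defaults, not queried from a real device.
    max_threads_per_block: int = 1024
    max_shared_memory_per_block: int = 48 * 1024  # bytes


@dataclass(frozen=True)
class ScheduleEstimate:
    # Resource usage estimated from the lowered IR of one candidate schedule.
    threads_per_block: int
    shared_memory_bytes: int


def is_statically_valid(est, limits=GpuLimits()):
    """Accept a candidate only if it fits within both resource limits."""
    return (
        est.threads_per_block <= limits.max_threads_per_block
        and est.shared_memory_bytes <= limits.max_shared_memory_per_block
    )


def count_failures(candidates, limits=GpuLimits()):
    """Count rejected candidates, analogous in spirit to the fail_ct column."""
    return sum(not is_statically_valid(c, limits) for c in candidates)


candidates = [
    ScheduleEstimate(threads_per_block=256, shared_memory_bytes=16 * 1024),
    ScheduleEstimate(threads_per_block=2048, shared_memory_bytes=0),
    ScheduleEstimate(threads_per_block=128, shared_memory_bytes=64 * 1024),
]
print(count_failures(candidates))
```

If every sampled candidate fails a check like this, the population stays empty and the search never reaches the measurement stage.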

Many thanks for your quick reply. I tried reducing the input size to

B_1, B_2, I, J, K = 4, 4, 4, 4, 4

but the problem persists. Do you have any other idea what the problem could be? Also, when you say AutoScheduler has difficulty finding the first valid schedule, does that mean it might be able to tune if I let it run long enough?

Usually this issue cannot be resolved by letting it tune for longer. It usually means the task is very hard to schedule on the target GPU. At first glance I don’t see an obvious issue with this compute, though. Maybe you can try removing parts of the compute piece by piece to see whether the original compute is too complex. At the very least, it should work with only a te.sum.

Also cc @merrymercy @jcf94

I was able to solve this issue myself. It seems the problem comes from these two lines:

B = tvm.te.placeholder((B_1, K, B_2, J), name="B")
B = topi.transpose(B, [0, 2, 1, 3])

Renaming the placeholder to something other than “B” fixes the issue, i.e.:

I_3 = tvm.te.placeholder((B_1, K, B_2, J), name="I_3")
B = topi.transpose(I_3, [0, 2, 1, 3])

Thanks for helping me sort this out.

@jcf94 Maybe this is because we use the name hint in ComputeDAG? Could we have a checker in AutoScheduler to catch this issue, like we did for reduce_axis?

I have the same problem. I have tried many ComputeDAGs (generated by my own frontend), and it happens every time. For example, here is one of my code snippets:

def fused_52():
	data = te.placeholder((1,3,224,224), name='data', dtype='float32')
	tensor_0 = te.compute((1,3,112,112,7,7), lambda n,c,h1,w1,kh,kw: te.if_then_else(te.all(-3+2*h1+kh>=0,-3+2*h1+kh<224,-3+kw+2*w1>=0,-3+kw+2*w1<224),data[n,c,-3+2*h1+kh,-3+kw+2*w1],0),name = 'tensor_0')
	conv0_weight = te.placeholder((64,3,7,7), name='conv0_weight', dtype='float32')
	c = te.reduce_axis((0,3),name='c')
	kh = te.reduce_axis((0,7),name='kh')
	kw = te.reduce_axis((0,7),name='kw')
	tensor_2 = te.compute((1,64,112,112), lambda n,oc,h1,w1: te.sum(tensor_0[n,c,h1,w1,kh,kw] * conv0_weight[oc,c,kh,kw],axis = [c,kh,kw]), name='tensor_2',)
	return[data,conv0_weight,tensor_2]

It is basically a conv2d, and the auto_scheduler code is similar to the above. My environment is TVM 0.9 with CUDA 10.1.

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:21:01.0 Off |                    0 |
| N/A   43C    P0    28W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

It does not contain any duplicate names, so I cannot fix it by renaming. Is this a bug in Ansor? Thanks.

Have you tried to reduce the tensor size?

Sorry, I made a mistake: this function works fine. It’s another function that reuses the same tensor name, so I can fix it by renaming the tensor. By the way, my frontend generates TE with the shapes used in practical scenarios such as ResNet-50, so reducing the tensor size is not an option for me.

Hi, I have run into the same problem. With llvm as the target AutoScheduler works fine, but when I change the target to cuda, calling “tuner.tune” prints “Target has been reduced to 6 due to too many failures or duplications”. Could you help? The whole log is below.

Begin tuning...
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                       vm_mod_fused_variance_1 |            - |              - |      0 |
|    1 |                    vm_mod_fused_nn_conv2d_add_nn_leaky_relu_3 |            - |              - |      0 |
|    2 |                      vm_mod_fused_nn_conv2d_add_nn_leaky_relu |            - |              - |      0 |
|    3 |                                       vm_mod_fused_variance_3 |            - |              - |      0 |
|    4 |                                    vm_mod_fused_nn_conv2d_add |            - |              - |      0 |
|    5 |                                       vm_mod_fused_variance_2 |            - |              - |      0 |
|    6 |                    vm_mod_fused_nn_conv2d_add_nn_leaky_relu_1 |            - |              - |      0 |
|    7 |                                             vm_mod_fused_mean |            - |              - |      0 |
|    8 |                    vm_mod_fused_nn_conv2d_add_nn_leaky_relu_2 |            - |              - |      0 |
|    9 |                                           vm_mod_fused_mean_1 |            - |              - |      0 |
|   10 |                                           vm_mod_fused_mean_3 |            - |              - |      0 |
|   11 |                                           vm_mod_fused_mean_2 |            - |              - |      0 |
|   12 |                                         vm_mod_fused_variance |            - |              - |      0 |
|   13 |                    vm_mod_fused_nn_conv2d_add_nn_leaky_relu_4 |            - |              - |      0 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: - ms   Trials: 0       Used time : 0 s Next ID: 0
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Generate Sketches               #s: 2
Sample Iter: 5  #Pop: 10        #Target: 50     fail_ct: 10230  Time elapsed: 2.26
#Target has been reduced to 25 due to too many failures or duplications
Sample Iter: 10 #Pop: 10        #Target: 25     fail_ct: 20470  Time elapsed: 4.48
#Target has been reduced to 12 due to too many failures or duplications
Sample Iter: 15 #Pop: 10        #Target: 12     fail_ct: 30710  Time elapsed: 6.73
#Target has been reduced to 6 due to too many failures or duplications
Sample Initial Population       #s: 10  fail_ct: 32758  Time elapsed: 7.18
GA Iter: 0      Max score: 0.8296       Min score: 0.0357       #Pop: 10        #M+: 0  #M-: 0
GA Iter: 4      Max score: 0.8296       Min score: 0.0357       #Pop: 10        #M+: 599        #M-: 5270
EvolutionarySearch              #s: 10  Time elapsed: 1.55
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 10 programs to measure:
..........*ETraceback (most recent call last):
  File "/home/elle/anaconda3/envs/Lu1/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/elle/anaconda3/envs/Lu1/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/elle/.vscode/extensions/ms-python.python-2021.12.1559732655/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/elle/.vscode/extensions/ms-python.python-2021.12.1559732655/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/elle/.vscode/extensions/ms-python.python-2021.12.1559732655/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/home/elle/anaconda3/envs/Lu1/lib/python3.9/runpy.py", line 268, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/elle/anaconda3/envs/Lu1/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/elle/anaconda3/envs/Lu1/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/elle/bing/proj/code/tvm1-Image-Adaptive-3DLUT/use-LUT-model-and-TVM-to-tune.py", line 192, in <module>
    run_tuning()
  File "/home/elle/bing/proj/code/tvm1-Image-Adaptive-3DLUT/use-LUT-model-and-TVM-to-tune.py", line 190, in run_tuning
    tuner.tune(tune_option)
  File "/home/elle/bing/proj/tvm/python/tvm/auto_scheduler/task_scheduler.py", line 360, in tune
    self._tune_task(idx)
  File "/home/elle/bing/proj/tvm/python/tvm/auto_scheduler/task_scheduler.py", line 455, in _tune_task
    measure_inputs, measure_results = self.search_policies[task_idx].continue_search_one_round(
  File "/home/elle/bing/proj/tvm/python/tvm/auto_scheduler/search_policy.py", line 119, in continue_search_one_round
    return _ffi_api.SearchPolicyContinueSearchOneRound(self, num_measure, measurer)
  File "/home/elle/bing/proj/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
ValueError: Traceback (most recent call last):
  [bt] (7) /home/elle/bing/proj/tvm/build/libtvm.so(TVMFuncCall+0x57) [0x7f9261f75457]
  [bt] (6) /home/elle/bing/proj/tvm/build/libtvm.so(+0x2f39b92) [0x7f9260a2db92]
  [bt] (5) /home/elle/bing/proj/tvm/build/libtvm.so(tvm::auto_scheduler::SketchPolicyNode::ContinueSearchOneRound(int, tvm::auto_scheduler::ProgramMeasurer)+0x3a0) [0x7f9260a3cab0]
  [bt] (4) /home/elle/bing/proj/tvm/build/libtvm.so(tvm::auto_scheduler::ProgramMeasurerNode::Measure(tvm::auto_scheduler::SearchTask const&, tvm::auto_scheduler::SearchPolicy const&, tvm::runtime::Array<tvm::auto_scheduler::MeasureInput, void> const&, int)+0x483) [0x7f92609fe7d3]
  [bt] (3) /home/elle/bing/proj/tvm/build/libtvm.so(tvm::auto_scheduler::ProgramMeasurerNode::SilentMeasure(tvm::auto_scheduler::SearchTask const&, tvm::runtime::Array<tvm::auto_scheduler::MeasureInput, void> const&, tvm::runtime::Array<tvm::auto_scheduler::MeasureResult, void>*)+0x101) [0x7f92609fcd21]
  [bt] (2) /home/elle/bing/proj/tvm/build/libtvm.so(tvm::auto_scheduler::LocalRunnerNode::Run(tvm::runtime::Array<tvm::auto_scheduler::MeasureInput, void> const&, tvm::runtime::Array<tvm::auto_scheduler::BuildResult, void> const&, int)+0x19a) [0x7f92609fdeaa]
  [bt] (1) /home/elle/bing/proj/tvm/build/libtvm.so(+0x2be9122) [0x7f92606dd122]
  [bt] (0) /home/elle/bing/proj/tvm/build/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x2c) [0x7f9261f9333c]
  File "/home/elle/bing/proj/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/elle/bing/proj/tvm/python/tvm/auto_scheduler/measure.py", line 1026, in local_run
    res = call_func_with_timeout(
  File "/home/elle/bing/proj/tvm/python/tvm/auto_scheduler/utils.py", line 293, in call_func_with_timeout
    worker.send(func, args, kwargs, timeout)
  File "/home/elle/bing/proj/tvm/python/tvm/contrib/popen_pool.py", line 244, in send
    self._writer.write(struct.pack("<i", len(data)))
ValueError: write to closed file