Runtime too long for the demo on GPU

Expected behavior

I installed the latest TVM from source and managed to compile ResNet50 following your tutorial. With the target flag set to "llvm", I did observe a speedup from resnet50-v2-7-tvm.tar to resnet50-v2-7_autotuned.tar.

However, with the target flag set to "cuda", the autotuned version is slower than the non-autotuned version on GPU. Has anyone observed similar behavior before, or did I do something wrong?

Environment

Ubuntu 20.04 with 3080 Ti
CUDA 11.2 with driver 460.91.03
TVM version: 0.8.dev0
LLVM version: 13.0.0

Steps to reproduce

Steps to generate and test the non-autotuned version:

tvmc compile --target "cuda" --output resnet50-v2-7-tvm-cuda.tar resnet50-v2-7.onnx
tvmc run --device cuda --inputs imagenet_cat.npz --output predictions.npz --print-time --repeat 100 resnet50-v2-7-tvm-cuda.tar

My terminal output for the non-autotuned version:

Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
   3.4426       3.4623       5.6511       3.0480       0.4118

Steps to generate and test the autotuned version:

tvmc tune --target "cuda" --output resnet50-v2-7-autotuner_records-cuda.json resnet50-v2-7.onnx
tvmc compile --target "cuda" --tuning-records resnet50-v2-7-autotuner_records-cuda.json --output resnet50-v2-7-tvm_autotuned-cuda.tar resnet50-v2-7.onnx
tvmc run --device cuda --inputs imagenet_cat.npz --output predictions.npz --print-time --repeat 100 resnet50-v2-7-tvm_autotuned-cuda.tar

My terminal output for the autotuned version:

Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
   4.8350       5.0163       7.8554       4.4040       0.5398
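For a quick sanity check, the slowdown implied by the two mean latencies above:

```python
# Mean latencies reported by tvmc (ms), copied from the two summaries above
untuned_mean = 3.4426     # non-autotuned CUDA build
autotuned_mean = 4.8350   # autotuned CUDA build

print(f"autotuned / untuned = {autotuned_mean / untuned_mean:.2f}x")  # ~1.40x
```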

As shown above, the autotuned version takes longer than the non-autotuned one. I am new to TVM, so my "bug" might be naive. Any help would be greatly appreciated. Thanks in advance! :slight_smile:


Hey sugartom, this is highly interesting. I believe I have seen this before, though not personally: @rkimball has encountered it. IIRC the problem was due to layout transformations (e.g. NHWC → NCHW) being inserted when combining kernels with incompatible layouts.

One solution is graph-level tuning, which searches over combinations of kernels with different layout transforms: https://github.com/apache/tvm/issues/1585
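To illustrate what graph-level tuning buys you (a toy sketch with made-up costs, not TVM's actual algorithm): picking the per-operator fastest kernels can force extra layout transforms, while a dynamic program over the whole chain can choose a slightly slower kernel to avoid a transform.

```python
# Toy graph-level layout selection on a chain of two ops.
# kernel_cost[i][layout] = best tuned time (hypothetical) of op i in that layout.
kernel_cost = [
    {"NCHW": 0.7, "NHWC": 0.6},
    {"NCHW": 0.8, "NHWC": 0.9},
]
TRANSFORM = 0.4  # hypothetical cost of one layout transform between adjacent ops

def best_total(kernel_cost, transform):
    # dp[layout] = cheapest total cost of the ops so far, ending in `layout`
    dp = dict(kernel_cost[0])
    for costs in kernel_cost[1:]:
        dp = {
            lay: costs[lay] + min(dp[prev] + (0 if prev == lay else transform)
                                  for prev in dp)
            for lay in costs
        }
    return min(dp.values())

# Greedily taking the fastest kernel per op (0.6 + 0.8) forces a transform: 1.8.
# The DP instead stays in NHWC (0.6 + 0.9) and skips the transform:
print(best_total(kernel_cost, TRANSFORM))  # 1.5
```

TVM's DPTuner solves a much richer version of this problem over the whole graph, with measured layout-transform costs.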

From the tutorials:

from tvm import relay
from tvm.autotvm.graph_tuner import DPTuner, PBQPTuner

# Use graph tuner to achieve graph-level optimal schedules.
# Set use_DP=False if it takes too long to finish.
# `input_name` and `target` are defined earlier in the tutorial.
def tune_graph(graph, dshape, records, opt_sch_file, use_DP=True):
    target_op = [
        relay.op.get("nn.conv2d"),
    ]
    Tuner = DPTuner if use_DP else PBQPTuner
    executor = Tuner(graph, {input_name: dshape}, records, target_op, target)
    executor.benchmark_layout_transform(min_exec_num=2000)
    executor.run()
    executor.write_opt_sch2record_file(opt_sch_file)

I have noticed a similar problem with the AutoTVM tutorial. I ran the original autotvm_relay_x86.py file unchanged except for the target line, on TVM at commit c00ce37 with LLVM 13.0.0 and CUDA 11.4.

Timings

target = "llvm"

optimized: {'mean': 57.881524069998704, 'median': 57.75254489999497, 'std': 0.30390766541859005}
unoptimized: {'mean': 64.81794787000581, 'median': 64.85998775001462, 'std': 0.18545348916908125}

target = "llvm -mcpu=skylake-avx512"

optimized: {'mean': 29.11659942001279, 'median': 29.053596499989, 'std': 0.19929450522023998}
unoptimized: {'mean': 24.937650030005898, 'median': 24.873016900028233, 'std': 0.16893309752396393}

target = "cuda"

optimized: {'mean': 18.921251980018496, 'median': 21.12027845000739, 'std': 6.681149567023143}
unoptimized: {'mean': 6.638219979995483, 'median': 7.205596800031344, 'std': 2.198990796372162}
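As an aside, both CUDA measurements are very noisy compared to the llvm ones. Computing the relative standard deviation over the numbers reported above:

```python
# (mean, std) pairs in ms, copied from the timing summaries above
results = {
    "cuda optimized":   (18.921251980018496, 6.681149567023143),
    "cuda unoptimized": (6.638219979995483, 2.198990796372162),
    "llvm optimized":   (57.881524069998704, 0.30390766541859005),
}
for name, (mean, std) in results.items():
    print(f"{name}: relative std = {std / mean:.1%}")
# The cuda runs vary by ~33-35%, llvm by under 1%, so the cuda
# means should be taken with a grain of salt.
```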

Complete output
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
{'mean': 6.638219979995483, 'median': 7.205596800031344, 'std': 2.198990796372162}
class='n02123045 tabby, tabby cat' with probability=0.610552
class='n02123159 tiger cat' with probability=0.367179
class='n02124075 Egyptian cat' with probability=0.019365
class='n02129604 tiger, Panthera tigris' with probability=0.001273
class='n04040759 radiator' with probability=0.000261

[Task  1/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task  1/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 13.75 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 3, 224, 224), 'float32'), ('TENSOR', (64, 3, 7, 7), 'float32'), (2, 2), (3, 3, 3, 3), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 3, 224, 224), 'float32'), ('TENSOR', (64, 3, 7, 7), 'float32'), (2, 2), (3, 3, 3, 3), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_7doepwif.log.
 Done.

[Task  2/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task  2/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 14.36 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 64, 56, 56), 'float32'), ('TENSOR', (64, 64, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 64, 56, 56), 'float32'), ('TENSOR', (64, 64, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_wohln07u.log.
 Done.

[Task  3/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task  3/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 15.52 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 64, 56, 56), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 64, 56, 56), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_rzg9iv15.log.
 Done.

[Task  4/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task  4/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 10.91 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw_winograd.cuda, args=(('TENSOR', (1, 64, 56, 56), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw_winograd.cuda', ('TENSOR', (1, 64, 56, 56), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_4naydwvt.log.
 Done.

[Task  5/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task  5/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 11.21 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 64, 56, 56), 'float32'), ('TENSOR', (256, 64, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 64, 56, 56), 'float32'), ('TENSOR', (256, 64, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_fqta3hmb.log.
 Done.

[Task  6/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task  6/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 10.03 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (64, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (64, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_xrzw_zb1.log.
 Done.

[Task  7/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task  7/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 10.57 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (128, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (128, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_yx3111i7.log.
 Done.

[Task  8/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task  8/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 10.98 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 128, 56, 56), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 128, 56, 56), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_75eb995i.log.
 Done.

[Task  9/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task  9/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 8.82 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (512, 256, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (512, 256, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_b4b6seui.log.
 Done.

[Task 10/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 10/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 12.91 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 128, 28, 28), 'float32'), ('TENSOR', (512, 128, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 128, 28, 28), 'float32'), ('TENSOR', (512, 128, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_6mh2ss5h.log.
 Done.

[Task 11/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 11/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 12.04 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (128, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (128, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_upcbvqql.log.
 Done.

[Task 12/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 12/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 18.84 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 128, 28, 28), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 128, 28, 28), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_cd_pbydl.log.

[Task 13/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 13/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 9.46 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw_winograd.cuda, args=(('TENSOR', (1, 128, 28, 28), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw_winograd.cuda', ('TENSOR', (1, 128, 28, 28), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_j8bamdxe.log.
 Done.

[Task 14/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 14/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 13.52 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (256, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (256, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_crae7902.log.
 Done.

[Task 15/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.

[Task 15/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 13.58 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 256, 28, 28), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 28, 28), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_7_e_pdx1.log.

[Task 16/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 16/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 13.90 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (1024, 512, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (1024, 512, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_i6kaclzm.log.
 Done.

[Task 17/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 17/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 14.02 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 256, 14, 14), 'float32'), ('TENSOR', (1024, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 14, 14), 'float32'), ('TENSOR', (1024, 256, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_w8f3rgi9.log.
 Done.

[Task 18/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 18/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 8.49 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 1024, 14, 14), 'float32'), ('TENSOR', (256, 1024, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 1024, 14, 14), 'float32'), ('TENSOR', (256, 1024, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_zbsfyik6.log.
 Done.

[Task 19/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 19/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 11.89 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 256, 14, 14), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 14, 14), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_qu68tp6r.log.
 Done.

[Task 20/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 20/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 11.14 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw_winograd.cuda, args=(('TENSOR', (1, 256, 14, 14), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw_winograd.cuda', ('TENSOR', (1, 256, 14, 14), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_l15zjzzj.log.
 Done.

[Task 21/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 21/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 4.07 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 1024, 14, 14), 'float32'), ('TENSOR', (512, 1024, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 1024, 14, 14), 'float32'), ('TENSOR', (512, 1024, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_ycqg1qr0.log.
 Done.

[Task 22/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 22/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 10.27 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 512, 14, 14), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 14, 14), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (2, 2), (1, 1, 1, 1), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_ybjo14d5.log.
 Done.

[Task 23/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 23/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 15.11 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 1024, 14, 14), 'float32'), ('TENSOR', (2048, 1024, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 1024, 14, 14), 'float32'), ('TENSOR', (2048, 1024, 1, 1), 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_vh0nsr41.log.

[Task 24/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 24/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 3.57 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 512, 7, 7), 'float32'), ('TENSOR', (2048, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 7, 7), 'float32'), ('TENSOR', (2048, 512, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_cjrlba0_.log.
 Done.

[Task 25/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 25/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 10.34 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 2048, 7, 7), 'float32'), ('TENSOR', (512, 2048, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 2048, 7, 7), 'float32'), ('TENSOR', (512, 2048, 1, 1), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_fqcr_b68.log.
 Done.

[Task 26/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.

[Task 26/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 13.92 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw.cuda, args=(('TENSOR', (1, 512, 7, 7), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 7, 7), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_9ksvboer.log.

[Task 27/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 27/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 12.91 sWARNING:root:Could not find any valid schedule for task Task(func_name=conv2d_nchw_winograd.cuda, args=(('TENSOR', (1, 512, 7, 7), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nchw_winograd.cuda', ('TENSOR', (1, 512, 7, 7), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_dear1xqt.log.
 Done.

[Task 28/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s
[Task 28/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (10/10) | 8.51 sWARNING:root:Could not find any valid schedule for task Task(func_name=dense_small_batch.gpu, args=(('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32'), kwargs={}, workload=('dense_small_batch.gpu', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_rmg3_378.log.
 Done.
 Done.
 Done.
class='n02123045 tabby, tabby cat' with probability=0.610552
class='n02123159 tiger cat' with probability=0.367179
class='n02124075 Egyptian cat' with probability=0.019365
class='n02129604 tiger, Panthera tigris' with probability=0.001273
class='n04040759 radiator' with probability=0.000261
optimized: {'mean': 18.921251980018496, 'median': 21.12027845000739, 'std': 6.681149567023143}
unoptimized: {'mean': 6.638219979995483, 'median': 7.205596800031344, 'std': 2.198990796372162}

Issue

While with AVX-512 the optimized model runs slightly slower than the unoptimized one, autotuning itself at least appears to work.

With CUDA, though, the output does not show any measured speed in GFLOPS and prints warnings like this one:

WARNING:root:Could not find any valid schedule for task Task(func_name=dense_small_batch.gpu, args=(('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32'), kwargs={}, workload=('dense_small_batch.gpu', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_rmg3_378.log.

According to the log, the issue stems from

  File "/home/mkroening/Development/tvm/python/tvm/autotvm/measure/measure_methods.py", line 814, in verify_pass
    raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel
Contents of /tmp/tvm_tuning_errors_dear1xqt.log:
Traceback (most recent call last):
  50: 0xffffffffffffffff
  49: _start
  48: __libc_start_main
  47: Py_BytesMain
  46: Py_RunMain
  45: 0x00000000006b6fa1
  44: PyObject_Call
  43: _PyFunction_Vectorcall
  42: _PyEval_EvalCodeWithName
  41: _PyEval_EvalFrameDefault
  40: _PyFunction_Vectorcall
  39: _PyEval_EvalCodeWithName
  38: _PyEval_EvalFrameDefault
  37: 0x00000000005c552f
  36: 0x0000000000600f53
  35: PyEval_EvalCode
  34: _PyEval_EvalCodeWithName
  33: _PyEval_EvalFrameDefault
  32: _PyFunction_Vectorcall
  31: _PyEval_EvalCodeWithName
  30: _PyEval_EvalFrameDefault
  29: PyObject_Call
  28: _PyFunction_Vectorcall
  27: _PyEval_EvalCodeWithName
  26: _PyEval_EvalFrameDefault
  25: 0x000000000050ad7b
  24: _PyFunction_Vectorcall
  23: _PyEval_EvalFrameDefault
  22: 0x0000000000504f8c
  21: 0x0000000000614e8c
  20: 0x0000000000614d12
  19: 0x00000000005004f7
  18: _PyEval_EvalFrameDefault
  17: _PyFunction_Vectorcall
  16: _PyEval_EvalFrameDefault
  15: _PyFunction_Vectorcall
  14: _PyEval_EvalFrame
Traceback (most recent call last):
  9: TVMFuncCall
  8: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::te::Schedule, tvm::runtime::Array<tvm::runtime::ObjectRef, void> const&, tvm::runtime::String const&, tvm::runtime::Map<tvm::te::Tensor, tvm::tir::Buffer, void, void> const&, bool)>::AssignTypedLambda<tvm::$_5>(tvm::$_5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  7: tvm::LowerSchedule(tvm::te::Schedule, tvm::runtime::Array<tvm::runtime::ObjectRef, void> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<tvm::te::Tensor, tvm::tir::Buffer, std::hash<tvm::te::Tensor>, std::equal_to<tvm::te::Tensor>, std::allocator<std::pair<tvm::te::Tensor const, tvm::tir::Buffer> > > const&, bool)
  6: tvm::LowerWithPassList(tvm::IRModule, tvm::runtime::Array<tvm::transform::Pass, void>)
  5: tvm::transform::Pass::operator()(tvm::IRModule) const
  4: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  3: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  2: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  1: tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  0: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), TVMFuncCreateFromCFunc::$_2>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
  File "/home/mkroening/Development/tvm/python/tvm/autotvm/measure/measure_methods.py", line 814, in verify_pass
    raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel
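Given those warnings, it may be worth checking whether every trial in the records file failed or only some. A sketch, assuming the standard AutoTVM JSON log format, in which each line is one measurement record and `result[1]` is the error code (0 means a valid schedule); the filename is the one produced by the tvmc tune step above:

```python
import json
from collections import Counter

def summarize_records(path):
    """Tally measurement outcomes in an AutoTVM JSON tuning log."""
    outcomes = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            error_no = record["result"][1]  # 0 = valid schedule, nonzero = failure
            outcomes["ok" if error_no == 0 else f"error_{error_no}"] += 1
    return outcomes

# e.g. print(summarize_records("resnet50-v2-7-autotuner_records-cuda.json"))
```

If every record carries a nonzero error code, compilation silently falls back to the untuned schedules, which would match the warnings above.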
Traceback (most recent call last):
  50: 0xffffffffffffffff
  49: _start
  48: __libc_start_main
  47: Py_BytesMain
  46: Py_RunMain
  45: 0x00000000006b6fa1
  44: PyObject_Call
  43: _PyFunction_Vectorcall
  42: _PyEval_EvalCodeWithName
  41: _PyEval_EvalFrameDefault
  40: _PyFunction_Vectorcall
  39: _PyEval_EvalCodeWithName
  38: _PyEval_EvalFrameDefault
  37: 0x00000000005c552f
  36: 0x0000000000600f53
  35: PyEval_EvalCode
  34: _PyEval_EvalCodeWithName
  33: _PyEval_EvalFrameDefault
  32: _PyFunction_Vectorcall
  31: _PyEval_EvalCodeWithName
  30: _PyEval_EvalFrameDefault
  29: PyObject_Call
  28: _PyFunction_Vectorcall
  27: _PyEval_EvalCodeWithName
  26: _PyEval_EvalFrameDefault
  25: 0x000000000050ad7b
  24: _PyFunction_Vectorcall
  23: _PyEval_EvalFrameDefault
  22: 0x0000000000504f8c
  21: 0x0000000000614e8c
  20: 0x0000000000614d12
  19: 0x00000000005004f7
  18: _PyEval_EvalFrameDefault
  17: _PyFunction_Vectorcall
  16: _PyEval_EvalFrameDefault
  15: _PyFunction_Vectorcall
  14: _PyEval_EvalFrame
Traceback (most recent call last):
  9: TVMFuncCall
  8: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::te::Schedule, tvm::runtime::Array<tvm::runtime::ObjectRef, void> const&, tvm::runtime::String const&, tvm::runtime::Map<tvm::te::Tensor, tvm::tir::Buffer, void, void> const&, bool)>::AssignTypedLambda<tvm::$_5>(tvm::$_5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  7: tvm::LowerSchedule(tvm::te::Schedule, tvm::runtime::Array<tvm::runtime::ObjectRef, void> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<tvm::te::Tensor, tvm::tir::Buffer, std::hash<tvm::te::Tensor>, std::equal_to<tvm::te::Tensor>, std::allocator<std::pair<tvm::te::Tensor const, tvm::tir::Buffer> > > const&, bool)
  6: tvm::LowerWithPassList(tvm::IRModule, tvm::runtime::Array<tvm::transform::Pass, void>)
  5: tvm::transform::Pass::operator()(tvm::IRModule) const
  4: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  3: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  2: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  1: tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  0: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), TVMFuncCreateFromCFunc::$_2>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
  File "/home/mkroening/Development/tvm/python/tvm/autotvm/measure/measure_methods.py", line 814, in verify_pass
    raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel
Traceback (most recent call last):
  50: 0xffffffffffffffff
  49: _start
  48: __libc_start_main
  47: Py_BytesMain
  46: Py_RunMain
  45: 0x00000000006b6fa1
  44: PyObject_Call
  43: _PyFunction_Vectorcall
  42: _PyEval_EvalCodeWithName
  41: _PyEval_EvalFrameDefault
  40: _PyFunction_Vectorcall
  39: _PyEval_EvalCodeWithName
  38: _PyEval_EvalFrameDefault
  37: 0x00000000005c552f
  36: 0x0000000000600f53
  35: PyEval_EvalCode
  34: _PyEval_EvalCodeWithName
  33: _PyEval_EvalFrameDefault
  32: _PyFunction_Vectorcall
  31: _PyEval_EvalCodeWithName
  30: _PyEval_EvalFrameDefault
  29: PyObject_Call
  28: _PyFunction_Vectorcall
  27: _PyEval_EvalCodeWithName
  26: _PyEval_EvalFrameDefault
  25: 0x000000000050ad7b
  24: _PyFunction_Vectorcall
  23: _PyEval_EvalFrameDefault
  22: 0x0000000000504f8c
  21: 0x0000000000614e8c
  20: 0x0000000000614d12
  19: 0x00000000005004f7
  18: _PyEval_EvalFrameDefault
  17: _PyFunction_Vectorcall
  16: _PyEval_EvalFrameDefault
  15: _PyFunction_Vectorcall
  14: _PyEval_EvalFrame
Traceback (most recent call last):
  9: TVMFuncCall
  8: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::te::Schedule, tvm::runtime::Array<tvm::runtime::ObjectRef, void> const&, tvm::runtime::String const&, tvm::runtime::Map<tvm::te::Tensor, tvm::tir::Buffer, void, void> const&, bool)>::AssignTypedLambda<tvm::$_5>(tvm::$_5, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  7: tvm::LowerSchedule(tvm::te::Schedule, tvm::runtime::Array<tvm::runtime::ObjectRef, void> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<tvm::te::Tensor, tvm::tir::Buffer, std::hash<tvm::te::Tensor>, std::equal_to<tvm::te::Tensor>, std::allocator<std::pair<tvm::te::Tensor const, tvm::tir::Buffer> > > const&, bool)
  6: tvm::LowerWithPassList(tvm::IRModule, tvm::runtime::Array<tvm::transform::Pass, void>)
  5: tvm::transform::Pass::operator()(tvm::IRModule) const
  4: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  3: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  2: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  1: tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  0: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), TVMFuncCreateFromCFunc::$_2>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
  File "/home/mkroening/Development/tvm/python/tvm/autotvm/measure/measure_methods.py", line 814, in verify_pass
    raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel

Is there something obviously wrong with my setup? Is the GPU kernel invalid because the setup in the tutorial is inherently incompatible with GPUs, or is this a TVM-internal issue?

Hi, I have run into the same issue mentioned by @sugartom and @mkroening. I tried to log some more detailed error messages at [this](https://github.com/apache/tvm/blob/main/src/tir/analysis/verify_gpu_code.cc#L288) using [those](https://github.com/apache/tvm/blob/main/src/tir/analysis/verify_gpu_code.cc#L301).

The detailed logs look something like this:

Used shared memory per block (425472) is greater than the allowed maximum (49152)
Extent of threadIdx.z (128) is greater than maximum allowed (64);
Used threads per block (2048) is greater than the allowed maximum (1024)
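
Those three messages map directly onto hard per-block CUDA device limits, so any candidate schedule that exceeds one of them is rejected by the verifier. Below is a minimal sketch of that comparison; the usage numbers are copied from the logs above and the limits are the ones the verifier reported (on a real device they can be queried through attributes such as `tvm.cuda(0).max_threads_per_block` and `tvm.cuda(0).max_shared_memory_per_block`):

```python
def kernel_fits(usage, limits):
    """Return a list of constraint violations; an empty list means the
    candidate kernel fits within the device limits."""
    return [
        f"{name}: used {usage[name]} is greater than the allowed maximum {limits[name]}"
        for name in limits
        if usage.get(name, 0) > limits[name]
    ]

# Limits reported by the verifier vs. usage from the logs above:
limits = {"shared_mem_per_block": 49152, "threads_per_block": 1024, "threadIdx.z": 64}
usage = {"shared_mem_per_block": 425472, "threads_per_block": 2048, "threadIdx.z": 128}
violations = kernel_fits(usage, limits)  # all three limits are exceeded
```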

My GPU is a 2080 Ti with “NVIDIA-SMI 430.40 Driver Version: 430.40 CUDA Version: 10.1”. From those logs, I guess TVM first generates candidate schedules that are all invalid on this GPU, which causes them to be dropped early, so TVM falls back to an unoptimized schedule when running the tasks. Could you help us look deeper into this issue? @AndrewZhaoLuo
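
One way to check that hypothesis is to count how many tuning records actually measured successfully. A hedged sketch, assuming the AutoTVM JSON log format in which each line's `result` field is `[costs, error_no, all_cost, timestamp]` and `error_no == 0` means the measurement succeeded; `resnet50-v2-7-autotuner_records-cuda.json` is the file produced by `tvmc tune` earlier in this thread:

```python
import json

def count_valid_records(path):
    """Count AutoTVM tuning records whose measurement succeeded.

    Each line of the log is a JSON object; its "result" field is
    [costs, error_no, all_cost, timestamp], where error_no == 0
    indicates a valid measurement.
    """
    valid = total = 0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            total += 1
            record = json.loads(line)
            if record["result"][1] == 0:
                valid += 1
    return valid, total

# If nearly all records are invalid, the compiler has almost nothing
# to pick from and may fall back to an untuned schedule, e.g.:
# valid, total = count_valid_records("resnet50-v2-7-autotuner_records-cuda.json")
```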