[MetaSchedule] Tuning with Winograd Conv2d

When I use MetaSchedule to tune topi's winograd_conv2d_nhwc op, it never finds a valid schedule. The output looks like this:

 ID | Name |       FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done
---------------------------------------------------------------------------------------------------------
  0 | main | 2348810240 |      1 |            N/A |          N/A |                   N/A |    512 |
---------------------------------------------------------------------------------------------------------

I am using the latest TVM (commit 7eb45e).
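
For reference, this is roughly how I drive the tuner. It is only a sketch: the work_dir and the target tag are placeholders, and I am assuming the ms.tune_tir / ms.tir_integration.compile_tir entry points from recent main:

import tvm
from tvm import meta_schedule as ms

# mod is the IRModule holding the winograd_conv2d_nhwc PrimFunc
# (see the TE snippet further below).
target = tvm.target.Target("nvidia/nvidia-a100")
database = ms.tune_tir(
    mod=mod,
    target=target,
    work_dir="./ms_winograd",
    max_trials_global=512,
)
sch = ms.tir_integration.compile_tir(database, mod, target)
if sch is None:
    print("no valid schedule found")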

Another issue is that if I tune the winograd_conv2d without pre-computing the kernel, i.e., with the "HWIO" kernel layout, I get the following error:

  4: tvm::meta_schedule::TaskSchedulerNode::Tune(tvm::runtime::Array<tvm::meta_schedule::TuneContext, void>, tvm::runtime::Array<tvm::FloatImm, void>, int, int, int, tvm::meta_schedule::Builder, tvm::meta_schedule::Runner, tvm::runtime::Array<tvm::meta_schedule::MeasureCallback, void>, tvm::runtime::Optional<tvm::meta_schedule::Database>, tvm::runtime::Optional<tvm::meta_schedule::CostModel>)
  3: tvm::meta_schedule::PostOrderApplyNode::GenerateDesignSpace(tvm::IRModule const&)
  2: tvm::meta_schedule::CrossThreadReductionNode::Apply(tvm::tir::Schedule const&, tvm::tir::BlockRV const&)
  1: tvm::meta_schedule::CrossThreadReductionNode::GetThreadIdxExtentFromTrace(tvm::tir::Trace const&)
  0: tvm::runtime::Array<tvm::runtime::ObjectRef, void>::operator[](long) const
  File "/mnt/source/blade_new_gemm/tvm/include/tvm/runtime/container/array.h", line 414
InternalError: Check failed: (0 <= i && i < p->size_) is false: IndexError: indexing 3 on an array of size 3
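
The crash happens while PostOrderApply is generating the design space, inside the CrossThreadReduction rule. A quicker way to reproduce it than a full tuning run is to generate the design space directly; this is only a sketch, assuming the post-refactor TuneContext API and a placeholder target tag:

import tvm
from tvm import meta_schedule as ms

# mod holds the HWIO (not pre-computed) winograd PrimFunc; the same
# IndexError surfaces from CrossThreadReductionNode::Apply here.
ctx = ms.TuneContext(
    mod=mod,
    target=tvm.target.Target("nvidia/nvidia-a100"),
    space_generator="post-order-apply",
    search_strategy="evolutionary",
    task_name="main",
)
design_spaces = ctx.generate_design_space()
for i, space in enumerate(design_spaces):
    print(f"design space #{i}")
    space.trace.show()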

I am sure that the layout and the TE compute code are correct, as I verified them by building and executing them directly on the LLVM target.

Also, the conv shapes are as follows:

import tvm
from tvm import te

N, H, W, C = 1, 64, 64, 320
O, h, w = 320, 3, 3
P, S = 1, 1  # padding, stride
dev = tvm.cuda()
dtype = "float16"
use_winograd = False
pre_computed = False

data = te.placeholder((N, H, W, C), dtype=dtype, name="inputs")
if not use_winograd:
    # direct conv2d kernel shape
    kernel_shape = (O, h, w, C)
else:
    if pre_computed:
        # pre-transformed kernel: (alpha, alpha, CI, CO), alpha = tile_size + h - 1
        tile_size = 4
        kernel_shape = (h - 1 + tile_size, w - 1 + tile_size, C, O)
    else:
        # untransformed kernel: HWIO layout
        kernel_shape = (h, w, C, O)
kernel = te.placeholder(kernel_shape, dtype=dtype, name="weights")
bias = te.placeholder((O,), dtype=dtype, name="bias")
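
For completeness, this is roughly how the placeholders above are turned into the module that gets tuned. It is a sketch that assumes topi.nn.conv2d_winograd_nhwc (with its pre_computed flag) and te.create_prim_func; when pre_computed is False, the module also contains the kernel-transform stage, which is the configuration that hits the CrossThreadReduction error above:

from tvm import topi

# NHWC winograd compute from topi; pre_computed selects whether the kernel
# placeholder is already transformed into (alpha, alpha, CI, CO).
out = topi.nn.conv2d_winograd_nhwc(
    data, kernel, S, P, 1, dtype, pre_computed=pre_computed
)
out = topi.add(out, bias)  # broadcast bias over (N, H, W, O)

# Lower the TE compute to a PrimFunc and wrap it in an IRModule for tuning.
func = te.create_prim_func([data, kernel, bias, out])
mod = tvm.IRModule({"main": func})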

I also ran into problems with MetaSchedule tuning of conv2d_winograd. Did you have any success?