Auto-scheduler timeout & no valid schedule for GEMM

I am trying to run the auto-scheduler tutorial code for GEMM. However, I encounter the following problem during the auto-scheduler search process.

Placeholder: A, B
parallel i.0@j.0@ (0,48)
  for i.1 (0,739)
    for k.0 (0,4576)
      for j.2 (0,22)
        for k.1 (0,2)
          for j.3 (0,26)
            matmul = ...
  for i.1 (0,739)
    for j.1 (0,572)
      out = ...

No: 504 GFLOPS: 0.00 / 0.00     results: MeasureResult(error_type:RuntimeDeviceError, error_msg:Traceback (most recent call last):
  File "/some_path/apache-tvm-src-v0.8.0.rc0/python/tvm/auto_scheduler/utils.py", line 295, in call_func_with_timeout
    res = worker.recv()
  File "/some_path/apache-tvm-src-v0.8.0.rc0/python/tvm/contrib/popen_pool.py", line 275, in recv
    raise TimeoutError()
TimeoutError
, all_cost:10.29, Tstamp:1646717264.28)

Moreover, I encounter the following error when I try to test the latency of the best schedule.

Traceback (most recent call last):
  File "auto_scheduler_matmul.py", line 139, in <module>
    sch, args = task.apply_best(log_file)
  File "/some_path/apache-tvm-src-v0.8.0.rc0/python/tvm/auto_scheduler/search_task.py", line 521, in apply_best
    raise RuntimeError(
RuntimeError: Cannot find any valid schedule for ["matmul", 2217, 9152, 9152, "float32"] in file ./search_log/matmul_M_2217_K_9152_N_9152.json

Here is the code (similar to what the tutorial does). I modified it a little bit, but I think the overall logic is the same.

import os
import numpy as np
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload  # Note the auto_scheduler decorator
def matmul(M, K, N, dtype):
    A = te.placeholder((M, K), name="A", dtype=dtype)
    B = te.placeholder((K, N), name="B", dtype=dtype)
    k = te.reduce_axis((0, K), name="k")
    matmul = te.compute(
        (M, N),
        lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
        name="matmul",
        attrs={"layout_free_placeholders": [B]},  # enable automatic layout transform for tensor B
    )
    out = te.compute((M, N), lambda i, j: matmul[i, j], name="out")

    return [A, B, out]

target = tvm.target.Target("llvm -mcpu=core-avx2")
M = 2217
K = 9152
N = 9152
task = tvm.auto_scheduler.SearchTask(func=matmul, args=(M, K, N, "float32"), target=target)
log_file = "./search_log/matmul"+"_M_"+str(M)+"_K_"+str(K)+"_N_"+str(N)+".json"
tune_option = auto_scheduler.TuningOptions(
                    num_measure_trials=1000,
                    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
                    verbose=0,
                )

# Run auto-tuning (search)
task.tune(tune_option)
# Apply the best schedule
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)
a_np = np.random.uniform(size=(M, K)).astype(np.float32)
b_np = np.random.uniform(size=(K, N)).astype(np.float32)
sparsity = 0.8
a_np[a_np < sparsity] = 0
b_np[b_np < sparsity] = 0
out_np = a_np.dot(b_np)

dev = tvm.cpu()
a_tvm = tvm.nd.array(a_np, device=dev)
b_tvm = tvm.nd.array(b_np, device=dev)
out_tvm = tvm.nd.empty(out_np.shape, device=dev)
func(a_tvm, b_tvm, out_tvm)

# Check results

np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3)
print("check result done")
# Evaluate execution time.
evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
print("Execution time of this operator: %.3f ms"% (np.median(evaluator(a_tvm, b_tvm, out_tvm).results) * 1000) )

I am pretty confused about why this problem shows up. There are similar posts on this problem, but I don't think those answers solve the issue here.

I am running the schedule search targeting an Intel CPU. I built TVM from the v0.8 source with LLVM 13.0 on Ubuntu 18.04.

Another question I am curious about: is the time reported by the TVM evaluator precise? Will the Python interface influence the latency, or is the latency measured directly on the low-level code? Will the auto-tuning process generate synthetic matrices to collect hardware performance data?

Any help is appreciated. Thanks in advance!

Since your shape is pretty large and the target is a CPU with just AVX2, a timeout seems possible to me. The default timeout, which covers lowering, compilation and measurement, is 10 seconds. You could try a smaller shape for testing. If that works, meaning the flow is fine, then you can just enlarge the timeout.
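
For example, something like this (a rough sketch against the v0.8 API, reusing the log_file and imports from your script; pick timeout values that fit your machine):

builder = auto_scheduler.LocalBuilder(timeout=60)  # compilation timeout in seconds
runner = auto_scheduler.LocalRunner(timeout=60)    # measurement timeout in seconds
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=1000,
    builder=builder,
    runner=runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=0,
)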

The second error is related to the first one. Since you didn’t get any valid schedule during tuning, there’s no best schedule in your tuning log (only valid schedules are considered).

Finally, the time reported by TVM is accurate, because we measure the latency in C++ and it is not affected by the FFI. Besides, the auto-scheduling process won't generate synthetic matrices to collect hardware performance data; it directly uses the tuning log as the training data set for the cost model.
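
To illustrate that last point, a rough sketch (reusing the task, tune_option and log_file from your script) of how the cost model can be trained from an existing tuning log, e.g. when resuming a search:

cost_model = auto_scheduler.XGBModel()
cost_model.update_from_file(log_file)  # train the cost model on the records collected so far
search_policy = auto_scheduler.SketchPolicy(task, program_cost_model=cost_model)
task.tune(tune_option, search_policy)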


Thanks! The code works for a smaller matrix (e.g. M=N=K=986), so I think the flow works.

Could you please give a link to a tutorial or instructions on setting the timeout parameter? I checked the documentation for the functions I used, but I could not find a way to set this parameter. The only function related to a timeout that I found is LocalBuilder, but I am not sure how it should be used.

Also, I have some follow-up questions:

For the CPU target, what flag(s) should I pass to get the best performance? Where can I find all supported flags? Here is what I tried:

(1) My processor supports AVX512. So, I tried to pass both skylake-avx512 and avx2 (e.g. "llvm -mcpu=skylake-avx512,avx2"), but TVM generates an error message. Does that mean TVM only supports one option at a time? If that is the case, I think passing llvm -mcpu=skylake-avx512 will generally give better performance than llvm -mcpu=avx2.

(2) Also, I looked up the doc link here. It says -mcpu= will automatically detect the architecture. So, will target = tvm.target.Target("llvm") give better performance than llvm -mcpu=skylake-avx512? I tried setting target = tvm.target.Target("llvm") and checking the mcpu property with print(target.mcpu), but it prints nothing. This result seems to contradict what the documentation says.

Based on what I tried, does that mean llvm -mcpu=skylake-avx512 will give the best performance on my CPU?

For the hardware performance data: based on my understanding, the search algorithm proposes new schedules, and TVM still needs to run those new schedules on the actual hardware; those measurements become the tuning log that is used to build the cost model and guide later search. So, I am trying to confirm: the tuning log / training data collection still needs to generate some synthetic matrices to collect performance data. Is that correct?

For the GPU runtime measurement reported by TVM, it does not include the memory transfer time between the host and the device. That is, the time reported is purely execution time on the GPU, assuming the data is already in GPU memory. Am I correct?

It is a long post, so I really appreciate your time!

(1) I don't think you need both. If your CPU supports AVX512, then you should definitely just use it alone.

(2) I'm not sure how that works. Personally, I always specify the most accurate target I can. llvm -mcpu=skylake-avx512 is the one I use a lot for AWS c5 instances.

(3) Your understanding of the search algorithm is correct. However, as I said, we DON'T need to generate synthetic matrices. If there are not enough tuning logs available to train a cost model, we just randomly select candidates.

(4) Correct. Since all tensors are already in GPU DRAM before kicking off the execution, it's unnecessary to include the memory transfer time between host and GPU.
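
A rough sketch of what that means in code (assuming func, a_np, b_np and out_np from your earlier script, with func built for a CUDA target): the host-to-device copies happen when the arrays are created and are not part of the measured time.

dev = tvm.cuda(0)
a_tvm = tvm.nd.array(a_np, device=dev)  # H2D copy happens here, outside the measurement
b_tvm = tvm.nd.array(b_np, device=dev)
out_tvm = tvm.nd.empty(out_np.shape, device=dev)
evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
print("GPU execution time: %.3f ms" % (np.median(evaluator(a_tvm, b_tvm, out_tvm).results) * 1000))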

Thanks!

For (3), I understand that the search process does not need to generate synthetic matrices. I think I may not have expressed my question clearly, so let me re-phrase it: how is the latency measured for a given new schedule (one entry in the tuning log), as opposed to the search process itself? In my understanding, to measure a schedule we need to run the actual code generated from it on some data. So, during the measurement, I think it needs to generate some synthetic matrices based on the matrix dimensions in order to execute the code. Is that correct?

Thanks for your time!

OK, then it is just a terminology mismatch. Yes, in order to evaluate a schedule candidate, we lower the compute along with the schedule, generate and compile the binary code, and measure the execution latency using randomly generated synthetic data. In the case of tuning sparse workloads with data-sensitive performance, we allow users to provide input data for the measurements.

Thanks!

You mentioned "In the case of tuning sparse workloads with data-sensitive performance, we allow users to provide input data for measurements." Could you please provide a link/tutorial on how I can supply particular data for the measurements?

https://tvm.apache.org/docs/how_to/tune_with_autoscheduler/tune_sparse_x86.html#special-step-for-sparse-workload

Thank you so much! I will post follow-up if I encounter bugs in the future.

Hi, I am trying to run auto-scheduling for sparse-dense matrix multiplication. However, I run into trouble and the following errors.

In general, I follow the tutorial you mentioned. I try to supply the sparse matrix data for measurement during the tuning. I use the CSR format to represent the sparse matrix and to define the computation rule (not the BSR format used in the tutorial). On the other hand, I don't specify a custom sketch rule. In my opinion, the only significant difference between my implementation and the tutorial is that missing custom sketch rule, so I am speculating that it could be the source of the error.

So, here are my questions:

(1) How should I solve this error? Is it related to the missing custom sketch? If yes, does that mean the only study material on custom sketches is the tutorial you mentioned?

(2) Given that a custom sketch rule is hard to write well and sparse matrix multiplication is an important, common operator, I am wondering whether it is possible to do auto-scheduling for sparse matrix multiplication without a custom sketch rule. That is, the only differences between sparse-dense matrix multiplication and dense GEMM would be the compute definition and the data format we supply, while the overall flow stays similar to this tutorial.

(3) TOPI is used in the tutorial, so I also used it in the compute rule definition. However, based on your answer to this Q&A, I thought TOPI was used as templates for AutoTVM to tune the schedule, rather than in auto-scheduling. So, can I think of TOPI as aiming to reduce the effort of writing low-level tensor expressions, while we still need to write the AutoTVM/auto-scheduler code to start the search process?

I attached my code and error message below.

Here is my code:

import tvm
import os
import numpy as np
import tvm.testing
from tvm import te, auto_scheduler, runtime, topi
from tvm.auto_scheduler import _ffi_api
from tvm.topi.utils import get_const_tuple
from tvm.topi.sparse.utils import random_bsr_matrix
import scipy.sparse as sp

# define the compute rule
@auto_scheduler.register_workload  # Note the auto_scheduler decorator
def sparse_dense_gemm(M, K, N, w_data_shape, w_indices_shape, w_indptr_shape, dtype):
    X = te.placeholder((M, K), name="X", dtype=dtype)
    W_data = te.placeholder(shape=w_data_shape, dtype=dtype)
    W_indices = te.placeholder(shape=w_indices_shape, dtype="int32")
    W_indptr = te.placeholder(shape=w_indptr_shape, dtype="int32")
    out = topi.nn.sparse_dense(X, W_data, W_indices, W_indptr)
    return [X, W_data, W_indices, W_indptr, out]

M = 986
N = 986
K = 986
sparsity = 0.8
target = tvm.target.Target("llvm -mcpu=skylake-avx512")
X_np = np.random.randn(M, K).astype("float32")
W_sp_np = sp.random(N, K, density=sparsity, format="csr", dtype="float32")
W_np = W_sp_np.todense()
Y_np = X_np @ W_np.T  # reference result of the matrix multiplication

prefix = "sparse_dense_csr_%d_%d_%d_%d_" % (
    N,
    K,
    W_sp_np.indices.shape[0],
    W_sp_np.indptr.shape[0],
)
task = tvm.auto_scheduler.SearchTask(
    func=sparse_dense_gemm,
    args=(M, K, N, W_sp_np.data.shape, W_sp_np.indices.shape, W_sp_np.indptr.shape, "float32"),
    target=target,
    task_inputs={
        prefix + "W_data": runtime.ndarray.array(W_sp_np.data),
        prefix + "W_indices": runtime.ndarray.array(W_sp_np.indices),
        prefix + "W_indptr": runtime.ndarray.array(W_sp_np.indptr),
    },
    task_inputs_save_to_file=True,
)

# Inspect the computational graph
print("Computational DAG:")
print(task.compute_dag)
log_file="test.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=1000,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=0,
    runner=auto_scheduler.LocalRunner(timeout=100),
)
# Run auto-tuning (search)
task.tune(tune_option)
# Apply the best schedule
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)
dev = tvm.cpu()

X_tvm = tvm.nd.array(X_np, device=dev)
W_data_tvm = tvm.nd.array(W_sp_np.data, device=dev)
W_indices_tvm = tvm.nd.array(W_sp_np.indices, device=dev)
W_indptr_tvm = tvm.nd.array(W_sp_np.indptr, device=dev)
Y_tvm = tvm.nd.empty(Y_np.shape, device=dev)

func(X_tvm, W_data_tvm, W_indices_tvm, W_indptr_tvm, Y_tvm)

# Check results
np.testing.assert_allclose(Y_np, Y_tvm.numpy(), atol=1e-4, rtol=1e-4)
# print("check result done")
# Evaluate execution time.
evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
print(
    "Execution time of this operator: %.3f ms"
    % (np.median(evaluator(X_tvm, W_data_tvm, W_indices_tvm, W_indptr_tvm, Y_tvm).results) * 1000)
)

Here is the error message:

Computational DAG:
placeholder = PLACEHOLDER [987]
placeholder = PLACEHOLDER [777756]
placeholder = PLACEHOLDER [777756]
X = PLACEHOLDER [986, 986]
compute(i, row) += (placeholder[(placeholder[row] + elem_idx)]*X[i, placeholder[(placeholder[row] + elem_idx)]])

----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Generate Sketches		#s: 3
Traceback (most recent call last):
  File "auto_scheduler_sparse_gemm_cpu.py", line 173, in <module>
    task.tune(tune_option)
File "/home/Software/tvm/python/tvm/auto_scheduler/search_task.py", line 498, in tune
    _ffi_api.AutoSchedule(search_policy, tuning_options)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 323, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 257, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 246, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 163, in tvm._ffi._cy3.core.CALL
tvm._ffi.base.TVMError: Traceback (most recent call last):
 
  6: TVMFuncCall
  5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Array<tvm::runtime::ObjectRef, void> (tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)>::AssignTypedLambda<tvm::auto_scheduler::{lambda(tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)#3}>(tvm::auto_scheduler::{lambda(tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)#3}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
  4: tvm::auto_scheduler::AutoSchedule(tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)
  3: tvm::auto_scheduler::SketchPolicyNode::Search(int, int, int, tvm::auto_scheduler::ProgramMeasurer)
  2: tvm::auto_scheduler::SketchPolicyNode::SearchOneRound(int, tvm::runtime::Array<tvm::auto_scheduler::State, void>*)
  1: tvm::auto_scheduler::SketchPolicyNode::SampleInitPopulation(tvm::runtime::Array<tvm::auto_scheduler::State, void> const&)
  0: tvm::support::parallel_for(int, int, std::function<void (int)> const&, int, std::function<std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > (int, int, int, int)>)
  10: 0xffffffffffffffff
  9: __clone
  8: start_thread
        at /build/glibc-uZu3wS/glibc-2.27/nptl/pthread_create.c:463
  7: 0x00007ff2ed6196de
  6: std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::packaged_task<void (std::vector<int, std::allocator<int> > const&, std::function<void (int)> const&)>, std::vector<int, std::allocator<int> >, std::function<void (int)> > > >::_M_run()
  5: void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)
  4: __pthread_once_slow
        at /build/glibc-uZu3wS/glibc-2.27/nptl/pthread_once.c:116
  3: std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)
  2: _ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultIvEES3_EZNS1_11_Task_stateIZN3tvm7support12parallel_forEiiRKSt8functionIFviEEiSD_IFSt6vectorISI_IiSaIiEESaISK_EEiiiiEEEUlRKSK_SH_E_SJ_FvSQ_SH_EE6_M_runESQ_SH_EUlvE_vEEE9_M_in
  1: std::_Function_handler<void (int), tvm::auto_scheduler::SketchPolicyNode::SampleInitPopulation(tvm::runtime::Array<tvm::auto_scheduler::State, void> const&)::{lambda(int)#1}>::_M_invoke(std::_Any_data const&, int&&)
  0: tvm::auto_scheduler::InitFillTileSize::Apply(tvm::auto_scheduler::SketchPolicyNode*, tvm::auto_scheduler::State*, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>*) const

File "/home/Software/tvm/src/support/parallel_for.cc", line 92
TVMError: Parallel_for error with [14:39:10] /home/Software/tvm/src/auto_scheduler/search_policy/sketch_policy_rules.cc:515:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (ps->extent) is false:

I also tried to define the search policy as follows, but it does not work either. The error message is the same as above.

search_policy = auto_scheduler.SketchPolicy(
    task,
    program_cost_model=auto_scheduler.XGBModel(),
)
task.tune(tune_option, search_policy)

Thanks for your time!