TVMError: Check failed: (status == cutlass::Status::kSuccess) is false

Hi, when I use the example in I get the following error:

INFO:topi:dense pad_to_tensorcore, extra_flops_ratio 0.002197802197802198
INFO:te_compiler:Using injective.cuda for nn.pad based on highest priority (10)
WARNING:autotvm:One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
INFO:te_compiler:Using dense_tensorcore.cuda for nn.dense based on highest priority (20)
INFO:te_compiler:Using injective.cuda for strided_slice based on highest priority (10)
INFO:cutlass:Picked the first kernel found cutlass_tensorop_h1688gemm_256x128_32x2_tn_align4
Traceback (most recent call last):
  File "", line 542, in <module>
  File "", line 301, in test_dense
    verify_dense(get_dense(M, N, K), M, N, K)
  File "", line 248, in verify_dense
    out = get_output(rt_mod, ["data"], [x])
  File "", line 62, in get_output
  File "/data/tvm-0.9/python/tvm/contrib/", line 207, in run
  File "/data/tvm-0.9/python/tvm/_ffi/_ctypes/", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  6: TVMFuncCall
  5: tvm::runtime::GraphExecutor::Run()
  4: _ZNSt17_Function_handlerIFvvEZN3tvm7runtime13GraphExecutor11CreateTVMOpERKNS2_10TVMOpParamERKSt6vectorI8DLTensorSaIS8_EEEUlvE1
  3: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  2: tvmgen_default_cutlass_main
  1: _Z38tvmgen_defau
  0: tvmgen_default_cutlass_main_0_(DLTensor*, DLTensor*, DLTensor*)
  File "./tmp/", line 81
TVMError: Check failed: (status == cutlass::Status::kSuccess) is false:

TVM: f94ea8, CUDA: 11.0, Python: 3.7.7

Can you help me see what the problem is? @masahi, thanks.

What is your GPU? Have you modified

This shouldn’t happen. I regularly run all the tests on an RTX 3070, and they all pass.

My GPU is a T4. I only ran the test_dense function; the rest is unmodified.

I also tried the BERT example from and it doesn’t work either. With the BERT example in this project, no .cu files were generated in tmp. A closer look shows that when partition_for_cutlass is applied to the BERT model, no functions starting with cutlass are generated. When I remove check_gemm from partition_for_cutlass in python/tvm/relay/op/contrib/, functions starting with ‘cutlass’ are generated, but the model still can’t run.

Sorry, I’ve never tested on Turing cards. All of the tests and code in my repo use a hard-coded sm = 80 (for example and Have you tried sm = 75?
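For anyone hitting the same mismatch: the sm value is the GPU's compute capability with the dot removed (a T4 is 7.5, so sm = 75; an A100 is 8.0, so sm = 80). Below is a minimal sketch of that conversion. `sm_from_compute_capability` is a hypothetical helper name, not part of TVM or CUTLASS, and where the capability string comes from is left to the caller (as far as I know, `tvm.cuda(0).compute_version` returns it as a string, but treat that as an assumption to verify on your build).

```python
def sm_from_compute_capability(cap: str) -> int:
    """Convert a compute capability string such as "7.5" into an sm number such as 75.

    Hypothetical helper for illustration only; not a TVM or CUTLASS API.
    """
    major, minor = cap.strip().split(".")
    return int(major) * 10 + int(minor)


# Known compute capabilities: T4 is 7.5, A100 is 8.0.
print(sm_from_compute_capability("7.5"))  # 75
print(sm_from_compute_capability("8.0"))  # 80
```

Deriving sm this way instead of hard-coding it avoids picking kernels built for a different architecture. Whether a given TVM version accepts every resulting value is something to check separately.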

Thanks, after I changed sm to 75 the problem went away. But there is still a problem with the BERT model; the error log is as follows:

  %1764 = reshape(%1763, newshape=[1, 128, 1024]) /* ty=Tensor[(1, 128, 1024), float32] */;
  %1765 = nn.bias_add(%1764, %model.bert.encoder.layer.23.output.dense.bias, axis=-1) /* ty=Tensor[(1, 128, 1024), float32] */;
  %1766 = nn.dropout(%1765, rate=0.1f) /* ty=(Tensor[(1, 128, 1024), float32], Tensor[(1, 128, 1024), float32]) */;
  %1767 = %1766.0;
  %1768 = add(%1767, %1748) /* ty=Tensor[(1, 128, 1024), float32] */;
  %1769 = nn.layer_norm(%1768, %model.bert.encoder.layer.23.output.LayerNorm.weight, %model.bert.encoder.layer.23.output.LayerNorm.bias, epsilon=1e-12f) /* ty=Tensor[(1, 128, 1024), float32] */;
  %1770 = take(%1769, 0 /* ty=int32 */, axis=1, mode="wrap") /* ty=Tensor[(1, 1024), float32] */;
  %1771 = nn.dense(%1770, %model.bert.pooler.dense.weight, units=None) /* ty=Tensor[(1, 1024), float32] */;
  %1772 = nn.bias_add(%1771, %model.bert.pooler.dense.bias, axis=-1) /* ty=Tensor[(1, 1024), float32] */;
  %1773 = tanh(%1772) /* ty=Tensor[(1, 1024), float32] */;
  %1774 = nn.dropout(%1773, rate=0.1f) /* ty=(Tensor[(1, 1024), float32], Tensor[(1, 1024), float32]) */;
  %1775 = %1774.0;
  %1776 = nn.dense(%1775, %model.classifier.weight, units=None) /* ty=Tensor[(1, 2), float32] */;
  nn.bias_add(%1776, %model.classifier.bias, axis=-1) /* ty=Tensor[(1, 2), float32] */

One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Traceback (most recent call last):
  File "", line 35, in <module>
    assert num_partition > 0

I suspect a problem with partition_for_cutlass.

It looks like all nn.dense ops are done in fp32. We only support fp16 for running them on tensor cores.
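One quick way to confirm this from a printed Relay module is to scan the dump for nn.dense lines and read off the result dtype. A rough sketch, assuming the text format shown in the log above; `dense_dtypes` is a hypothetical helper written for this post, not a TVM API:

```python
import re

# Matches a printed nn.dense call followed by its result type annotation,
# e.g. "nn.dense(...) /* ty=Tensor[(1, 1024), float32] */".
DENSE_RE = re.compile(
    r"nn\.dense\(.*?/\* ty=Tensor\[\((?P<shape>[^)]*)\), (?P<dtype>\w+)\] \*/"
)


def dense_dtypes(relay_text: str) -> list:
    """Return a (shape, dtype) pair for every nn.dense line in a Relay text dump."""
    return [(m.group("shape"), m.group("dtype")) for m in DENSE_RE.finditer(relay_text)]


# Three lines copied from the log above; only the two nn.dense lines should match.
sample = """
  %1771 = nn.dense(%1770, %model.bert.pooler.dense.weight, units=None) /* ty=Tensor[(1, 1024), float32] */;
  %1772 = nn.bias_add(%1771, %model.bert.pooler.dense.bias, axis=-1) /* ty=Tensor[(1, 1024), float32] */;
  %1776 = nn.dense(%1775, %model.classifier.weight, units=None) /* ty=Tensor[(1, 2), float32] */;
"""

for shape, dtype in dense_dtypes(sample):
    print(shape, dtype)
```

If every entry comes back as float32, as it does for the dump in this thread, none of the dense ops qualify for the fp16 path.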

When you ran, make sure you have

Thanks, the problem is solved. :grin:


Hi @masahi: I also tested tvm-cutlass-eval /resnet50 / on my RTX 3070. However, I get a similar error:

Traceback (most recent call last):
  File "", line 94, in
    rt_mod, dev, num_partition = profile_and_build(mod, params, sm, tmp_dir="…/maskrcnn/tmp", lib_path="", precompiled=False)
  File "", line 39, in profile_and_build
    rt_mod = tvm.contrib.graph_executor.GraphModule(lib"default")
  File "/headless/Desktop/tvm/python/tvm/_ffi/_ctypes/", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  5: TVMFuncCall
  4: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::GraphExecutorFactory::GetFunction(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, tvm::runtime::ObjectPtrtvm::runtime::Object const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  3: tvm::runtime::GraphExecutorFactory::ExecutorCreate(std::vector<DLDevice, std::allocator > const&)
  2: tvm::runtime::GraphExecutor::Init(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, tvm::runtime::Module, std::vector<DLDevice, std::allocator > const&, tvm::runtime::PackedFunc)
  1: tvm::runtime::GraphExecutor::SetupOpExecs()
  0: tvm::runtime::GraphExecutor::CreateTVMOp(tvm::runtime::TVMOpParam const&, std::vector<DLTensor, std::allocator > const&)
  File "/headless/Desktop/tvm/src/runtime/graph_executor/", line 529
TVMError:

An error occurred during the execution of TVM. For more information, please see:

Check failed: (pf != nullptr) is false: no such function in module: tvmgen_default_cutlass_main_0

Any advice? Thanks in advance!