TVMError: Check failed: (status == cutlass::Status::kSuccess) is false

Hi, when I run the example in test_cutlass.py I get the following error:

INFO:topi:dense pad_to_tensorcore, extra_flops_ratio 0.002197802197802198
INFO:te_compiler:Using injective.cuda for nn.pad based on highest priority (10)
WARNING:autotvm:One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
INFO:te_compiler:Using dense_tensorcore.cuda for nn.dense based on highest priority (20)
INFO:te_compiler:Using injective.cuda for strided_slice based on highest priority (10)
INFO:cutlass:Picked the first kernel found cutlass_tensorop_h1688gemm_256x128_32x2_tn_align4
Traceback (most recent call last):
  File "test_cutlass.py", line 542, in <module>
    test_dense()
  File "test_cutlass.py", line 301, in test_dense
    verify_dense(get_dense(M, N, K), M, N, K)
  File "test_cutlass.py", line 248, in verify_dense
    out = get_output(rt_mod, ["data"], [x])
  File "test_cutlass.py", line 62, in get_output
    rt_mod.run()
  File "/data/tvm-0.9/python/tvm/contrib/graph_executor.py", line 207, in run
    self._run()
  File "/data/tvm-0.9/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  6: TVMFuncCall
  5: tvm::runtime::GraphExecutor::Run()
  4: _ZNSt17_Function_handlerIFvvEZN3tvm7runtime13GraphExecutor11CreateTVMOpERKNS2_10TVMOpParamERKSt6vectorI8DLTensorSaIS8_EEEUlvE1
  3: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  2: tvmgen_default_cutlass_main
  1: _Z38tvmgen_defau
  0: tvmgen_default_cutlass_main_0_(DLTensor*, DLTensor*, DLTensor*)
  File "./tmp/lib0.cu", line 81
TVMError: Check failed: (status == cutlass::Status::kSuccess) is false:

tvm: f94ea8, cuda: 11.0, python: 3.7.7

Can you help me see what the problem is? @masahi, thanks!

What is your GPU? Have you modified test_cutlass.py?

This shouldn’t happen. I regularly run all the tests on an RTX 3070 and they all pass.

My GPU is a T4. I only ran the test_dense function; the rest is unmodified.

I also tried the BERT example from https://github.com/masahi/tvm-cutlass-eval/ and it doesn’t work either. When I ran the BERT example in that project, I found that no .cu files were generated in tmp. A closer look reveals that when partition_for_cutlass is applied to the BERT model, no functions starting with “cutlass” are generated. When I remove check_gemm from partition_for_cutlass in python/tvm/relay/op/contrib/cutlass.py, functions starting with “cutlass” are generated, but the model still can’t run.

Sorry, I’ve never tested on Turing cards. All of the tests and code in my repo use a hard-coded sm = 80 (for example https://github.com/apache/tvm/blob/e7f36487dfdb6c4b7b544be155d3869002d7281b/tests/python/contrib/test_cutlass.py#L213 and https://github.com/masahi/tvm-cutlass-eval/blob/master/bert/cutlass.py#L33). Have you tried sm = 75?
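For readers hitting the same error: the sm value has to match the GPU’s compute capability, or the generated CUTLASS kernels fail at runtime with `Status::kSuccess` checks like the one above. Below is a minimal, hypothetical helper (not part of TVM or the eval repo) sketching how the compute capability maps to the sm value you would pass instead of the hard-coded 80:

```python
# Hypothetical helper (not a TVM API): map a CUDA compute capability
# (major, minor) to the "sm" architecture number used when generating
# CUTLASS kernels. Kernels built for sm = 80 will not run on a T4.
def pick_cutlass_sm(major, minor):
    """Return a CUTLASS sm value for the given compute capability."""
    cc = major * 10 + minor
    if cc >= 80:
        return 80  # Ampere and newer, e.g. RTX 3070, A100
    if cc >= 75:
        return 75  # Turing, e.g. T4, RTX 2080
    raise ValueError(f"tensorcore GEMM needs sm >= 75, got sm_{cc}")

# A T4 is compute capability 7.5, so it needs sm = 75:
assert pick_cutlass_sm(7, 5) == 75
# An RTX 3070 (compute capability 8.6) uses the sm = 80 kernels:
assert pick_cutlass_sm(8, 6) == 80
```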

Thanks, after I changed sm to 75, test_cutlass.py passes with no problem. But there is still a problem with the BERT model; the error log is as follows:

...
  %1764 = reshape(%1763, newshape=[1, 128, 1024]) /* ty=Tensor[(1, 128, 1024), float32] */;
  %1765 = nn.bias_add(%1764, %model.bert.encoder.layer.23.output.dense.bias, axis=-1) /* ty=Tensor[(1, 128, 1024), float32] */;
  %1766 = nn.dropout(%1765, rate=0.1f) /* ty=(Tensor[(1, 128, 1024), float32], Tensor[(1, 128, 1024), float32]) */;
  %1767 = %1766.0;
  %1768 = add(%1767, %1748) /* ty=Tensor[(1, 128, 1024), float32] */;
  %1769 = nn.layer_norm(%1768, %model.bert.encoder.layer.23.output.LayerNorm.weight, %model.bert.encoder.layer.23.output.LayerNorm.bias, epsilon=1e-12f) /* ty=Tensor[(1, 128, 1024), float32] */;
  %1770 = take(%1769, 0 /* ty=int32 */, axis=1, mode="wrap") /* ty=Tensor[(1, 1024), float32] */;
  %1771 = nn.dense(%1770, %model.bert.pooler.dense.weight, units=None) /* ty=Tensor[(1, 1024), float32] */;
  %1772 = nn.bias_add(%1771, %model.bert.pooler.dense.bias, axis=-1) /* ty=Tensor[(1, 1024), float32] */;
  %1773 = tanh(%1772) /* ty=Tensor[(1, 1024), float32] */;
  %1774 = nn.dropout(%1773, rate=0.1f) /* ty=(Tensor[(1, 1024), float32], Tensor[(1, 1024), float32]) */;
  %1775 = %1774.0;
  %1776 = nn.dense(%1775, %model.classifier.weight, units=None) /* ty=Tensor[(1, 2), float32] */;
  nn.bias_add(%1776, %model.classifier.bias, axis=-1) /* ty=Tensor[(1, 2), float32] */
}

One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Traceback (most recent call last):
  File "cutlass.py", line 35, in <module>
    assert num_partition > 0
AssertionError

I suspect a problem with partition_for_cutlass.

Looks like all the nn.dense ops are in fp32. We only support fp16, since that is what is needed to run them on the tensor cores.
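This explains the `assert num_partition > 0` failure: if every GEMM is fp32, the partitioner offloads nothing and no “cutlass” functions are created. A hypothetical sketch of the kind of dtype gate involved (the real check, check_gemm, lives in python/tvm/relay/op/contrib/cutlass.py and inspects the Relay ops directly):

```python
# Hypothetical sketch (not the actual TVM code): CUTLASS tensorcore GEMMs
# are only offered for fp16 operands, so an fp32 nn.dense is never
# partitioned into a "cutlass" function and num_partition stays 0.
SUPPORTED_GEMM_DTYPES = {"float16"}

def gemm_is_offloadable(lhs_dtype, rhs_dtype):
    """True if a dense op with these operand dtypes can go to CUTLASS."""
    return (lhs_dtype in SUPPORTED_GEMM_DTYPES
            and rhs_dtype in SUPPORTED_GEMM_DTYPES)

assert not gemm_is_offloadable("float32", "float32")  # BERT exported in fp32
assert gemm_is_offloadable("float16", "float16")      # after casting to fp16
```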

When you ran export.py, make sure you have https://github.com/masahi/tvm-cutlass-eval/blob/master/bert/export.py#L38

Thanks, the problem is solved. :grin:


Hi @masahi: I also tested tvm-cutlass-eval/resnet50/run_int8.py on my RTX 3070. However, I get a similar error:

Traceback (most recent call last):
  File "run_int8.py", line 94, in <module>
    rt_mod, dev, num_partition = profile_and_build(mod, params, sm, tmp_dir="…/maskrcnn/tmp", lib_path="compile_resnet50_int8.so", precompiled=False)
  File "run_int8.py", line 39, in profile_and_build
    rt_mod = tvm.contrib.graph_executor.GraphModule(lib["default"](dev))
  File "/headless/Desktop/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  5: TVMFuncCall
  4: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::GraphExecutorFactory::GetFunction(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  3: tvm::runtime::GraphExecutorFactory::ExecutorCreate(std::vector<DLDevice, std::allocator > const&)
  2: tvm::runtime::GraphExecutor::Init(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, tvm::runtime::Module, std::vector<DLDevice, std::allocator > const&, tvm::runtime::PackedFunc)
  1: tvm::runtime::GraphExecutor::SetupOpExecs()
  0: tvm::runtime::GraphExecutor::CreateTVMOp(tvm::runtime::TVMOpParam const&, std::vector<DLTensor, std::allocator > const&)
  File "/headless/Desktop/tvm/src/runtime/graph_executor/graph_executor.cc", line 529
TVMError:

An error occurred during the execution of TVM. For more information, please see:

Check failed: (pf != nullptr) is false: no such function in module: tvmgen_default_cutlass_main_0

Any advice? Thanks in advance!

This may be related to not calling finalize_modules after building.
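The “no such function in module: tvmgen_default_cutlass_main_0” error matches that diagnosis: the generated CUTLASS .cu sources were never compiled and linked into the runtime module, so the symbol is missing when the GraphModule loads. A hedged sketch of the expected call order, assuming the `tvm.contrib.cutlass` helper names as of this TVM version (the function below only illustrates ordering and is not executed here):

```python
# Hedged sketch, not a verified implementation: the CUTLASS kernels emitted
# during tuning must be finalized (compiled and linked) into the built lib
# BEFORE creating a GraphModule, or tvmgen_default_cutlass_main_0 is missing.
def build_with_cutlass(mod, params, sm, tmp_dir, lib_path):
    # Imports kept inside the function so this sketch loads without TVM.
    import tvm
    from tvm.contrib.cutlass import tune_cutlass_kernels, finalize_modules

    # 1. Pick CUTLASS kernels for the partitioned "cutlass" functions.
    mod, num_partition = tune_cutlass_kernels(mod, sm, tmp_dir=tmp_dir)

    # 2. Build the Relay module as usual.
    with tvm.transform.PassContext(opt_level=3):
        lib = tvm.relay.build(mod, target="cuda", params=params)

    # 3. The step the traceback suggests was skipped: compile the generated
    #    .cu files and link them into the exported library.
    lib = finalize_modules(lib, lib_path, tmp_dir)

    dev = tvm.device("cuda", 0)
    return tvm.contrib.graph_executor.GraphModule(lib["default"](dev)), num_partition
```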