Performance regression of sparse BERT example

I’m looking at the sparse BERT example, and I’ve found that performance is now worse than expected on recent versions.

Running the tutorial Python script on my i7-8700 with the latest commit (683c5ebf), I get:

Dense Runtime:             326.39 ms           (95.03 ms)
Sparse Runtime:            476.07 ms           (107.11 ms)

This is a significant slowdown, and it is present even when I set gen_weights=True. I also see the same behaviour on an earlier v0.8 commit, 90fb6266030.
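For reference, the timing behind these numbers is essentially the following (a minimal sketch of the tutorial-style measurement; the target string and the random int64 inputs are assumptions):

```python
# A minimal sketch of the tutorial-style measurement. `mod`, `params`, and
# `shape_dict` come from the tutorial; the target string and the random
# int64 inputs are assumptions.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

target = "llvm -mcpu=core-avx2"
dev = tvm.cpu()

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

m = graph_executor.GraphModule(lib["default"](dev))
for name, shape in shape_dict.items():
    m.set_input(name, np.random.randint(0, 100, size=shape).astype("int64"))

# The quoted numbers are mean (std) over repeated runs.
ftimer = m.module.time_evaluator("run", dev, repeat=5, number=5)
res = np.array(ftimer().results)
print("Runtime: %.2f ms (%.2f ms)" % (res.mean() * 1e3, res.std() * 1e3))
```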

When I check out the v0.7 branch and compile and run the equivalent code for that version, I get:

Dense Runtime:             651.51 ms           (105.23 ms)
Sparse Runtime:            146.15 ms           (0.20 ms)

This is the kind of speedup I would hope for. I acknowledge that the dense time has improved in v0.8, but why has the sparse time gotten worse (even with gen_weights=True)?

When I try the v0.8 commit 683c5ebf on another machine (an Intel Xeon E5-2620), I see a slight improvement from sparsity, but still much less than expected:

Dense Runtime:             445.20 ms           (2.35 ms)
Sparse Runtime:            406.12 ms           (0.33 ms)

I will be stepping back through the commit history to understand why the effectiveness of sparsity has been reduced; in the meantime, does anyone have any ideas?

@jwfromm I believe you wrote the tutorial?

Update: some quick-and-dirty checks across the commit history:

Commit      Date        Dense (ms)   Sparse (ms)
683c5ebf    2021/07/09      329.40       470.13
90fb6266    2021/06/14      326.39       476.07
bd4b14d6    2021/06/01      345.01       315.59
cea7cf16    2021/05/03      332.42       368.18
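A throwaway script along these lines can drive such a sweep (hypothetical; the tutorial script name and build directory are placeholders):

```python
# Hypothetical driver for the sweep above: check out each commit, rebuild
# libtvm, and re-run the tutorial, keeping only its runtime lines.
# The script name and build directory are placeholders.
import subprocess

COMMITS = ["683c5ebf", "90fb6266", "bd4b14d6", "cea7cf16"]

for commit in COMMITS:
    subprocess.run(["git", "-C", "tvm", "checkout", commit], check=True)
    subprocess.run(["cmake", "--build", "tvm/build", "-j", "8"], check=True)
    out = subprocess.run(
        ["python3", "sparse_bert_tutorial.py"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if "Runtime" in line:
            print(commit, line.strip())
```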

Can you try tuning the operators? Performance on untuned operators can be all over the place.

Thanks, I have made a modified version of the tutorial script with AutoTVM support, available at this gist.
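In outline, the AutoTVM additions look like this (a sketch of the approach, not the exact gist contents; the tuner choice, trial count, and log path are illustrative):

```python
# Sketch of the AutoTVM flow in the modified script (not the exact gist
# contents; the tuner choice, trial count, and log path are illustrative).
import tvm
from tvm import autotvm, relay
from tvm.autotvm.tuner import XGBTuner

log_file = "sparse_bert_autotvm.log"
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1, timeout=10),
)
for task in tasks:
    tuner = XGBTuner(task, loss_type="rank")
    tuner.tune(
        n_trial=min(200, len(task.config_space)),
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file(log_file)],
    )

# Rebuild with the tuned schedules applied.
with autotvm.apply_history_best(log_file):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```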

However, it seems to fail at compile time with:

  File "/home/whest/proj/tvm-standalone-tests/tune_sparse_test.py", line 451, in tune_dense
    return autotune_and_evaluate(
  File "/home/whest/proj/tvm-standalone-tests/tune_sparse_test.py", line 308, in autotune_and_evaluate
    run_relay_graph_with_log(mod, params, shape_dict, target, dev, log_file)
  File "/home/whest/proj/tvm-standalone-tests/tune_sparse_test.py", line 345, in run_relay_graph_with_log
    lib = relay.build_module.build(mod, target=target, params=params)
  File "/home/whest/tools/tvm/python/tvm/relay/build_module.py", line 332, in build
    executor_config, runtime_mod, params = bld_mod.build(
  File "/home/whest/tools/tvm/python/tvm/relay/build_module.py", line 148, in build
    self._build(mod, target, target_host, executor)
  File "/home/whest/tools/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  37: TVMFuncCall
  36: _ZNSt17_Function_handlerIFvN3tvm7runtime7TVMArgsEPNS1_11
  35: tvm::relay::backend::RelayBuildModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#3}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
  34: tvm::relay::backend::RelayBuildModule::Build(tvm::IRModule, tvm::runtime::Map<tvm::Integer, tvm::Target, void, void> const&, tvm::Target const&, tvm::runtime::String)
  33: tvm::relay::backend::RelayBuildModule::BuildRelay(tvm::IRModule, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::NDArray, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::NDArray> > > const&)
  32: tvm::relay::backend::RelayBuildModule::Optimize(tvm::IRModule, tvm::runtime::Map<tvm::Integer, tvm::Target, void, void> const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::NDArray, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::NDArray> > > const&)
  31: tvm::transform::Pass::operator()(tvm::IRModule) const
  30: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  29: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  28: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  27: tvm::relay::transform::FunctionPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  26: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::relay::Function (tvm::relay::Function, tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::relay::transform::AlterOpLayout()::$_1>(tvm::relay::transform::AlterOpLayout()::$_1)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  25: tvm::relay::alter_op_layout::AlterOpLayout(tvm::RelayExpr const&)
  24: tvm::relay::ForwardRewrite(tvm::RelayExpr const&, tvm::runtime::TypedPackedFunc<tvm::RelayExpr (tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void> const&, tvm::runtime::ObjectRef const&)> const&, std::function<tvm::runtime::ObjectRef (tvm::relay::Call const&)>, std::function<tvm::RelayExpr (tvm::RelayExpr const&)>)
  23: tvm::relay::ForwardRewriter::Rewrite(tvm::RelayExpr const&)
  22: tvm::relay::MixedModeMutator::VisitExpr(tvm::RelayExpr const&)
  21: tvm::relay::MixedModeMutator::VisitLeaf(tvm::RelayExpr const&)
  20: _ZN3tvm5relay1
  19: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  18: tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&)
  17: tvm::NodeFunctor<tvm::RelayExpr (tvm::runtime::ObjectRef const&, tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>*)>::operator()(tvm::runtime::ObjectRef const&, tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>*) const
  16: _ZZN3tvm5relay11ExprFunc
  15: tvm::relay::ExprMutator::VisitExpr_(tvm::relay::FunctionNode const*)
  14: tvm::relay::MixedModeMutator::VisitExpr(tvm::RelayExpr const&)
  13: tvm::relay::MixedModeMutator::VisitLeaf(tvm::RelayExpr const&)
  12: _ZN3tvm5relay1
  11: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  10: tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&)
  9: tvm::NodeFunctor<tvm::RelayExpr (tvm::runtime::ObjectRef const&, tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>*)>::operator()(tvm::runtime::ObjectRef const&, tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>*) const
  8: _ZZN3tvm5relay11ExprFunc
  7: _ZN3tvm5relay1
  6: tvm::RelayExpr tvm::relay::MixedModeMutator::Rewrite<tvm::relay::CallNode>(tvm::relay::CallNode const*)
  5: tvm::relay::ForwardRewriter::Rewrite_(tvm::relay::CallNode const*, tvm::RelayExpr const&)
  4: tvm::runtime::TVMRetValue tvm::runtime::PackedFunc::operator()<tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void>&, tvm::runtime::ObjectRef>(tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void>&, tvm::runtime::ObjectRef&&) const
  3: tvm::runtime::TypedPackedFunc<tvm::RelayExpr (tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void> const&, tvm::runtime::ObjectRef const&)>::AssignTypedLambda<tvm::RelayExpr (*)(tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void> const&, tvm::runtime::ObjectRef const&)>(tvm::RelayExpr (*)(tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void> const&, tvm::runtime::ObjectRef const&))::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
  2: tvm::RelayExpr tvm::relay::LayoutRewriter<tvm::relay::alter_op_layout::AlterTransformMemorizer>(tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void> const&, tvm::runtime::ObjectRef const&)
  1: tvm::relay::alter_op_layout::AlterTransformMemorizer::CallWithNewLayouts(tvm::relay::Call const&, std::vector<tvm::RelayExpr, std::allocator<tvm::RelayExpr> > const&)
  0: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), TVMFuncCreateFromCFunc::$_2>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  File "/home/whest/tools/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/whest/tools/tvm/python/tvm/relay/op/nn/_nn.py", line 84, in alter_op_layout_dense
    return topi.nn.dense_alter_layout(attrs, inputs, tinfos, out_type)
  File "/home/whest/.virtualenvs/trans/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/whest/tools/tvm/python/tvm/target/generic_func.py", line 275, in dispatch_func
    return dispatch_dict[k](*args, **kwargs)
  File "/home/whest/tools/tvm/python/tvm/topi/x86/dense_alter_op.py", line 38, in _alter_dense_layout
    impl, outs = relay.backend.compile_engine.select_implementation(
  File "/home/whest/tools/tvm/python/tvm/relay/backend/compile_engine.py", line 219, in select_implementation
    outs = impl.compute(attrs, inputs, out_type)
  File "/home/whest/tools/tvm/python/tvm/relay/op/op.py", line 125, in compute
    return _OpImplementationCompute(self, attrs, inputs, out_type)
  File "/home/whest/tools/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
  3: TVMFuncCall
  2: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::relay::$_3>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  1: tvm::relay::OpImplementation::Compute(tvm::Attrs const&, tvm::runtime::Array<tvm::te::Tensor, void> const&, tvm::Type const&)
  0: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), TVMFuncCreateFromCFunc::$_2>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  File "/home/whest/tools/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/whest/tools/tvm/python/tvm/relay/op/strategy/generic.py", line 726, in _compute_dense
    return [topi_compute(*args)]
  File "/home/whest/tools/tvm/python/tvm/autotvm/task/topi_integration.py", line 164, in wrapper
    cfg = DispatchContext.current.query(tgt, workload)
  File "/home/whest/tools/tvm/python/tvm/autotvm/task/dispatcher.py", line 76, in query
    ret = self._query_inside(target, workload)
  File "/home/whest/tools/tvm/python/tvm/autotvm/task/dispatcher.py", line 421, in _query_inside
    assert wkl == workload
TVMError: AssertionError

I’m not sure about this specific error, but you should use Python 3.7 or 3.8; Python 3.9 is not supported.

Yes, thanks. I realise that this has been the source of a couple of my recent problems. I have fixed my frankendebian and have a new environment using Python 3.7.

However, the error still reproduces in this environment.

Though I don’t necessarily need to fix it, what I am trying to understand is why the default schedule for the sparse_dense operation is now so slow.

Looking at the code, it still seems to be using the ir_builder directly, as it did in v0.7. Could the way the default schedule is constructed have been compromised in recent updates (since handwritten and auto-tuned schedules are what is recommended)?

Your tuning script is not tuning all the operators; you need to make sure it is tuning sparse_dense. I’m not sure where your error is coming from, though.
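One quick way to check is to list the extracted tasks (a sketch, assuming the same mod, params, and target as the tutorial):

```python
# List which tasks AutoTVM actually extracted, and whether sparse_dense
# is among them.
from tvm import autotvm

tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)
for task in tasks:
    print(task.name, task.workload)
```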

The sparse operators on CPU are using TE + scheduling, not ir_builder. See x86.py (apache/tvm, main branch), which calls sparse.py.
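If you want to poke at the default schedule in isolation, something along these lines should work (a sketch; the shapes, block size, and density are illustrative, not the BERT ones):

```python
# A standalone benchmark of the default CPU sparse_dense schedule (a sketch;
# the shapes, block size, and density are illustrative, not the BERT ones).
import numpy as np
import scipy.sparse as sp
import tvm
from tvm import te, topi

M, N, K = 128, 3072, 768  # Y(M, N) = X(M, K) @ W(N, K)^T, W in BSR format
W = sp.random(N, K, density=0.05, format="csr", dtype="float32").tobsr((16, 1))

X = te.placeholder((M, K), dtype="float32", name="x")
W_data = te.placeholder(W.data.shape, dtype="float32", name="w_data")
W_indices = te.placeholder(W.indices.shape, dtype="int32", name="w_indices")
W_indptr = te.placeholder(W.indptr.shape, dtype="int32", name="w_indptr")

# topi.nn.sparse_dense is the TE compute; topi.x86.schedule_sparse_dense is
# the default x86 schedule being discussed.
Y = topi.nn.sparse_dense(X, W_data, W_indices, W_indptr)
s = topi.x86.schedule_sparse_dense([Y])
f = tvm.build(s, [X, W_data, W_indices, W_indptr, Y], target="llvm -mcpu=core-avx2")

dev = tvm.cpu()
args = [
    tvm.nd.array(np.random.rand(M, K).astype("float32"), dev),
    tvm.nd.array(W.data.astype("float32"), dev),
    tvm.nd.array(W.indices.astype("int32"), dev),
    tvm.nd.array(W.indptr.astype("int32"), dev),
    tvm.nd.empty((M, N), "float32", dev),
]
print(f.time_evaluator(f.entry_name, dev, number=20)(*args).mean * 1e3, "ms")
```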