Performance regression of sparse BERT example

I’m looking at the sparse BERT example, and I’ve found that performance is now worse than expected on recent versions.

Running the tutorial Python script on my i7-8700 with the latest commit (683c5ebf), I get:

Dense Runtime:             326.39 ms           (95.03 ms)
Sparse Runtime:            476.07 ms           (107.11 ms)

This is a significant slowdown, and it is present even when I set gen_weights=True. I also see the same behaviour on an earlier v0.8 commit, 90fb6266030.
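For reference, the timing behind these numbers is essentially the following (a minimal sketch of the tutorial-style measurement; the target string and the random int64 inputs are assumptions):

```python
# A minimal sketch of the tutorial-style measurement. `mod`, `params`, and
# `shape_dict` come from the tutorial; the target string and the random
# int64 inputs are assumptions.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

target = "llvm -mcpu=core-avx2"
dev = tvm.cpu()

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

m = graph_executor.GraphModule(lib["default"](dev))
for name, shape in shape_dict.items():
    m.set_input(name, np.random.randint(0, 100, size=shape).astype("int64"))

# The quoted numbers are mean (std) over repeated runs.
ftimer = m.module.time_evaluator("run", dev, repeat=5, number=5)
res = np.array(ftimer().results)
print("Runtime: %.2f ms (%.2f ms)" % (res.mean() * 1e3, res.std() * 1e3))
```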

When I check out the v0.7 branch and compile and run the equivalent code for that version, I get:

Dense Runtime:             651.51 ms           (105.23 ms)
Sparse Runtime:            146.15 ms           (0.20 ms)

This is the kind of speedup I would hope for. I acknowledge that the dense time has improved in v0.8, but why has the sparse time gotten worse (even with gen_weights=True)?

When I try the v0.8 commit 683c5ebf on another machine (an Intel Xeon E5-2620), I see a slight improvement from sparsity, but still much less than expected:

Dense Runtime:             445.20 ms           (2.35 ms)
Sparse Runtime:            406.12 ms           (0.33 ms)

I will be stepping back through the commit history to understand why the effectiveness of sparsity has been reduced; in the meantime, does anyone have any ideas?

@jwfromm I believe you wrote the tutorial?

Update: some quick-and-dirty checks across the commit history:

Commit      Date        Dense (ms)   Sparse (ms)
683c5ebf    2021/07/09      329.40       470.13
90fb6266    2021/06/14      326.39       476.07
bd4b14d6    2021/06/01      345.01       315.59
cea7cf16    2021/05/03      332.42       368.18
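A throwaway script along these lines can drive such a sweep (hypothetical; the tutorial script name and build directory are placeholders):

```python
# Hypothetical driver for the sweep above: check out each commit, rebuild
# libtvm, and re-run the tutorial, keeping only its runtime lines.
# The script name and build directory are placeholders.
import subprocess

COMMITS = ["683c5ebf", "90fb6266", "bd4b14d6", "cea7cf16"]

for commit in COMMITS:
    subprocess.run(["git", "-C", "tvm", "checkout", commit], check=True)
    subprocess.run(["cmake", "--build", "tvm/build", "-j", "8"], check=True)
    out = subprocess.run(
        ["python3", "sparse_bert_tutorial.py"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if "Runtime" in line:
            print(commit, line.strip())
```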

Can you try tuning the operators? Performance on untuned operators can be all over the place.

Thanks, I have made a modified version of the tutorial script with AutoTVM support, available at this gist.
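In outline, the AutoTVM additions look like this (a sketch of the approach, not the exact gist contents; the tuner choice, trial count, and log path are illustrative):

```python
# Sketch of the AutoTVM flow in the modified script (not the exact gist
# contents; the tuner choice, trial count, and log path are illustrative).
import tvm
from tvm import autotvm, relay
from tvm.autotvm.tuner import XGBTuner

log_file = "sparse_bert_autotvm.log"
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1, timeout=10),
)
for task in tasks:
    tuner = XGBTuner(task, loss_type="rank")
    tuner.tune(
        n_trial=min(200, len(task.config_space)),
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file(log_file)],
    )

# Rebuild with the tuned schedules applied.
with autotvm.apply_history_best(log_file):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```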

However, it seems to fail at compile time with:

  File "/home/whest/proj/tvm-standalone-tests/tune_sparse_test.py", line 451, in tune_dense
    return autotune_and_evaluate(
  File "/home/whest/proj/tvm-standalone-tests/tune_sparse_test.py", line 308, in autotune_and_evaluate
    run_relay_graph_with_log(mod, params, shape_dict, target, dev, log_file)
  File "/home/whest/proj/tvm-standalone-tests/tune_sparse_test.py", line 345, in run_relay_graph_with_log
    lib = relay.build_module.build(mod, target=target, params=params)
  File "/home/whest/tools/tvm/python/tvm/relay/build_module.py", line 332, in build
    executor_config, runtime_mod, params = bld_mod.build(
  File "/home/whest/tools/tvm/python/tvm/relay/build_module.py", line 148, in build
    self._build(mod, target, target_host, executor)
  File "/home/whest/tools/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  37: TVMFuncCall
  36: _ZNSt17_Function_handlerIFvN3tvm7runtime7TVMArgsEPNS1_11
  35: tvm::relay::backend::RelayBuildModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#3}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
  34: tvm::relay::backend::RelayBuildModule::Build(tvm::IRModule, tvm::runtime::Map<tvm::Integer, tvm::Target, void, void> const&, tvm::Target const&, tvm::runtime::String)
  33: tvm::relay::backend::RelayBuildModule::BuildRelay(tvm::IRModule, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::NDArray, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::NDArray> > > const&)
  32: tvm::relay::backend::RelayBuildModule::Optimize(tvm::IRModule, tvm::runtime::Map<tvm::Integer, tvm::Target, void, void> const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::NDArray, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::NDArray> > > const&)
  31: tvm::transform::Pass::operator()(tvm::IRModule) const
  30: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  29: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  28: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  27: tvm::relay::transform::FunctionPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  26: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::relay::Function (tvm::relay::Function, tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::relay::transform::AlterOpLayout()::$_1>(tvm::relay::transform::AlterOpLayout()::$_1)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  25: tvm::relay::alter_op_layout::AlterOpLayout(tvm::RelayExpr const&)
  24: tvm::relay::ForwardRewrite(tvm::RelayExpr const&, tvm::runtime::TypedPackedFunc<tvm::RelayExpr (tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void> const&, tvm::runtime::ObjectRef const&)> const&, std::function<tvm::runtime::ObjectRef (tvm::relay::Call const&)>, std::function<tvm::RelayExpr (tvm::RelayExpr const&)>)
  23: tvm::relay::ForwardRewriter::Rewrite(tvm::RelayExpr const&)
  22: tvm::relay::MixedModeMutator::VisitExpr(tvm::RelayExpr const&)
  21: tvm::relay::MixedModeMutator::VisitLeaf(tvm::RelayExpr const&)
  20: _ZN3tvm5relay1
  19: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  18: tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&)
  17: tvm::NodeFunctor<tvm::RelayExpr (tvm::runtime::ObjectRef const&, tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>*)>::operator()(tvm::runtime::ObjectRef const&, tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>*) const
  16: _ZZN3tvm5relay11ExprFunc
  15: tvm::relay::ExprMutator::VisitExpr_(tvm::relay::FunctionNode const*)
  14: tvm::relay::MixedModeMutator::VisitExpr(tvm::RelayExpr const&)
  13: tvm::relay::MixedModeMutator::VisitLeaf(tvm::RelayExpr const&)
  12: _ZN3tvm5relay1
  11: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  10: tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&)
  9: tvm::NodeFunctor<tvm::RelayExpr (tvm::runtime::ObjectRef const&, tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>*)>::operator()(tvm::runtime::ObjectRef const&, tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>*) const
  8: _ZZN3tvm5relay11ExprFunc
  7: _ZN3tvm5relay1
  6: tvm::RelayExpr tvm::relay::MixedModeMutator::Rewrite<tvm::relay::CallNode>(tvm::relay::CallNode const*)
  5: tvm::relay::ForwardRewriter::Rewrite_(tvm::relay::CallNode const*, tvm::RelayExpr const&)
  4: tvm::runtime::TVMRetValue tvm::runtime::PackedFunc::operator()<tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void>&, tvm::runtime::ObjectRef>(tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void>&, tvm::runtime::ObjectRef&&) const
  3: tvm::runtime::TypedPackedFunc<tvm::RelayExpr (tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void> const&, tvm::runtime::ObjectRef const&)>::AssignTypedLambda<tvm::RelayExpr (*)(tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void> const&, tvm::runtime::ObjectRef const&)>(tvm::RelayExpr (*)(tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void> const&, tvm::runtime::ObjectRef const&))::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
  2: tvm::RelayExpr tvm::relay::LayoutRewriter<tvm::relay::alter_op_layout::AlterTransformMemorizer>(tvm::relay::Call const&, tvm::runtime::Array<tvm::RelayExpr, void> const&, tvm::runtime::ObjectRef const&)
  1: tvm::relay::alter_op_layout::AlterTransformMemorizer::CallWithNewLayouts(tvm::relay::Call const&, std::vector<tvm::RelayExpr, std::allocator<tvm::RelayExpr> > const&)
  0: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), TVMFuncCreateFromCFunc::$_2>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  File "/home/whest/tools/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/whest/tools/tvm/python/tvm/relay/op/nn/_nn.py", line 84, in alter_op_layout_dense
    return topi.nn.dense_alter_layout(attrs, inputs, tinfos, out_type)
  File "/home/whest/.virtualenvs/trans/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/whest/tools/tvm/python/tvm/target/generic_func.py", line 275, in dispatch_func
    return dispatch_dict[k](*args, **kwargs)
  File "/home/whest/tools/tvm/python/tvm/topi/x86/dense_alter_op.py", line 38, in _alter_dense_layout
    impl, outs = relay.backend.compile_engine.select_implementation(
  File "/home/whest/tools/tvm/python/tvm/relay/backend/compile_engine.py", line 219, in select_implementation
    outs = impl.compute(attrs, inputs, out_type)
  File "/home/whest/tools/tvm/python/tvm/relay/op/op.py", line 125, in compute
    return _OpImplementationCompute(self, attrs, inputs, out_type)
  File "/home/whest/tools/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
  3: TVMFuncCall
  2: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::relay::$_3>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  1: tvm::relay::OpImplementation::Compute(tvm::Attrs const&, tvm::runtime::Array<tvm::te::Tensor, void> const&, tvm::Type const&)
  0: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), TVMFuncCreateFromCFunc::$_2>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  File "/home/whest/tools/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 81, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/whest/tools/tvm/python/tvm/relay/op/strategy/generic.py", line 726, in _compute_dense
    return [topi_compute(*args)]
  File "/home/whest/tools/tvm/python/tvm/autotvm/task/topi_integration.py", line 164, in wrapper
    cfg = DispatchContext.current.query(tgt, workload)
  File "/home/whest/tools/tvm/python/tvm/autotvm/task/dispatcher.py", line 76, in query
    ret = self._query_inside(target, workload)
  File "/home/whest/tools/tvm/python/tvm/autotvm/task/dispatcher.py", line 421, in _query_inside
    assert wkl == workload
TVMError: AssertionError

I’m not sure about this specific error, but you should use Python 3.7 or 3.8; Python 3.9 is not supported.

Yes, thanks. I realise that this has been the source of a couple of my recent problems. I have fixed my frankendebian and have a new environment using Python 3.7.

However, the error still reproduces in this environment.

Though I don’t necessarily need to fix it, what I am trying to understand is why the default schedule for the sparse_dense operation is now so slow.

Looking at the code, it still seems to be using the ir_builder directly, as it did in v0.7. Could the way the default schedule is constructed have been compromised in recent updates (since handwritten and auto-tuned schedules are what is recommended)?

Your tuning script is not tuning all the operators; you need to make sure it is tuning sparse_dense. I’m not sure where your error is coming from, though.
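One quick way to check is to list the extracted tasks (a sketch, assuming the same mod, params, and target as the tutorial):

```python
# List which tasks AutoTVM actually extracted, and whether sparse_dense
# is among them.
from tvm import autotvm

tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)
for task in tasks:
    print(task.name, task.workload)
```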

The sparse operators on CPU are using TE + scheduling, not ir_builder. See x86.py (apache/tvm, main branch), which calls sparse.py.
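If you want to poke at the default schedule in isolation, something along these lines should work (a sketch; the shapes, block size, and density are illustrative, not the BERT ones):

```python
# A standalone benchmark of the default CPU sparse_dense schedule (a sketch;
# the shapes, block size, and density are illustrative, not the BERT ones).
import numpy as np
import scipy.sparse as sp
import tvm
from tvm import te, topi

M, N, K = 128, 3072, 768  # Y(M, N) = X(M, K) @ W(N, K)^T, W in BSR format
W = sp.random(N, K, density=0.05, format="csr", dtype="float32").tobsr((16, 1))

X = te.placeholder((M, K), dtype="float32", name="x")
W_data = te.placeholder(W.data.shape, dtype="float32", name="w_data")
W_indices = te.placeholder(W.indices.shape, dtype="int32", name="w_indices")
W_indptr = te.placeholder(W.indptr.shape, dtype="int32", name="w_indptr")

# topi.nn.sparse_dense is the TE compute; topi.x86.schedule_sparse_dense is
# the default x86 schedule being discussed.
Y = topi.nn.sparse_dense(X, W_data, W_indices, W_indptr)
s = topi.x86.schedule_sparse_dense([Y])
f = tvm.build(s, [X, W_data, W_indices, W_indptr, Y], target="llvm -mcpu=core-avx2")

dev = tvm.cpu()
args = [
    tvm.nd.array(np.random.rand(M, K).astype("float32"), dev),
    tvm.nd.array(W.data.astype("float32"), dev),
    tvm.nd.array(W.indices.astype("int32"), dev),
    tvm.nd.array(W.indptr.astype("int32"), dev),
    tvm.nd.empty((M, N), "float32", dev),
]
print(f.time_evaluator(f.entry_name, dev, number=20)(*args).mean * 1e3, "ms")
```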