How to prevent an op from being fused?

Hi, I ran into a performance issue with a fused op. tvmgen_default_fused_subtract_exp() runs 400 times and costs 6071 ms, which is bad. How can I prevent the exp() op from being fused with other ops? I replaced exp() with fast_exp(), but it had no effect.

Any suggestion? Thanks.

Got the answer: set TOpPattern to kOpaque. Please correct me if I'm wrong.

.set_attr<TOpPattern>("TOpPattern", kOpaque)

When I modified the TOpPattern of the exp() op, it aborts with an error:

terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [15:46:05] /data/tvm-0.7/src/ir/../node/attr_registry.h:111: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (p.second != plevel) is false: Attribute TOpPattern of exp is already registered with same plevel=10

code:
  void UpdateAttr(const String& attr_name, const KeyType& key, runtime::TVMRetValue value,
                  int plevel) {
...
    std::pair<TVMRetValue, int>& p = op_map->data_[index];
    ICHECK(p.second != plevel) << "Attribute " << attr_name << " of " << key->AttrRegistryName()
                               << " is already registered with same plevel=" << plevel;

It seems we can't modify the TOpPattern of a built-in op, right? What should I do if I want exp() to stay unfused?

Hi, @wenxian.

Set TOpPattern to kOpaque. Please correct me if I'm wrong.

.set_attr<TOpPattern>("TOpPattern", kOpaque)

I haven’t tried this yet, but I would try the same approach.

When I modified the TOpPattern of the exp() op, it aborts with an error:

I can't be certain without looking at the exact code, but it seems like the attribute for the exp() op is already registered somewhere at priority 10. When you register your operator with kOpaque, it might be good to try assigning a higher priority (plevel).
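The check in attr_registry.h quoted above can be modeled in plain Python to show why re-registering at the same plevel fails while a higher one succeeds (a toy sketch, not TVM code):

```python
class AttrRegistry:
    """Toy model of TVM's per-op attribute table of (value, plevel) pairs."""

    def __init__(self):
        self.data = {}  # attr_name -> (value, plevel)

    def update_attr(self, attr_name, value, plevel):
        _, old_plevel = self.data.get(attr_name, (None, 0))
        # Mirrors ICHECK(p.second != plevel): same priority is an error.
        if old_plevel == plevel:
            raise RuntimeError(
                f"Attribute {attr_name} is already registered "
                f"with same plevel={plevel}")
        # Only a strictly higher plevel overrides the stored value.
        if plevel > old_plevel:
            self.data[attr_name] = (value, plevel)

reg = AttrRegistry()
reg.update_attr("TOpPattern", "kElemWise", 10)  # built-in registration
reg.update_attr("TOpPattern", "kOpaque", 11)    # override at level 11 succeeds
```

Registering "kOpaque" a second time at plevel=10 would raise, which matches the InternalError in the log above.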

Will it actually help to keep exp from being fused with other operators? If you have some trick to accelerate the exp op individually, then avoiding fusion would be a good choice. But if exp isn't fused with other ops, it is still in the network and will cost the same time. So I don't get the point: why are you trying to avoid exp fusion? Thanks for clarifying my confusion.

Thanks for the answers.

I found where the attribute for the exp() op is registered at priority 10: it's in python/tvm/relay/op/_tensor.py

register_shape_func("exp", False, elemwise_shape_func)

def register_shape_func(op_name, data_dependent, shape_func=None, level=10):
    """Register operator shape function for an op.

Now the progress is moving on.

I made a mistake with the tensor shape. After setting the correct tensor shape, the TVM exp() op's performance is still not as good as TF's or numpy's.

(1) For the fused op in question, the tensor shape is [100, 120, 120], float32. Running 400 times costs 6.02 s.

(2) TVM tir.exp() with the same tensor shape, without the subtract, costs 5322 ms (which is bad).

(3) numpy exp() with the same tensor shape and the same amount of computation costs 1933 ms.

So I think the problem is no longer a fusion issue. The tir.exp() op itself is slower than TF and numpy. I tested fast_exp(); its performance is similar to tir.exp().
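For comparison, measurement (3) can be reproduced with a short numpy timing sketch (the repeat count is reduced to 10 here to keep the run quick; scale linearly for 400 runs, and note the second operand's [100, 120, 1] shape is an assumption read off the broadcast in the TIR below):

```python
import time
import numpy as np

# Same tensor shape as the fused op in question: [100, 120, 120], float32.
a = np.random.rand(100, 120, 120).astype("float32")
# The TIR indexes the second input by the fused (100*120) axis only,
# so it broadcasts over the innermost axis.
b = np.random.rand(100, 120, 1).astype("float32")

repeats = 10  # the thread measures 400 runs; 10 keeps this sketch short
start = time.perf_counter()
for _ in range(repeats):
    out = np.exp(a - b)
elapsed = time.perf_counter() - start
print(f"{repeats} runs of subtract+exp took {elapsed * 1000:.1f} ms")
```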

The Relay IR and TIR for the fused subtract-exp op are as follows:

Relay IR:
  %450 = subtract(%448, %449) /* StatefulPartitionedCall/functional_1/activation_2/sub */ /* ty=Tensor[(100, 120, 120), float32] */;
  %451 = exp(%450) /* StatefulPartitionedCall/functional_1/activation_2/Exp */ /* ty=Tensor[(100, 120, 120), float32] */;

tir:
, GlobalVar(tvmgen_default_fused_subtract_exp): PrimFunc([placeholder, placeholder, T_exp]) attrs={"from_legacy_te_schedule": (bool)1, "global_symbol": "tvmgen_default_fused_subtract_exp", "tir.noalias": (bool)1, "target": llvm -keys=cpu -libs=dnnl -link-params=0 -mattr=avx,avx2,sse3,sse4.2,fma,avx512er,avx512f -mcpu=x86-64 -opt-level=3} {
  parallel (ax0.ax1.fused, 0, 12000) {
    for (ax2.outer, 0, 8) {
      for (ax2.inner.s, 0, 16) {
        if ((((ax2.outer*16) + ax2.inner.s) < 120)) {
          T_exp[(((ax0.ax1.fused*120) + (ax2.outer*16)) + ax2.inner.s)] = tir.exp((placeholder[(((ax0.ax1.fused*120) + (ax2.outer*16)) + ax2.inner.s)] - placeholder[ax0.ax1.fused]))
        }
      }
    }
  }
}

I still have a question: in the fused TIR, why are the inner and outer loops 16 and 8 instead of 100, 120, 120?
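For what it's worth, the 8/16 structure looks like TVM's default CPU schedule at work: the two outer axes (100 x 120) are fused into one parallel loop of extent 12000, and the innermost axis of 120 is split by an assumed vector factor of 16, giving ceil(120/16) = 8 outer iterations plus the if guard for the tail. A quick sketch checking that this index arithmetic covers exactly 0..119:

```python
import math

axis_len = 120  # innermost axis of the [100, 120, 120] tensor
factor = 16     # vectorization factor apparently chosen by the schedule

outer = math.ceil(axis_len / factor)  # matches ax2.outer's extent in the TIR
covered = sorted(
    o * factor + i
    for o in range(outer)
    for i in range(factor)
    if o * factor + i < axis_len  # the guard condition in the fused TIR
)
assert covered == list(range(axis_len))
print(f"outer={outer}, fused parallel extent={100 * 120}")
```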