How to prevent an op from being fused?

Hi, I ran into a performance issue with a fused op. tvmgen_default_fused_subtract_exp() runs 400 times and takes 6071 ms in total, which is slow. How can I prevent the exp() op from being fused with other ops? I replaced exp() with fast_exp(), but it had no effect.

Any suggestions? Thanks.

Got the answer: set TOpPattern to kOpaque. Please correct me if I am wrong.

.set_attr<TOpPattern>("TOpPattern", kOpaque)

When I modified the TOpPattern of the exp() op, it crashed with this error:

terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [15:46:05] /data/tvm-0.7/src/ir/../node/attr_registry.h:111: 
An error occurred during the execution of TVM.
For more information, please see:
  Check failed: (p.second != plevel) is false: Attribute TOpPattern of exp is already registered with same plevel=10

  void UpdateAttr(const String& attr_name, const KeyType& key, runtime::TVMRetValue value,
                  int plevel) {
    std::pair<TVMRetValue, int>& p = op_map->data_[index];
    ICHECK(p.second != plevel) << "Attribute " << attr_name << " of " << key->AttrRegistryName()
                               << " is already registered with same plevel=" << plevel;

It seems we can’t modify the TOpPattern of a built-in op, right? What should I do if I want exp() to stay unfused?

Hi, @wenxian.

Set TOpPattern to kOpaque. Please correct me if I am wrong.

.set_attr<TOpPattern>("TOpPattern", kOpaque)

I haven’t tried this yet, but I would try the same approach.

When I modified the TOpPattern of the exp() op, it crashed with this error:

I can't be certain without looking at the exact code, but it seems the TOpPattern attribute for the exp() op is already registered somewhere at priority level 10. When you register the op as kOpaque, it might be worth trying a higher priority level.
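To make the priority suggestion concrete, here is a small pure-Python model of the check quoted from attr_registry.h. It is only a sketch: the class and method names are made up for illustration, "kElemWise" as the built-in pattern for exp is an assumption, and the rule that a lower plevel is silently ignored is my reading of the registry rather than something shown in this thread.

```python
class ToyAttrRegistry:
    """Simplified stand-in for TVM's op attribute registry (hypothetical)."""

    def __init__(self):
        # (attr_name, op_name) -> (value, plevel)
        self._data = {}

    def update_attr(self, attr_name, op_name, value, plevel=10):
        key = (attr_name, op_name)
        if key in self._data:
            _, old_plevel = self._data[key]
            # Mirrors the ICHECK in the log: an equal plevel is rejected.
            if old_plevel == plevel:
                raise RuntimeError(
                    f"Attribute {attr_name} of {op_name} is already "
                    f"registered with same plevel={plevel}"
                )
            # Assumption: a lower plevel never overrides a higher one.
            if old_plevel > plevel:
                return
        self._data[key] = (value, plevel)

    def get(self, attr_name, op_name):
        return self._data[(attr_name, op_name)][0]


registry = ToyAttrRegistry()
# Stand-in for the built-in registration at the default level 10.
registry.update_attr("TOpPattern", "exp", "kElemWise")

try:
    # Re-registering at the same level reproduces the reported failure.
    registry.update_attr("TOpPattern", "exp", "kOpaque")
    same_level_error = None
except RuntimeError as err:
    same_level_error = str(err)

# A higher level is accepted and replaces the earlier value.
registry.update_attr("TOpPattern", "exp", "kOpaque", plevel=11)
final_pattern = registry.get("TOpPattern", "exp")
print(same_level_error)
print(final_pattern)
```

In real TVM the Python-side equivalent would presumably be a pattern registration with a level argument above 10 (something like register_pattern with level=11 in tvm.relay.op), but please check that against the version you have installed before relying on it.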

Will it actually help to keep exp from being fused with other operators? If you have some trick to accelerate the exp op on its own, then avoiding fusion would be a good choice. But if exp is simply left unfused, it is still in the network and will cost the same time. So I don’t see why you want to avoid exp fusion. Thanks for clarifying my confusion.

Thanks for the answers.

I found where the attribute for the exp() op is registered at priority level 10. It’s in python/tvm/relay/op/

register_shape_func("exp", False, elemwise_shape_func)

def register_shape_func(op_name, data_dependent, shape_func=None, level=10):
    """Register operator shape function for an op.

I have made some progress since then.

I had made a mistake with the tensor shape. With the correct tensor shape set, the TVM exp() op still performs worse than TF's or numpy's.

(1) In the fused op in question, the tensor shape is [100, 120, 120], float32. Running it 400 times takes 6.02 s.

(2) TVM tir.exp() alone, with the same tensor shape and no subtract, takes 5322 ms (which is bad).

(3) numpy exp() with the same tensor shape and the same amount of computation takes 1933 ms.

So I think the problem is no longer about fusion. The tir.exp() op itself is slower than TF and numpy. I also tested fast_exp(); its performance is similar to that of tir.exp().
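For anyone comparing these numbers, the short Python sketch below just normalizes the three totals reported above into time per run and per element. The figures are copied from this post; the variable and function names are only illustrative.

```python
ELEMENTS = 100 * 120 * 120   # tensor shape [100, 120, 120]
RUNS = 400

# Total wall time in milliseconds for 400 runs, as reported in this thread.
total_ms = {
    "TVM fused subtract+exp": 6020.0,
    "TVM tir.exp only": 5322.0,
    "numpy exp": 1933.0,
}


def per_run_ms(total):
    """Average time of a single run in milliseconds."""
    return total / RUNS


def ns_per_element(total):
    """Average time spent per tensor element in nanoseconds."""
    return total * 1e6 / (RUNS * ELEMENTS)


for name, total in total_ms.items():
    print(f"{name}: {per_run_ms(total):.2f} ms/run, "
          f"{ns_per_element(total):.2f} ns/element")

# How much slower tir.exp alone is compared with numpy.
slowdown = total_ms["TVM tir.exp only"] / total_ms["numpy exp"]
print(f"tir.exp vs numpy: {slowdown:.2f}x")
```

By this arithmetic tir.exp alone is about 2.75 times slower than numpy, and the subtract adds fewer than 2 ms per run on top of it, which is consistent with exp itself being the bottleneck rather than the fusion.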

The Relay IR and TIR for the fused subtract/exp op are as follows:

Relay IR:
  %450 = subtract(%448, %449) /* StatefulPartitionedCall/functional_1/activation_2/sub */ /* ty=Tensor[(100, 120, 120), float32] */;
  %451 = exp(%450) /* StatefulPartitionedCall/functional_1/activation_2/Exp */ /* ty=Tensor[(100, 120, 120), float32] */;

TIR:
GlobalVar(tvmgen_default_fused_subtract_exp): PrimFunc([placeholder, placeholder, T_exp]) attrs={"from_legacy_te_schedule": (bool)1, "global_symbol": "tvmgen_default_fused_subtract_exp", "tir.noalias": (bool)1, "target": llvm -keys=cpu -libs=dnnl -link-params=0 -mattr=avx,avx2,sse3,sse4.2,fma,avx512er,avx512f -mcpu=x86-64 -opt-level=3} {
  parallel (ax0.ax1.fused, 0, 12000) {
    for (ax2.outer, 0, 8) {
      for (ax2.inner.s, 0, 16) {
        if ((((ax2.outer*16) + ax2.inner.s) < 120)) {
          T_exp[(((ax0.ax1.fused*120) + (ax2.outer*16)) + ax2.inner.s)] = tir.exp((placeholder[(((ax0.ax1.fused*120) + (ax2.outer*16)) + ax2.inner.s)] - placeholder[ax0.ax1.fused]))

I still have one question: why are the inner and outer loop extents in the fused TIR 16 and 8, rather than 100, 120, 120?
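One possible reading of the TIR above (an interpretation, not something confirmed in this thread): the first two axes are fused into a single parallel loop of 100 * 120 = 12000 iterations, and the last axis of extent 120 is split by a factor of 16, which gives ceil(120 / 16) = 8 outer iterations. Since 8 * 16 = 128 overshoots 120, the `< 120` guard skips the extra 8 positions. The Python sketch below replays that loop nest with hypothetical helper names and checks that every output element is written exactly once.

```python
import math


def loop_nest_indices(shape=(100, 120, 120), factor=16):
    """Yield the flat T_exp index written by each guarded loop iteration."""
    d0, d1, d2 = shape
    outer_extent = math.ceil(d2 / factor)      # 8 for d2 = 120, factor = 16
    for fused in range(d0 * d1):               # ax0.ax1.fused: 0..11999
        for outer in range(outer_extent):      # ax2.outer
            for inner in range(factor):        # ax2.inner.s
                col = outer * factor + inner
                if col < d2:                   # guard from the TIR dump
                    yield fused * d2 + col


def covers_every_element_once(shape=(100, 120, 120), factor=16):
    """True if the loop nest writes each element of the tensor exactly once."""
    d0, d1, d2 = shape
    visits = bytearray(d0 * d1 * d2)
    for idx in loop_nest_indices(shape, factor):
        visits[idx] += 1
    return all(count == 1 for count in visits)


print(math.ceil(120 / 16))            # outer extent of the split axis
print(covers_every_element_once())    # whole [100, 120, 120] tensor covered?
```

If this reading is right, the original shape has not gone anywhere: 12000 * 120 still covers [100, 120, 120], just expressed through a fused axis and a split axis. The factor of 16 would match the 16 float32 lanes of a 512-bit vector register, but that part is a guess on my side. Note also that the second placeholder is indexed by ax0.ax1.fused alone, so it appears to be broadcast along the last axis.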