[Relay FuseOps] not working for kInjective -> commReduce ops

For a simple module containing a kInjective → kCommReduce op sequence

```
def @main(%a: Tensor[(5, 5), float32]) -> Tensor[(25), float32] {
  %0 = reshape(%a, newshape=[25, 1]) /* from_string */ /* ty=Tensor[(25, 1), float32] */;
  sum(%0, axis=[1]) /* from_string */ /* ty=Tensor[(25), float32] */
}
```

the FuseOps pass outputs

```
def @main(%a: Tensor[(5, 5), float32]) -> Tensor[(25), float32] {
  %0 = fn (%p0: Tensor[(5, 5), float32], Primitive=1) -> Tensor[(25, 1), float32] {
    reshape(%p0, newshape=[25, 1]) /* from_string */ /* ty=Tensor[(25, 1), float32] */
  };
  %1 = %0(%a) /* ty=Tensor[(25, 1), float32] */;
  %2 = fn (%p01: Tensor[(25, 1), float32], Primitive=1) -> Tensor[(25), float32] {
    sum(%p01, axis=[1]) /* from_string */ /* ty=Tensor[(25), float32] */
  };
  %2(%1) /* ty=Tensor[(25), float32] */
}
```

As a result, codegen does not fuse reshape and sum and emits two separate kernels, one per op. The reshape could clearly be inlined into the sum, saving the cost of an extra kernel launch.
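To make the cost concrete, here is a small NumPy sketch (NumPy is used purely for illustration; the variable names are mine): the reshape is pure index remapping, and the sum over a length-1 axis is the identity, so the whole pipeline collapses to a single pass over the input.

```python
import numpy as np

a = np.arange(25, dtype="float32").reshape(5, 5)

# Unfused pipeline: two steps, matching the two kernels FuseOps produces
step1 = a.reshape(25, 1)          # kInjective: pure index remapping
out_unfused = step1.sum(axis=1)   # kCommReduce over a length-1 axis

# Fused view: the reduction over a size-1 axis is the identity, so the
# whole computation is a single index-remapping pass over `a`
out_fused = a.reshape(25)

assert np.array_equal(out_unfused, out_fused)
```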

Is there a reason that these ops aren’t fused in FuseOps?

cc: @masahi @MarisaKirisame @jroesch @tqchen

hmm it’s not clear to me if we can generally fuse injective + reduce ops safely. GPU reduction, in particular, often needs to make multiple passes over inputs.

cc @altanh re gather + sum fusion in EmbeddingBag

I’m not sure I quite follow how multiple passes over inputs would make fusion unsafe. Do reductions consume the memory location of their input?

No, I was thinking that fused injective ops might be computed multiple times as we make multiple passes (“safely” was not the best word for my concern, sorry). The “recompute” for injective ops would just be index arithmetic, though, so it might not be too significant.
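A toy sketch of the recompute concern (all names here are mine, and the two-pass schedule is just an illustration, e.g. a max pass followed by an accumulate pass as in softmax-style reductions): when a reduction reads its input in multiple passes, a fused injective producer is re-evaluated on each pass, but that re-evaluation is only index arithmetic, not a materialized intermediate.

```python
import numpy as np

evals = {"n": 0}
a = np.arange(25, dtype="float32").reshape(5, 5)

def fused_read(i, j):
    # Virtual element (i, j) of reshape(a, (25, 1)): no buffer is
    # materialized, each read is pure index arithmetic on `a`
    evals["n"] += 1
    return a[i // 5, i % 5]

# Toy two-pass schedule over the virtual fused tensor
m = max(fused_read(i, 0) for i in range(25))   # pass 1: find the max
s = sum(fused_read(i, 0) for i in range(25))   # pass 2: accumulate

# The injective index math ran once per element per pass
assert evals["n"] == 2 * 25
```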

You could try modifying kInjective to kCommReduce and see what happens. I’m not completely sure if this is the right change (I haven’t looked at this code for some time), though.
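For intuition only, the decision being tweaked can be modeled as a table over operator pattern kinds. This is a hedged sketch in Python, not TVM's actual C++ implementation (the real pass lives in `fuse_ops.cc` and is considerably more involved); the enum values and the `can_fuse` helper are mine.

```python
# Pattern kinds ordered roughly by generality, mirroring the names
# used in this thread (values here are illustrative, not TVM's)
KELEMWISE, KBROADCAST, KINJECTIVE, KCOMMREDUCE, KOPAQUE = range(5)

def can_fuse(producer_kind, consumer_kind):
    """Sketch of a fusion rule: may `producer` be inlined into `consumer`?"""
    # Injective-or-simpler producers fuse into injective consumers
    if producer_kind <= KINJECTIVE and consumer_kind <= KINJECTIVE:
        return True
    # The change discussed above: also allow injective -> commutative reduce
    if producer_kind <= KINJECTIVE and consumer_kind == KCOMMREDUCE:
        return True
    return False
```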

for the record, it looks like modifying kInjective to kCommReduce here does the right thing

thanks for clarifying! It would be great if we had benchmarks for model performance with

  1. current state of TVM
  2. allowing fusion between kInjective → kCommReduce

to test your theory, but it’s not critical :slight_smile: