For a simple module with a kInjective → kCommReduce op pattern

```
def @main(%a: Tensor[(5, 5), float32]) -> Tensor[(25), float32] {
%0 = reshape(%a, newshape=[25, 1]) /* from_string */ /* ty=Tensor[(25, 1), float32] */;
sum(%0, axis=[1]) /* from_string */ /* ty=Tensor[(25), float32] */
}
```

the `FuseOps` pass outputs

```
def @main(%a: Tensor[(5, 5), float32]) -> Tensor[(25), float32] {
%0 = fn (%p0: Tensor[(5, 5), float32], Primitive=1) -> Tensor[(25, 1), float32] {
reshape(%p0, newshape=[25, 1]) /* from_string */ /* ty=Tensor[(25, 1), float32] */
};
%1 = %0(%a) /* ty=Tensor[(25, 1), float32] */;
%2 = fn (%p01: Tensor[(25, 1), float32], Primitive=1) -> Tensor[(25), float32] {
sum(%p01, axis=[1]) /* from_string */ /* ty=Tensor[(25), float32] */
};
%2(%1) /* ty=Tensor[(25), float32] */
}
```

Thus codegen does not fuse reshape and sum, and instead produces two separate kernels, one per op. It seems the reshape could be inlined into the sum, which would save the cost of an extra kernel launch.
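To make the point concrete: summing over a length-1 axis is the identity, so the reshape-then-sum pipeline above is just a flatten, i.e. a single elementwise copy kernel. A quick numpy sketch of what the two Relay ops compute:

```python
import numpy as np

a = np.arange(25, dtype="float32").reshape(5, 5)

# What the two Relay ops compute: reshape (5, 5) -> (25, 1),
# then sum over axis 1 (a length-1 axis, so the sum is a no-op).
unfused = np.sum(a.reshape(25, 1), axis=1)

# The whole pipeline therefore reduces to a flatten,
# which a single fused kernel could emit as one copy.
fused = a.reshape(25)

assert np.array_equal(unfused, fused)
```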

Is there a reason that these ops aren’t fused in FuseOps?