For question 1: absolutely, it could.
For example, FuseTIR, a Relax pass, supports simple op fusion, and it is straightforward to extract the fused kernel with Relax after applying a schedule to it.
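A minimal sketch of that flow (assuming a recent TVM build with the Relax API; the module and shapes below are made up for illustration): LegalizeOps lowers the Relax ops to call_tir with generated PrimFuncs, FuseOps groups the matmul with its elementwise epilogue, and FuseTIR merges each group into a single PrimFunc that you can then extract and schedule on its own.

import tvm
from tvm import relax
from tvm.script import relax as R

@tvm.script.ir_module
class Module:
    @R.function
    def main(x: R.Tensor((128, 128), "float16"),
             w: R.Tensor((128, 128), "float16"),
             b: R.Tensor((128, 128), "float16")) -> R.Tensor((128, 128), "float16"):
        with R.dataflow():
            y = R.matmul(x, w)
            z = R.add(y, b)  # elementwise epilogue to be fused with the matmul
            R.output(z)
        return z

mod = relax.transform.LegalizeOps()(Module)           # relax ops -> call_tir + PrimFuncs
mod = relax.transform.FuseOps(fuse_opt_level=2)(mod)  # group matmul + add
mod = relax.transform.FuseTIR()(mod)                  # merge each group into one PrimFunc

# The fused PrimFunc(s) can now be pulled out and scheduled/tuned standalone.
for gv, func in mod.functions.items():
    if isinstance(func, tvm.tir.PrimFunc):
        print(gv.name_hint)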
Some TVM-based tuners also enable efficient kernel tuning, for example fusing gemm + layout_transform + bias. You can connect different ops with connect_tensor_graph (or fuse consecutive GEMMs within an end-to-end graph), as in the snippet below:
# Declare each op: a tensor-core GEMM followed by reshape,
# layout_transform and bias epilogues.
arg1 = ladder_gemm(M, N, K, wmma_m, wmma_n, wmma_k)
arg2 = reshape(M, N, wmma_m, wmma_n)
arg3 = layout_transform(M, N)
arg4 = bias(1, M, N)

# Chain the ops into one fused tensor graph: each call maps the next
# op's first tensor (its input) to the last tensor of the accumulated
# graph (its output).
args = arg1
args = tuple(connect_tensor_graph(args, arg2, {arg2[0]: arg1[-1]}))
args = tuple(connect_tensor_graph(args, arg3, {arg3[0]: args[-1]}))
args = tuple(connect_tensor_graph(args, arg4, {arg4[0]: args[-1]}))