Good point. If we can enumerate all possible “micro-kernels” ahead of time, I suppose it’s possible to decompose a tensor into a variable number of sub-tensors (in which case we might forgo optimizations like horizontal fusion) and dispatch the computation for each sub-tensor to a pre-compiled “micro-kernel”.
An interesting idea is to think about the transformation at higher-level IRs such as Relax. For example, Multi-Head Attention applied to ragged tensors (variable-length 2D arrays) can be dispatched to pre-compiled operators in different ways, e.g. row by row, where the number of rows does not need to be pre-determined.
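To make the row-wise dispatch idea concrete, here is a minimal plain-NumPy sketch (not the TVM/Relax API): `attention_kernel` stands in for a pre-compiled single-sequence attention micro-kernel, and the driver loops over the ragged batch at run time, so the number of rows never has to be known when the kernel is compiled.

```python
import numpy as np

def attention_kernel(q, k, v):
    """Stand-in for a pre-compiled per-sequence attention micro-kernel."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the last axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def ragged_attention(qs, ks, vs):
    """Ragged batch = list of (seq_len_i, d) arrays, seq_len_i varying per row.
    The loop length (number of rows) is only known at run time."""
    return [attention_kernel(q, k, v) for q, k, v in zip(qs, ks, vs)]

rng = np.random.default_rng(0)
d = 8
# Three sequences of different lengths -- a ragged 2D batch.
lengths = (3, 5, 2)
qs = [rng.standard_normal((n, d)) for n in lengths]
ks = [rng.standard_normal((n, d)) for n in lengths]
vs = [rng.standard_normal((n, d)) for n in lengths]

outs = ragged_attention(qs, ks, vs)
print([o.shape for o in outs])  # each output keeps its row's sequence length
```

In a real system the per-row calls would go to a compiled kernel (possibly one of several specialized variants chosen by sequence length), but the dispatch structure is the same.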