As discussed in the June 27, 2023, community meeting, here is a more concrete proposal for implementing in-place operations.
The basic mechanism for implementing in-place operations would be to replace operator calls with TIR `PrimFunc`s that compute in-place. We could perform such a replacement using a special operator `call_tir_inplace`, which would be similar to `call_tir`:

`call_tir_inplace(gvar, args, out_idxs, packed_ints)`:

- `gvar` is the global variable identifying a TIR `PrimFunc`.
- `out_idxs` indicates which of the args should be considered “outputs.”
- The `PrimFunc` will be invoked with the arguments `args`, mutating the arguments whose indices match `out_idxs`. The operator will return aliases of those modified arguments.
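To make the semantics concrete, here is a hedged sketch of such a call in Relax-style pseudocode (the surface syntax is illustrative only, since the operator is merely proposed here; `inplace_add` is assumed to be a hand-written TIR `PrimFunc` that writes `x + y` into `x`'s buffer):

```python
z = call_tir_inplace(
    inplace_add,  # gvar: the TIR PrimFunc to invoke
    [x, y],       # args: passed through to the PrimFunc
    [0],          # out_idxs: x (index 0) is mutated and treated as the output
    [],           # packed_ints
)
# After the call, z is an alias of the mutated x.
```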
As mentioned in the community meeting, having `call_tir_inplace` as a separate operator has the benefit of allowing us to treat it as pure, so it can appear inside `DataflowBlock`s, and also allows for special handling in passes like fusion or layout propagation.
The tricky problem with in-place operations, as we will explore below, is that detecting opportunities to make ops in-place should ideally be done early in compilation (when dataflow blocks are still intact) but implementing them may require low-level changes to memory allocations.
Simple Case
Let’s suppose that we intend to implement an in-place addition, where we have `z = add(x, y)`. We can replace this call with an in-place version if the following conditions are met:

- Whichever argument is chosen to hold the result (see the third condition) must not be live after that point in the program.
- No alias of the argument chosen to hold the result may be live after that point in the program.
- Either `x` or `y` has the same type (especially shape) as the output.
If these conditions are met, we can replace this call with `z = call_tir_inplace(inplace_add, [x, y], out_idxs=[0], [])` (the choice of which argument should be the output depends on which one matches the output type). The `inplace_add` PrimFunc would be manually written. If the conditions are not met, then we do nothing and legalize the `add` later.
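As a hedged sketch of this rewrite decision (all helper names and attributes here are hypothetical, not an existing API):

```python
def try_make_add_inplace(call, live_after, aliases):
    """Return an in-place replacement for z = add(x, y) when the three
    conditions hold, or None otherwise. live_after is the set of values
    live after the call; aliases[v] is the set of known aliases of v."""
    x, y = call.args
    for idx, arg in enumerate((x, y)):
        # Conditions 1 and 2: the chosen argument and all of its aliases
        # must be dead after this point in the program.
        if (aliases[arg] | {arg}) & live_after:
            continue
        # Condition 3: the argument's type (especially shape) must match
        # the output's.
        if arg.struct_info != call.struct_info:
            continue
        return make_call_tir_inplace(inplace_add, [x, y], out_idxs=[idx])
    return None  # conditions not met; legalize the add normally later
```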
However, it should be noted that checking these conditions can be complicated, as there are many actions that could potentially create aliases. For example, we must conservatively assume that any packed call could alias any of its arguments or, if we consider some truly adversarial examples, alias any value ever passed to a packed func, per the example below. The same example also shows how a packed call is a nightmare for liveness analysis: the argument passed to `packed1` is still live if `packed2` is called later in the program, even though the compiler cannot know that there is any relationship between the two. Thus, packed calls are “black holes” for both alias and liveness analysis: the result of a packed call can be the alias of any value ever passed to any packed call.
```python
GLOBAL_VAR = None

def packed1(x):
    # Stash the argument in a global variable: x escapes packed1.
    global GLOBAL_VAR
    GLOBAL_VAR = x
    return x

# The result of packed2 aliases a value that was passed to packed1!
def packed2(y):
    return GLOBAL_VAR
```
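To spell out the liveness problem with the same functions (continuing the example above):

```python
t = packed1(some_tensor)  # some_tensor escapes into GLOBAL_VAR
# No further textual uses of some_tensor, so a naive liveness analysis
# would consider it dead here...
u = packed2(None)         # ...yet u aliases some_tensor!
```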
Of course, we do not have to permit these kinds of adversarial examples, but if we want to define restrictions on what packed calls are allowed to do, we need to be very up-front about them, as there is no practical way for the compiler to enforce such restrictions. Unfortunately, I could imagine that the pattern of setting or reading a global variable via PackedFuncs could legitimately be needed, so there could be issues with adding such restrictions.
Even if we analyze only values within `DataflowBlock`s that are `DataflowVar`s, there is no restriction on assigning a `DataflowVar` to a variable from outside the block, which brings back all the same possible aliasing issues.
Most likely, we will need some kind of escape hatch for liveness and alias analysis: some annotation a user can give to insist to the compiler that a variable is safe to use despite having been passed to a packed call, e.g., a `non_aliased(x)` operator or a `call_packed_nonaliasing(func, *args, sinfo_args)` operator. These would definitely be pretty finicky details for users to manage, though the opaque nature of PackedFuncs gives us little choice here.
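For illustration, usage of such an escape hatch might look like the following Relax-style pseudocode (both operators are hypothetical, as noted above):

```python
# Variant 1: assert that this particular packed call aliases nothing,
# so its arguments can still be considered dead after their last uses.
y = call_packed_nonaliasing("my_packed_func", x, sinfo_args=[out_sinfo])

# Variant 2: assert after the fact that a particular value is unaliased,
# overriding the conservative treatment of earlier packed calls.
x2 = non_aliased(x)
```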
When Shapes Do Not Match
In the above example of adding in-place, we may consider the situation where we have `z = add(x, y)` and neither `x` nor `y` has the same shape as `z`. This would be the case if broadcasting is taking place. It would still be possible to perform the addition in-place if either `x` or `y` has an underlying storage that is large enough to hold the broadcast result. This means that handling such a case would require reasoning about the sizes of allocated regions of memory rather than only tensor shapes.
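Concretely, the check in this case compares storage sizes rather than shapes. A minimal sketch, assuming a hypothetical way to query the size of an argument's underlying storage:

```python
import math

def fits_in_storage(out_shape, dtype_bytes, arg_storage_bytes):
    # Can the broadcast result be written into the candidate
    # argument's existing storage?
    return math.prod(out_shape) * dtype_bytes <= arg_storage_bytes

# E.g., if x has shape (4, 1) in float32 but its storage was allocated
# with 64 bytes, it can hold a (4, 4) broadcast result (4*4*4 = 64 bytes).
assert fits_in_storage((4, 4), 4, 64)
```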
The in-place optimizations proposed at the June 27, 2023, community meeting of eliding concatenations and splits similarly require reasoning at the level of memory allocation: eliding a concatenation means ensuring that the concatenated tensors are allocated in a single storage so that the concatenation itself is a no-op, and eliding a split means ensuring that the tensors created by splitting are all just offsets within the same storage. Both optimizations can be done only if the inputs to the operators will not be live afterwards.
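The memory-layout intuition behind these elisions can be illustrated with NumPy views (this sketches only the layout, not the Relax implementation):

```python
import numpy as np

# Concat elision: allocate one storage and have the producers of a and b
# write into adjacent offsets; the concatenation itself is then a no-op.
storage = np.empty(8, dtype="float32")
a = storage[:3]   # the producer of a writes here
b = storage[3:]   # the producer of b writes here
c = storage       # "concat([a, b])" without any copy

# Split elision: the results of a split can be views at offsets into the
# parent's storage.
s0, s1 = c[:3], c[3:]
assert np.shares_memory(s0, c) and np.shares_memory(s1, c)
```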
This poses a bit of a problem from the perspective of phase ordering, since the simple case in the previous section does not need to reason about storages but cases like split, concat, and non-matching shapes do. However, reasoning about whether the aliasing and liveness conditions are satisfied at a low level can be more complicated, so it would be preferable to handle those analyses earlier in compilation.
One solution would be to do the liveness and alias analyses early, replace the operator with a different version that is marked “in-place,” and later in compilation replace the special operator and its inputs with the appropriate versions (performing the memory allocations as specified and an in-place TIR call for the operator).
This approach would require two passes:
- A pass early in compilation that identifies opportunities to perform operators in-place. This would be responsible for doing the liveness and alias analyses. In the simple case where the shapes of the input and output match, the in-place version can be inserted immediately. In cases where that is not so, the operator can be replaced with a version that indicates that it is a candidate for being done in-place (e.g., `add_inplace` for `add`).
- A pass later in compilation that analyzes the candidates for in-place versions and ensures that the memory allocations are performed appropriately to facilitate these in-place versions (this could perhaps be done as part of `StaticPlanBlockMemory`). This would be where the proposed optimizations for `split` and `concat` would be done. (See the sketch after this list for how the two passes might divide the work.)
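A hedged skeleton of how the two passes might divide the work (all helper names are hypothetical):

```python
def find_inplace_opportunities(func):
    """Pass 1 (early, while dataflow blocks are intact): run the liveness
    and alias analyses and apply the rewrites they justify."""
    live = liveness_analysis(func)
    aliases = alias_analysis(func)
    for call in candidate_calls(func):
        if not args_dead_after(call, live, aliases):
            continue
        if output_type_matches_an_arg(call):
            # Simple case: rewrite to call_tir_inplace immediately.
            replace(call, make_call_tir_inplace(call))
        else:
            # Otherwise, mark it for the later pass (e.g., add_inplace).
            replace(call, mark_as_inplace_candidate(call))

def plan_inplace_memory(func):
    """Pass 2 (late, perhaps as part of StaticPlanBlockMemory): arrange
    storages so the marked candidates can actually mutate their inputs."""
    for call in inplace_candidates(func):
        coalesce_or_offset_storages(call)  # e.g., the split/concat elisions
        replace(call, lower_to_call_tir_inplace(call))
```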
An alternative approach might be to do all analysis and replacements in one pass, which would have to come late in compilation since the more complex cases involve dealing with explicit memory allocations. However (see the phase-ordering discussion), exactly where in the phase ordering this pass should be done is unclear: it should come before legalization, since legalization would get rid of all the Relax operators, but it would also have to deal with explicit memory allocations, which are normally handled after legalization.
A third possibility would be to add another parameter to `call_tir_inplace` that gives the size of the output. The operators could thus be replaced immediately with `call_tir_inplace` when appropriate, and `StaticPlanBlockMemory` would only have to handle the single case of `call_tir_inplace`. This would solve certain cases, like the broadcasting add example, but it is not clear how it would account for the `concat` or `split` cases. We could perhaps address those through some manner of specifying that the arguments or results should be contiguous in memory or share an underlying storage (would a single flag suffice?).
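As a sketch of that third possibility (the extra parameter and the contiguity flag are hypothetical, per the open question above):

```python
# Hypothetical extension: out_sizes gives the storage size in bytes needed
# for each output, so memory planning only has to inspect call_tir_inplace.
z = call_tir_inplace(inplace_add, [x, y], [0], [], out_sizes=[64])

# For concat/split, perhaps a single flag asserting that the listed
# arguments share one contiguous underlying storage:
c = call_tir_inplace(inplace_concat, [a, b], [0], [], contiguous=True)
```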
Summary
The two-stage approach seems like it would be the least difficult to design, though it also poses some phase-ordering difficulties (`StaticPlanBlockMemory` would become a bit overloaded in terms of its intended functionality). Specifying memory allocation details in `call_tir_inplace` might also be a reasonably simple solution. In any case, it is clear that dealing with memory allocations as part of handling in-place operations is a source of complexity, and I would welcome further discussion on how the implementation should be approached, particularly from the perspective of how memory allocations are represented in Relax.