I was looking for something like this a couple of months back, but to no avail.
It would be useful to have, I’m just unsure what changes would be needed. In a sense we already have in-place operations when we fuse conv2d+relu layers (afaik), since the ReLU is applied to the accumulated value as soon as it is ready.
Doing this requires a specialised pass (though I haven’t read the code for it). One could in principle do something similar for your use case. But it’s more interesting to consider what a general solution would look like, one that could easily be used at the Python te.compute expression level.
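To make the accumulate-then-ReLU fusion concrete, here is a toy sketch in plain Python/NumPy (not actual TVM code, and the function name is just illustrative): the elementwise ReLU is applied to each accumulator the moment its reduction finishes, so no separate intermediate buffer or second pass is needed.

```python
import numpy as np

def conv1d_relu_fused(x, w):
    """Toy 1-D convolution with ReLU fused into the accumulation loop."""
    n, k = len(x), len(w)
    out = np.empty(n - k + 1)
    for i in range(n - k + 1):
        acc = 0.0
        for j in range(k):            # accumulate the convolution window
            acc += x[i + j] * w[j]
        out[i] = max(acc, 0.0)        # ReLU applied as soon as acc is ready
    return out
```

The unfused version would first materialise the whole convolution output and then run ReLU over it in a second pass; fusing simply moves the `max` inside the outer loop, which is roughly what the specialised conv2d+relu pass arranges at the schedule level.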