On some hardware, conv is supported directly at the block level. For large inputs, though, the data has to be split along the H and W directions with overlap, as in the following figure. How can this overlapped split be done?
Well, you can look at the VTA conv2d optimization tutorial. It does blocking with overlap.
An easy way to think about it is to work backward from the outputs to the inputs. In other words, if you block the output feature map along H and W and compute which input feature pixels each output block needs (this depends on the stride, the kernel size, etc.), you will see that neighboring output blocks need overlapping regions of the input along H and W.
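To make the output-to-input reasoning concrete, here is a minimal Python sketch for one spatial axis (apply it once for H and once for W). It is not taken from the VTA tutorial; the names `input_range` and `tile_ranges` are made up for illustration, and it assumes a plain convolution with no dilation and symmetric zero padding.

```python
def input_range(out_start, out_end, stride, kernel, pad, in_size):
    """Input indices [start, end) needed to compute outputs [out_start, out_end).

    Assumes no dilation. Indices that fall into the zero-padding region are
    clamped away, so the tile loader (or the hardware) must supply the padding.
    """
    start = out_start * stride - pad
    end = (out_end - 1) * stride - pad + kernel
    return max(start, 0), min(end, in_size)


def tile_ranges(in_size, kernel, stride, pad, out_tile):
    """Block the output axis into tiles and map each tile back to the input."""
    out_size = (in_size + 2 * pad - kernel) // stride + 1
    tiles = []
    for o0 in range(0, out_size, out_tile):
        o1 = min(o0 + out_tile, out_size)  # last tile may be shorter
        tiles.append(((o0, o1), input_range(o0, o1, stride, kernel, pad, in_size)))
    return tiles


if __name__ == "__main__":
    # Each entry is ((out_start, out_end), (in_start, in_end)).
    for out_rng, in_rng in tile_ranges(in_size=8, kernel=3, stride=1, pad=0, out_tile=3):
        print(out_rng, in_rng)
```

With an input size of 8, a 3-wide kernel, stride 1, and no padding, the two output tiles (0, 3) and (3, 6) map to input ranges (0, 5) and (3, 8). They share two input rows, which is kernel minus stride for interior tiles.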
That looks like it should work. We will try it.
Thanks a lot!