We at Arm would like to introduce a new optimization technique to TVM which we refer to as ‘cascading’, which aims to reduce a network’s peak memory usage. While many of the systems TVM targets today have plentiful memory available, for the embedded devices that uTVM aims to target, working memory in the form of SRAM is at a premium. This means that if a network’s peak memory usage exceeds the size of the SRAM, it simply cannot run on the device.
Cascading works by breaking operations up into a series of smaller sub-ops (acting on sub-tensors) which can be executed ‘deeply’ to avoid having to store large intermediate tensors. Let’s look at a simple example:
Say there are two NHWC convolutions (with weight format HWIO) one after the other, A → B. A has a 3x3 (HxW) kernel and B a 1x1 kernel. In this small network, A greatly increases the number of channels and B subsequently decreases them, a fairly typical pattern in convolutional networks. The tensors present in this network are tabulated below:
Operation | Input Tensor | Weights (HWIO) | Output Tensor | Output Size |
---|---|---|---|---|
Convolution A | (1, 26, 26, 3) | (3, 3, 3, 32) | (1, 24, 24, 32) | 18.4 kB |
Convolution B | (1, 24, 24, 32) | (1, 1, 32, 8) | (1, 24, 24, 8) | 4.6 kB |
We can see that the intermediate tensor (the output of A) is quite large compared to the final output tensor. If the available SRAM on the system were less than 18.4 kB, we would not be able to execute this network in a layer-by-layer fashion. However, due to the spatial locality of convolutions, producing a single output element of B does not require all of A’s output to be resident in memory. We can therefore tile A, which we refer to as ‘striping’, such that it is computed in, for instance, 16x(1, 6, 6, 32) output stripes. B can act on each of these stripes to produce part of its output, and the intermediate stripe can then be overwritten with the next one. Only 1/16th of the intermediate memory is now required (~1.2 kB), significantly decreasing the peak memory of the network and enabling it to run on the device.
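To make this concrete, below is a rough sketch of how such a cascade can already be expressed with the existing scheduling language, using te/topi directly. The shapes and the 6x6 stripe factor come from the example above; everything else (variable names, the choice of axis to compute at) is purely illustrative rather than a proposed API.

```python
import tvm
from tvm import te, topi

# The two convolutions from the example above (NHWC data, HWIO weights).
data = te.placeholder((1, 26, 26, 3), name="data")
w_a = te.placeholder((3, 3, 3, 32), name="w_a")
w_b = te.placeholder((1, 1, 32, 8), name="w_b")

conv_a = topi.nn.conv2d_nhwc(data, w_a, stride=1, padding=0, dilation=1)
conv_b = topi.nn.conv2d_nhwc(conv_a, w_b, stride=1, padding=0, dilation=1)

s = te.create_schedule(conv_b.op)

# 'Stripe' B's output into 6x6 spatial tiles (16 stripes in total)...
n, h, w, c = s[conv_b].op.axis
ho, hi = s[conv_b].split(h, factor=6)
wo, wi = s[conv_b].split(w, factor=6)
s[conv_b].reorder(n, ho, wo, hi, wi, c)

# ...and compute A only for the region each stripe needs, so the full
# (1, 24, 24, 32) intermediate tensor is never materialised at once.
s[conv_a].compute_at(s[conv_b], wo)

print(tvm.lower(s, [data, w_a, w_b, conv_b], simple_mode=True))
```

In the lowered loop nest, the buffer holding A’s output inside the stripe loops is only a stripe’s worth of data rather than the full intermediate tensor. What this cannot express is the rolling-buffer behaviour needed when B’s kernel is larger than 1x1, which is the subtlety discussed next.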
While I consider only the case of two operations here, in principle a cascade can extend through many more. I have also glossed over a subtlety by making convolution B a 1x1 convolution. If B’s kernel is larger, adjacent stripes overlap in the data they read from A: with a 3x3 kernel, for example, a 6-row output stripe of B needs 8 rows of A’s output, so consecutive stripes share 2 rows. This means it is not appropriate to entirely overwrite each stripe (unless you’re happy to recompute those elements). For this case, rolling buffers should be used to store the intermediate stripes, allowing the data that is no longer needed to be overwritten while retaining the important ‘edge’ data (see my question here: Creating store_at in TVM, on what’s necessary to make this possible).
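For those unfamiliar with rolling buffers, here is a toy NumPy illustration of the idea (deliberately not TVM, since the scheduling language cannot express this today). The row-at-a-time granularity, the `produce_row` stand-in and the sizes are all hypothetical; the point is only the circular, modulo-indexed storage that lets stale rows be overwritten while the shared ‘edge’ rows are retained.

```python
import numpy as np

ROWS, WIDTH, KH = 24, 24, 3      # intermediate height/width, consumer kernel height
rolling = np.zeros((KH, WIDTH))  # circular buffer holding just KH rows, not ROWS

def produce_row(r):
    # Stand-in for producing one row of the intermediate tensor (A's output).
    return np.full(WIDTH, float(r))

outputs = []
for r in range(ROWS):
    rolling[r % KH] = produce_row(r)  # overwrite the oldest (no longer needed) row
    if r >= KH - 1:
        # The consumer (B) reads the window of rows r-KH+1 .. r via modulo
        # indexing; the KH-1 rows shared with the previous window are reused.
        window = np.stack([rolling[(r - KH + 1 + k) % KH] for k in range(KH)])
        outputs.append(window.sum(axis=0))  # stand-in for B's reduction

result = np.stack(outputs)  # shape (ROWS - KH + 1, WIDTH)
```

A real implementation would operate on stripes of multiple rows and do this inside the lowered loop nest (hence the need for something like `store_at`), but the indexing principle is the same.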
By making extensive use of this technique, the peak memory usage of several well-known networks (e.g. MobileNet and VGG16) can be decreased to roughly a third of its original value. It is therefore of considerable interest for uTVM, where such a decrease can be the difference between supporting a network and not.
I’ve been doing some experiments recently into how this could be implemented in TVM and have found that the scheduling language allows me to construct such cascades reasonably simply (although it does not support rolling buffers). However, there is a conflict here: in order to schedule multiple convolutions together, they must end up in the same primitive function. Furthermore, there is currently no concept of ‘hierarchical scheduling’ in TVM, so if we schedule for the cascade we can’t also take advantage of the schedules from TOPI/AutoTVM. This suggests to me that a two-stage approach would be more appropriate: first the cascading, to break the ops into sub-ops, then the performance scheduling, which could take place on a sub-op by sub-op basis.
In this initial RFC I just wanted to introduce the technique, to see whether it’s something others have worked on and whether anyone has thoughts on the best way to integrate it into the stack. I’d also be interested to know if anyone else would find such a technique useful. We intend to arrive at a more formal proposal eventually, but need to see some more detail on the new TensorIR before then.
If you have any further questions/comments I’d be happy to answer them, thanks!