[RFC] 'Cascade' Scheduling

We at Arm would like to introduce a new optimization technique to TVM which we refer to as ‘cascading’, which aims to reduce a network’s peak memory usage. While many of the systems TVM targets today have plentiful memory available, for the embedded devices that uTVM aims to target, working memory in the form of SRAM is at a premium. This means that if a network’s peak memory usage exceeds the size of the SRAM, the network cannot be run at all.

Cascading works by breaking operations up into a series of smaller sub-ops (acting on sub-tensors) which can be executed ‘deeply’ to avoid having to store large intermediate tensors. Let’s look at a simple example:

Say there are two NHWC convolutions (with weight format HWIO) one after the other, A → B. A has an h×w kernel of 3x3 and B of 1x1. In this small network, A greatly increases the number of channels and B subsequently decreases them; this is a fairly typical pattern in convolutional networks. The tensors present in this network are tabulated below.

| Operation | Input Tensor | Weights | Output Tensor | Output Size |
| --- | --- | --- | --- | --- |
| Convolution A | (1, 26, 26, 3) | (3, 3, 3, 32) | (1, 24, 24, 32) | 18.4 kB |
| Convolution B | (1, 24, 24, 32) | (1, 1, 32, 8) | (1, 24, 24, 8) | 4.6 kB |

We can see that the intermediate tensor (the output of A) is quite large compared to the final output tensor. If the available SRAM on the system were less than 18.4 kB, we would not be able to execute this network in a layer-by-layer fashion. But, we can note that to produce a single output element of B doesn’t require all of A’s output to be in memory due to the spatial locality of convolutions. We can therefore tile A, which we refer to as ‘striping’, such that it is computed in, for instance, 16x(1, 6, 6, 32) output stripes. B can act on these stripes to produce part of its output, then the intermediate stripe can be overwritten with the next one. This means now only 1/16th of the intermediate memory is required (or ~ 1.2 kB) significantly decreasing the peak memory of the network and enabling it to be run on the device.
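The memory arithmetic above can be checked with a few lines of plain Python. This is an illustrative sketch only (not TVM code, and `tensor_bytes` is a hypothetical helper); it assumes 1-byte (int8) elements, consistent with the sizes quoted above:

```python
# Peak-memory arithmetic for the two-convolution example above,
# assuming 1-byte (int8) elements. Illustrative only, not TVM code.

def tensor_bytes(shape, elem_bytes=1):
    n = 1
    for dim in shape:
        n *= dim
    return n * elem_bytes

full = tensor_bytes((1, 24, 24, 32))    # whole intermediate tensor (output of A)
stripe = tensor_bytes((1, 6, 6, 32))    # one of the 16 output stripes

print(full, stripe, full // stripe)     # 18432 1152 16 -> ~18.4 kB vs ~1.2 kB
```

With striping, only one 1152-byte stripe of the intermediate tensor is live at a time, which is the 1/16th reduction described above.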

While I consider only the case of two operations here, in principle this can extend through many. However, I have also dodged a subtlety by making convolution B a 1x1 convolution. If its kernel is larger, there is an overlap in data usage between adjacent stripes, so it is not appropriate to entirely overwrite each stripe (unless you’re happy to recompute some elements). For this case, rolling buffers should be used to store the intermediate stripes: they allow partial overwrite of the unneeded data while retaining the important ‘edge’ data (see my question ‘Creating store_at in TVM’ on what’s necessary to make this possible).
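For intuition, here is a tiny 1-D sketch of the rolling-buffer idea in plain Python (the `producer`/`cascade` names are hypothetical stand-ins, not TVM API): a consumer with a 3-wide kernel reads producer outputs through a 3-slot ring buffer, so each newly produced element overwrites only the oldest, no-longer-needed slot while the overlapping ‘edge’ data is retained:

```python
# Hypothetical 1-D rolling-buffer sketch. The producer stands in for
# conv A (one output "row" at a time); the consumer stands in for a
# 3-wide conv B that needs a window of 3 adjacent producer outputs.

def producer(i):
    return i * i                        # stand-in for producer output row i

def cascade(out_len, k=3):
    ring = [0] * k                      # rolling buffer: only k rows live
    for i in range(k - 1):              # prime with the first k-1 rows
        ring[i % k] = producer(i)
    out = []
    for j in range(out_len):
        # The newest row overwrites the oldest; rows j..j+k-1 stay live.
        ring[(j + k - 1) % k] = producer(j + k - 1)
        out.append(sum(ring))           # stand-in for one consumer output
    return out

# Matches computing everything with the full producer output in memory:
assert cascade(4) == [sum(producer(j + d) for d in range(3)) for j in range(4)]
```

Only `k` producer rows are ever resident, yet nothing is recomputed, which is exactly the property a rolling buffer would give the real cascade.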

By making extensive use of this technique, the peak memory usage of several well-known networks (e.g. MobileNet and VGG16) can be decreased to roughly a third of its initial value. It is therefore of considerable interest for uTVM, where such a decrease can be the difference between supporting a network and not.

I’ve been doing some experiments recently into how this could be implemented in TVM and have found that the scheduling language lets me construct such cascades fairly simply (although it does not support rolling buffers). However, there is a conflict here: in order to schedule multiple convolutions together, they must end up in the same primitive function. Further, there is currently no concept of ‘hierarchical scheduling’ in TVM, so if we schedule for the cascade we can’t also take advantage of the schedules from TOPI/AutoTVM. This suggests to me that two-stage scheduling would be more appropriate: first the cascading to break the ops into sub-ops, then the performance scheduling, which could take place on a sub-op by sub-op basis.

In this initial RFC I just wanted to introduce the technique to see if it’s something others have worked on or have any thoughts about the best method to integrate it within the stack. I’d also be interested to know if anyone else would find such a technique to be a useful feature. We intend to arrive at a more formal proposal eventually, but need to see some more detail on the new TensorIR before then.

If you have any further questions/comments I’d be happy to answer them, thanks!


Thanks @matt-arm. This is certainly quite interesting for the case of fused depthwise and 3x3 conv2d.

As a middle ground, perhaps it is worthwhile to think about simply tiling the output computations and using compute_at at the intermediate tiles. It would of course be less ideal than a rolling buffer, but would still create something similar in effect (as long as recomputation at the edges is minimal).

Given that the receptive field can quickly expand with 3x3 convs, it may not make sense to create very deep pathways.
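The receptive-field growth being referred to here is easy to quantify. A back-of-envelope sketch (the `receptive_field` helper is hypothetical, and stride-1 convolutions are assumed):

```python
# Receptive field of a stack of stride-1 k x k convolutions: each layer
# adds (k - 1) to the field, so deep cascades need increasingly tall
# stripes (or increasingly large rolling buffers) at the input end.

def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([3, 3, 3, 3, 3]))   # 11: five 3x3 convs see 11x11 input
```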

Thanks for the feedback 🙂 Tiling the output computations + compute_at is actually exactly what I’ve been doing to prototype this, and you’re right that for a sufficiently large tile the recompute isn’t particularly bad. I think the rolling buffers aren’t immediately essential, but they would be a very beneficial future optimization.
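How bad the edge recompute is for a given tile size can be estimated with a simple ratio. A back-of-envelope sketch (the `recompute_overhead` helper is hypothetical; a stride-1 k×k consumer and h-row producer stripes are assumed):

```python
# With overlap-and-recompute (no rolling buffer), each h-row producer
# stripe must be extended to h + k - 1 rows for a k x k consumer, so the
# fraction of extra rows recomputed shrinks as the stripe height grows.

def recompute_overhead(stripe_h, k):
    return (stripe_h + k - 1) / stripe_h - 1.0

for h in (6, 12, 24):
    print(h, recompute_overhead(h, 3))   # 3x3 consumer: 0.333, 0.167, 0.083
```

This is the sense in which "for a sufficiently large tile the recompute isn't particularly bad": the overhead falls off roughly as 1/h.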

In our testing/prototyping we have found profitable cascades of 5+ ops, particularly in both mobilenet-type architectures and super-resolution networks. Determining whether continuing a cascade is profitable would be one of the jobs of the cascading algorithm.

My major concern with integrating this is that convolution-type operations always end up on their own in primitive functions. For my experiments I’m currently lowering the whole graph to a single TE graph, but this will not work alongside the current TOPI integration, which expects ‘master ops’ to determine the schedule. In essence I would like to do hierarchical scheduling: first of the cascades, and second of the ops themselves.

Glad to see a proposal with such functionality.

I have to admit that I had also done something similar to what you are proposing (at least at the TE level).

One problem I had was enlargement of the tensor iteration domains ([TVM Scheduling] Split factors missmatch to original extent). This meant many inner loops were mostly “out of original domain”, which was either really not performant or required a huge explosion of “program code” to statically resolve all those ifs.

Another problem you would face is the limitations of FuseOps. Without any changes here, I don’t know how you would “automatically” get those composed stages you will need in order to schedule them.

Nonetheless, for specific configurations of layers, I think what you propose is a reasonable way of proceeding. Rolling buffers would make it even sweeter.

side note

This kind of reminds me of the “graph tuner” infrastructure.

Maybe there should be a more general infrastructure for doing “graph level tuning” (I think the current one only tunes w.r.t. different layout transformations).

From your response to TQ I feel that the current limitation is mainly from the TOPI implementation, which fixes the granularity of the conv2d schedule. This limitation, however, can potentially be resolved by the auto-scheduler (Ansor). In fact, we at AWS are attempting to use Ansor to perform more aggressive scheduling on fused ops (e.g., two conv2ds), and one core idea is exactly leveraging compute_at. We’ve submitted the proposal and initial results of this project to the TVM conference and hopefully we’ll have a chance to share it there.

The limitation is perhaps more with TE/TIR than it is TOPI, in that currently all the scheduling decisions need to happen together. The proposed changes with TensorIR would lift that constraint, but for it to actually be useful the FuseOps pass would have to become a TIR pass rather than a Relay one, otherwise the graph is already split up before we even get the chance to schedule it.

I did consider whether Ansor may be a good fit for this, however there seemed to be a few main issues. Firstly, for us this technique is mostly about reducing peak memory requirements (though it can also benefit performance through reduced bandwidth). Ansor currently optimizes solely for performance, whereas we’d want to keep a Pareto frontier of performance/memory usage. Second, we’d prefer not to use auto-tuning approaches except where absolutely necessary. Finally, this involves considering many different ways to break a graph down into sets of cascaded subgraphs, but my understanding is that Ansor does (or maybe will) only use a fixed set of rules to create the subgraphs.
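To illustrate the Pareto-frontier idea mentioned above (a hypothetical sketch, not part of any existing TVM or Ansor API), candidate schedules could be kept as (cycles, peak_bytes) pairs and filtered so that only non-dominated points survive:

```python
# Hypothetical sketch: keep the (cycles, peak_bytes) schedule candidates
# that are not dominated on both axes, rather than the single
# best-performance choice.

def pareto_frontier(points):
    frontier = []
    for p in sorted(points):              # ascending cycles (then bytes)
        if not frontier or p[1] < frontier[-1][1]:
            frontier.append(p)            # survives only if better on memory
    return frontier

candidates = [(10, 500), (12, 300), (15, 400), (20, 100)]
print(pareto_frontier(candidates))        # [(10, 500), (12, 300), (20, 100)]
```

A memory-constrained target could then pick the fastest frontier point whose peak_bytes fits in its SRAM.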

Perhaps there are solutions to some of these problems, but I currently envisioned that the cascading would break the graph into these interleaved ‘sub-ops’ (acting on sub-tensors) and Ansor could then subsequently optimise each of these for performance.

@tqchen, so this makes me wonder: what are the exact reasons that we need to maintain the relay abstraction up to the graph runtime? As @matt-arm mentions, I quite like the idea of making FuseOps a TIR (the improved one with blocks) pass, because currently it is forward-guessing the semantics of the operators to know whether they are fusible, by categorizing them through obvious features (e.g., fusion of element-wise ops). Moreover, this links back to an old question of mine related to the new TIR proposal: would we benefit from having the whole graph / basic blocks in TIR after relay optimizations?

I think it has to do with the complexity we want to manage at the low level. While it can be attractive to put a whole graph in a single TIR block, it would inevitably increase the amount of effort needed to support scheduling for such blocks.

The relay representation is also useful for forming high-level knowledge, e.g. cutting out a subgraph and using winograd to implement conv2d, which is not available in TIR.

I think we should still rely on the partitioning of the subfunctions at the relay level, but improve the flexibility of the partitioner so that we can extract patterns like depthwise-conv → conv2d, and even the full graph if necessary.

As we can see, there are two extreme points of design:

  • D0: Relying on TIR for everything (good for low-level decisions)
  • D1: Relying mostly on relay, with each single operator scheduled separately (good for high-level decisions)

Both points have their weaknesses. Our current solution is somewhere in the middle, relying on relay to get a subfunction that feeds into TIR. I believe the most important thing is to build a solution that can freely move between D0 and D1. Additionally, it might be helpful to expose TIR-level info at the graph level, e.g. use the compute declaration of an op to derive its semantics and expose them at the graph level.

In this case, having the ability to customize the fusor (via the pattern language) would open up doors for such movement. Such a blended solution is likely going to be more practical.

I should emphasize that while I’ve used two convolutions to illustrate a situation where this technique is useful, it actually generalizes to any operators that have some degree of spatial locality (e.g. max pool). In that sense, we’re not interested in matching particular well-defined patterns, but instead need full vision of the dataflow region of the graph.

My intention is to utilize the BYOC passes (particularly MergeCompilerRegions) to create these large TE graphs for an initially custom target. We can then start having a custom lowering flow that uses the cascading technique as a proof-of-concept even though it will not integrate well with TOPI/AutoTVM. My hope with TensorIR would be that later we may be able to target Ansor at TIR ‘blocks’.


That is a good suggestion, @tqchen! It presents a good compromise for us to see the complexities and opportunities that would open up by being able to express multi-op TIR blocks (hierarchical TIR blocks?).