Optimizing data layouts

hey all — i am trying to understand how TVM optimizes for data layout. here’s my current understanding:

  • certain hardware prefers certain layouts for certain operators (e.g. NCHWc for conv2d on x86).
  • TVM can optimize around operators that have “hard” layout constraints (like conv2d) and try to conform the rest of the operators to those layouts, to try and minimize layout changes. i’ve at least seen Relay passes for this; i’m unsure if it also happens at lower layers.

i have some questions:

  • is optimizing data layout a solved problem? if not, where are the current pain points?
  • is it common that you’re trying to conform layouts to fit the expectations of the existing kernels, because writing new kernels with new layouts is a pain? or are people willing to explore new layouts, if they are potentially beneficial?

I ask as I’ve recently been experimenting with a small language which is built to search over the space of layout transformations. It was originally designed for converting neural networks into high-level accelerator designs, but I’m wondering if it might also be useful for exploring how we can optimize data layouts within or between kernel invocations!

1 Like

Here is just my understanding and I believe others can tell more.

Optimizing data layouts is a bit tricky, because it changes the spec (i.e., input and output shapes) of a compute. Since schedule primitives schedule “the given compute” by their definition, optimizing data layout involves not only schedule optimization but also the strategy of selecting alternative computes. As a result, basically every new layout needs a new compute (kernel). For example, people wrote an alternative compute, NCHWc, to replace the NCHW compute for Conv2D on x86, and we need to use a separate pass (i.e., AlterOpLayout) to determine whether the compute replacement is legal.

In the content of optimizing data layout problem, @yzhliu and @kevinthesun made graph tuner that uses dynamic programming to achieve the optimal solution of determining [x] of the NCHW[x]c layouts on x86. So this is basically a solved problem on x86 in terms of research novelty.

On the other hand, IMHO, it would be interesting to discuss how can we extend the schedule primitive to cover data layout optimization, so that 1) we don’t need to manually write a compute for each data layout, and 2) auto-tuning/auto-scheduling framework implementations can be more straightforward. A very rough idea is proposing a new schedule primitive to transform data layouts, and having corresponding mechanisms to 1) insert layout transform ops to maintain the functional correctness, and 2) remove unnecessary layout transform ops to reduce the overheads.

1 Like

Thanks for the quick reply, this is all awesome information.

RE: changing the spec: yes, this makes sense; I’m actually writing a new version of my language which abstracts away layout from algorithm, and treats tensors as sets (rather than lists) of dimensions. I have no idea whether this is actually useful, but it was inspired by exactly the fact that the spec changes even though, at some level, the algorithm is still the same.

RE: the last paragraph:

I.e., computes could be automatically generated? It makes sense that that would be useful. However, I have an even more basic question: how do we determine what layout (and consequently, what compute) we would like automatically generated? Are they just suggested by people who understand the functioning of the hardware?

how do we determine what layout (and consequently, what compute) we would like automatically generated? Are they just suggested by people who understand the functioning of the hardware?

It seems to me that this is a general question to tuning, like we can ask “how do we determine what tile size for a specific axis”? Since we already have AutoTVM and AutoScheduler, as long as the schedule primitive is able to let users specify the desire layout, we can extend them to like:

AutoTVM: Let users specify a list of candidate layouts (with certain hardware knowledge to figure out a concise candidate list) and search for the best one.

AutoScheduler: Add a new mutation rule that mutates the layout in the evolutionary search.