[TE] Using reshape without copy

Hello everyone,

I’m trying to prepare tensor data for a custom accelerator. To make sure the data layout is right for the accelerator, I’ve looked into using the topi.reshape and topi.transpose operators.

However, the reshape operation that comes out of TVM looks quite inefficient to me: it basically generates nested for-loops that perform an element-by-element copy, as also pointed out in this linked thread.

For example, if I have this code:

from tvm import te, topi

A = te.placeholder((16,))
B = topi.reshape(A, (4, 4))
s = te.create_schedule(B.op)

The tensor data in A doesn’t really change; the only things that need to change are the handles to the tensor data:

>>> A.shape
[16]
>>> B.shape
[4, 4]

The fact that the data doesn’t change is also reflected in the schedule that gets created:

primfn(placeholder_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {placeholder: Buffer(placeholder_2: Pointer(float32), float32, [16], [])}
  buffer_map = {placeholder_1: placeholder} {
  attr [T_reshape: Pointer(float32)] "storage_scope" = "global";
  allocate(T_reshape, float32, [16]);
  for (ax0: int32, 0, 4) {
    for (ax1: int32, 0, 4) {
      T_reshape[((ax0*4) + ax1)] = (float32*)placeholder_2[((ax0*4) + ax1)]
    }
  }
}

T_reshape[((ax0*4) + ax1)] = (float32*)placeholder_2[((ax0*4) + ax1)] just copies the value from placeholder_2 to T_reshape, right? Is there a reason why this copy is necessary? I don’t see why it is, and I would much rather only change the handles of the tensors A and B than actually perform the element-wise copy.

I had hoped TVM would optimize this case away (removing the copy), but it actually performs the copy in the compiled C code (I’m using a “C” target). It also generates similar pointless for-loops for a Conv2D operator with a specified padding of 0. (Or am I seeing this the wrong way, and are there useful cases for such a copy?)

Is there any way to tell TVM to not do this copying of tensors?

I’m also linking these threads as they seem relevant, but don’t seem to offer a clear answer:

Thank you very much for your recommendations/comments!


Hi there,

I think what you’re observing is specific to having only a single operator. Generally things like reshape or even potentially transpose can be ‘inlined’ into a more expensive operator, and accordingly no copy takes place. TVM should handle this in the FuseOps pass (and perhaps a later TIR pass too). But when you have the operator on its own there’s nowhere for it to get inlined into and so the whole loop nest is generated.
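For a concrete picture, here is a minimal Relay-level sketch of what I mean (my own toy example, not your code), where the reshape has a downstream add to be fused into:

import tvm
from tvm import relay

x = relay.var("x", shape=(16,), dtype="float32")
y = relay.add(relay.reshape(x, (4, 4)), relay.const(1.0))
mod = tvm.IRModule.from_expr(relay.Function([x], y))

# FuseOps groups the injective reshape with the add, so a single fused
# kernel is generated and no standalone copy loop is left behind
mod = relay.transform.InferType()(mod)
mod = relay.transform.FuseOps(fuse_opt_level=2)(mod)
print(mod)

With only the reshape in the function there is nothing to group it with, and you get exactly the loop nest you posted.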


Hi @mbaret, that is a great point, thanks. I will look into this further, since I will probably have to use topi.transpose right after topi.reshape, which performs a less trivial copy and will indeed need those for-loops. Even if there is no TIR pass that inlines them, I can probably inline them manually, as in the sketch below.
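This is the kind of manual inlining I have in mind (a sketch only; the transpose here stands in for whatever op actually consumes the reshaped data):

import tvm
from tvm import te, topi

A = te.placeholder((16,), name="A")
B = topi.reshape(A, (4, 4))
C = topi.transpose(B, (1, 0))   # the non-trivial data movement happens here anyway

s = te.create_schedule(C.op)
s[B].compute_inline()           # the reshape no longer materializes a T_reshape buffer
print(tvm.lower(s, [A, C], simple_mode=True))

After compute_inline only the transpose loop nest remains, reading directly from A.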

However, I still find it odd that there is no option to disable this copy in case it does not get fused or inlined properly. Especially when a schedule needs a lot of tensorization (for coarse-grained hardware primitives), I think there are far fewer opportunities for these kinds of optimizations. The example of Conv2D with a padding of zero also still confuses me: will TVM fuse/inline those “copy for-loops” with previous operations as well?

Anyway, I think I can move on with my work now, so thank you very much for your quick and helpful answer!

It sounds like we’re working on quite similar problems :) I believe the convolution schedules always inline the padding explicitly, but when you just create the TE compute op directly from TOPI with a default schedule, it will stick around. As to whether that will block fusion of other identity ops, it shouldn’t, since it’s possible to inline multiple consecutive ops.
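If it helps, this is roughly how the padding stage can be inlined by hand when you only have the default TE compute (a sketch, assuming the TOPI NCHW conv2d compute, where the pad stage shows up as the first input tensor of the conv op):

import tvm
from tvm import te, topi

data = te.placeholder((1, 8, 32, 32), name="data")
kernel = te.placeholder((16, 8, 3, 3), name="kernel")
conv = topi.nn.conv2d(data, kernel, strides=1, padding=1, dilation=1)

s = te.create_schedule(conv.op)
pad = conv.op.input_tensors[0]   # the pad_temp stage produced by TOPI
s[pad].compute_inline()          # fold the (possibly zero) padding into the conv loops
print(tvm.lower(s, [data, kernel, conv], simple_mode=True))

The real conv2d schedules in TOPI do essentially this as part of their scheduling.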

I’ve found it incredibly challenging to tensorize with more complex hardware primitives in TVM. For instance, when the primitive is ‘a convolution’, there was no way I could get tensorize to work. The next best option, for me at least, was to write TIR passes directly that find and replace those loop nests. I’ve heard there is ongoing work to improve tensorize in TVM, so hopefully this gets better in the future.
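For what it’s worth, the find-and-replace passes I mean follow roughly this skeleton (the matching logic is just a placeholder; the actual pattern matching against the loop nest is the hard part):

import tvm

@tvm.tir.transform.prim_func_pass(opt_level=0)
def replace_with_hw_primitive(func, mod, ctx):
    def postorder(stmt):
        # inspect the loop nest here and return a replacement statement
        # (e.g. an Evaluate of tir.call_extern to the accelerator runtime),
        # or None to leave the statement unchanged
        return None

    new_body = tvm.tir.stmt_functor.ir_transform(func.body, None, postorder, ["tir.For"])
    return func.with_body(new_body)

# hooked into lowering via, for example:
# tvm.transform.PassContext(config={"tir.add_lower_pass": [(1, replace_with_hw_primitive)]})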

@JosseVanDelm @mbaret I am working on a similar problem too! Hopefully this work will save us: [2101.08458] UNIT: Unifying Tensorized Instruction Compilation (arxiv.org)


Hi @JosseVanDelm and @mbaret ,

I am also experiencing a similar problem; did you find a solution? In my case, the last layer of a neural network needs an input which is a concatenation of multiple tensors, and each of these tensors is a reshape of other tensors. TVM does fuse these reshapes and the concatenate operation together with other operators, but it still generates code that copies data from one buffer to another.

This is the original generated code:

  /* These are the 5 copies I think are not needed */
  for (int32_t ax0_ax1_fused_ax2_fused = 0; ax0_ax1_fused_ax2_fused < 34848; ++ax0_ax1_fused_ax2_fused) {
    ((int8_t*)T_reshape)[ax0_ax1_fused_ax2_fused] = placeholder[ax0_ax1_fused_ax2_fused];
  }
  for (int32_t ax0_ax1_fused_ax2_fused1 = 0; ax0_ax1_fused_ax2_fused1 < 9248; ++ax0_ax1_fused_ax2_fused1) {
    ((int8_t*)T_reshape1)[ax0_ax1_fused_ax2_fused1] = placeholder1[ax0_ax1_fused_ax2_fused1];
  }
  for (int32_t ax0_ax1_fused_ax2_fused2 = 0; ax0_ax1_fused_ax2_fused2 < 2592; ++ax0_ax1_fused_ax2_fused2) {
    ((int8_t*)T_reshape2)[ax0_ax1_fused_ax2_fused2] = placeholder2[ax0_ax1_fused_ax2_fused2];
  }
  for (int32_t ax0_ax1_fused_ax2_fused3 = 0; ax0_ax1_fused_ax2_fused3 < 800; ++ax0_ax1_fused_ax2_fused3) {
    T_reshape3[ax0_ax1_fused_ax2_fused3] = placeholder3[ax0_ax1_fused_ax2_fused3];
  }
  for (int32_t ax0_ax1_fused_ax2_fused4 = 0; ax0_ax1_fused_ax2_fused4 < 288; ++ax0_ax1_fused_ax2_fused4) {
    T_reshape4[ax0_ax1_fused_ax2_fused4] = placeholder4[ax0_ax1_fused_ax2_fused4];
  }
  /* These are the loops that do the concatenation */
  for (int32_t j = 0; j < 34848; ++j) {
    ((int8_t*)concatenate_ext1)[j] = ((int8_t*)T_reshape)[j];
  }
  for (int32_t j1 = 0; j1 < 9248; ++j1) {
    ((int8_t*)concatenate_ext1)[(j1 + 34848)] = ((int8_t*)T_reshape1)[j1];
  }
  for (int32_t j2 = 0; j2 < 2592; ++j2) {
    ((int8_t*)concatenate_ext1)[(j2 + 44096)] = ((int8_t*)T_reshape2)[j2];
  }
  for (int32_t j3 = 0; j3 < 800; ++j3) {
    ((int8_t*)concatenate_ext1)[(j3 + 46688)] = T_reshape3[j3];
  }
  for (int32_t j4 = 0; j4 < 288; ++j4) {
    ((int8_t*)concatenate_ext1)[(j4 + 47488)] = T_reshape4[j4];
  }
  /* This is a cast, which could be fused into the concatenation! */
  for (int32_t ax0_ax1_fused_ax2_fused5 = 0; ax0_ax1_fused_ax2_fused5 < 47776; ++ax0_ax1_fused_ax2_fused5) {
    ((int32_t*)T_cast)[ax0_ax1_fused_ax2_fused5] = ((int32_t)((int8_t*)concatenate_ext1)[ax0_ax1_fused_ax2_fused5]);
  }

This is what I thought TVM would generate (I modified the code by hand; notice how the concatenation and the cast are done without the unnecessary copies, which gives a ~16 ms improvement for this fused operator):

  for (int32_t j = 0; j < 34848; ++j) {
    ((int32_t*)T_cast)[j] = (int32_t)(placeholder[j]);
  }
  for (int32_t j1 = 0; j1 < 9248; ++j1) {
    ((int32_t*)T_cast)[(j1 + 34848)] = (int32_t)(placeholder1[j1]);
  }
  for (int32_t j2 = 0; j2 < 2592; ++j2) {
    ((int32_t*)T_cast)[(j2 + 44096)] = (int32_t)(placeholder2[j2]);
  }
  for (int32_t j3 = 0; j3 < 800; ++j3) {
    ((int32_t*)T_cast)[(j3 + 46688)] = (int32_t)(placeholder3[j3]);
  }
  for (int32_t j4 = 0; j4 < 288; ++j4) {
    ((int32_t*)T_cast)[(j4 + 47488)] = (int32_t)(placeholder4[j4]);
  }

Hi fPecc,

We are currently using an entirely different approach for mapping onto the accelerator, so this issue is not something I have looked into further…

Sorry!