Couldn't find vta codegen file

The transformation process of a single-layer convolutional network at compile time.

The first step is to create a network model containing only a convolution and print its Relay IR.

def @main(%data: Tensor[(1, 16, 224, 224), float32], %weight: Tensor[(16, 16, 3, 3), float32]) -> Tensor[(1, 16, 224, 224), float32] {
  nn.conv2d(%data, %weight, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3]) /* ty=Tensor[(1, 16, 224, 224), float32] */
}

The presence of only nn.conv2d confirms this is a single-layer network, which is deliberate: we want to start from the simplest possible network. The next goal is to deploy this layer on the VTA.

The second step is to quantize the network and print the Relay IR.

def @main(%data: Tensor[(1, 16, 224, 224), float32]) -> Tensor[(1, 16, 224, 224), float32] {
  %0 = multiply(%data, 16f /* ty=float32 */) /* ty=Tensor[(1, 16, 224, 224), float32] */;
  %1 = round(%0) /* ty=Tensor[(1, 16, 224, 224), float32] */;
  %2 = clip(%1, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 16, 224, 224), float32] */;
  %3 = cast(%2, dtype="int8") /* ty=Tensor[(1, 16, 224, 224), int8] */;
  %4 = nn.conv2d(%3, meta[relay.Constant][0] /* ty=Tensor[(16, 16, 3, 3), int8] */, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3], out_dtype="int32") /* ty=Tensor[(1, 16, 224, 224), int32] */;
  %5 = add(%4, 256 /* ty=int32 */) /* ty=Tensor[(1, 16, 224, 224), int32] */;
  %6 = right_shift(%5, 9 /* ty=int32 */) /* ty=Tensor[(1, 16, 224, 224), int32] */;
  %7 = clip(%6, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 16, 224, 224), int32] */;
  %8 = cast(%7, dtype="int8") /* ty=Tensor[(1, 16, 224, 224), int8] */;
  %9 = annotation.stop_fusion(%8) /* ty=Tensor[(1, 16, 224, 224), int8] */;
  %10 = cast(%9, dtype="float32") /* ty=Tensor[(1, 16, 224, 224), float32] */;
  multiply(%10, 0.0625f /* ty=float32 */) /* ty=Tensor[(1, 16, 224, 224), float32] */
}

You can see that quantization adds operators such as multiply, round, clip, cast, add, and right_shift to the network.

The third step is to change the memory layout from NCHW to the packed NCHW1n16c layout and print the Relay IR.

fn (%data: Tensor[(1, 16, 224, 224), float32]) -> Tensor[(1, 16, 224, 224), float32] {
  %0 = multiply(%data, 16f /* ty=float32 */) /* ty=Tensor[(1, 16, 224, 224), float32] */;
  %1 = reshape(%0, newshape=[1, 1, 1, 16, 224, 224]) /* ty=Tensor[(1, 1, 1, 16, 224, 224), float32] */;
  %2 = transpose(%1, axes=[0, 2, 4, 5, 1, 3]) /* ty=Tensor[(1, 1, 224, 224, 1, 16), float32] */;
  %3 = round(%2) /* ty=Tensor[(1, 1, 224, 224, 1, 16), float32] */;
  %4 = clip(%3, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 1, 224, 224, 1, 16), float32] */;
  %5 = cast(%4, dtype="int8") /* ty=Tensor[(1, 1, 224, 224, 1, 16), int8] */;
  %6 = reshape(meta[relay.Constant][0] /* ty=Tensor[(16, 16, 3, 3), int8] */, newshape=[1, 16, 1, 16, 3, 3]) /* ty=Tensor[(1, 16, 1, 16, 3, 3), int8] */;
  %7 = transpose(%6, axes=[0, 2, 4, 5, 1, 3]) /* ty=Tensor[(1, 1, 3, 3, 16, 16), int8] */;
  %8 = nn.conv2d(%5, %7, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3], data_layout="NCHW1n16c", kernel_layout="OIHW16o16i", out_dtype="int32") /* ty=Tensor[(1, 1, 224, 224, 1, 16), int32] */;
  %9 = add(%8, 256 /* ty=int32 */) /* ty=Tensor[(1, 1, 224, 224, 1, 16), int32] */;
  %10 = right_shift(%9, 9 /* ty=int32 */) /* ty=Tensor[(1, 1, 224, 224, 1, 16), int32] */;
  %11 = clip(%10, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 1, 224, 224, 1, 16), int32] */;
  %12 = cast(%11, dtype="int8") /* ty=Tensor[(1, 1, 224, 224, 1, 16), int8] */;
  %13 = copy(%12) /* ty=Tensor[(1, 1, 224, 224, 1, 16), int8] */;
  %14 = annotation.stop_fusion(%13) /* ty=Tensor[(1, 1, 224, 224, 1, 16), int8] */;
  %15 = transpose(%14, axes=[0, 4, 1, 5, 2, 3]) /* ty=Tensor[(1, 1, 1, 16, 224, 224), int8] */;
  %16 = reshape(%15, newshape=[1, 16, 224, 224]) /* ty=Tensor[(1, 16, 224, 224), int8] */;
  %17 = cast(%16, dtype="float32") /* ty=Tensor[(1, 16, 224, 224), float32] */;
  multiply(%17, 0.0625f /* ty=float32 */) /* ty=Tensor[(1, 16, 224, 224), float32] */
}

You can see that the memory layout transformation adds operators such as reshape, transpose, and copy to the network.
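The packing can be reproduced with plain NumPy to see exactly what the reshape/transpose pair in the IR does (a sketch independent of TVM; a small spatial size is used for clarity):

```python
import numpy as np

# NCHW activation: batch 1, 16 channels (small H/W instead of 224 for clarity).
x = np.arange(1 * 16 * 8 * 8).reshape(1, 16, 8, 8)

# Pack NCHW -> NCHW1n16c: split N into (N/1, 1) and C into (C/16, 16),
# then move the inner n and c factors innermost, exactly matching
# reshape(newshape=[1, 1, 1, 16, H, W]) + transpose(axes=[0, 2, 4, 5, 1, 3]).
packed = x.reshape(1, 1, 1, 16, 8, 8).transpose(0, 2, 4, 5, 1, 3)
print(packed.shape)  # (1, 1, 8, 8, 1, 16)

# Unpack back, matching transpose(axes=[0, 4, 1, 5, 2, 3]) + reshape in the IR.
unpacked = packed.transpose(0, 4, 1, 5, 2, 3).reshape(1, 16, 8, 8)
assert (unpacked == x).all()  # packing then unpacking is the identity
```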

With the Relay IR preparation complete, compilation can start.

The fourth step is to add the PrintIR pass to print the transformations during compilation. You will see a warning:

Cannot find config for target=ext_dev -keys=vta,cpu -device=vta -model=sim_1x16_i8w8a32_15_15_18_17, workload=('conv2d_packed.vta', ('TENSOR', (1, 1, 224, 224, 1, 16), 'int8'), ('TENSOR', (1, 1, 3, 3, 16, 16), 'int8'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32'). A fallback configuration is used, which may bring great performance regression.

This warning says that the network has no tuned schedule parameters for the workload conv2d_packed.vta, because no matching auto-tuning entry exists in TopHub. In fact, the real reason is that I did not add the line that searches TopHub for tuning parameters, since none are available anyway; our purpose here is only to analyze the VTA compilation process.

With the PrintIR pass added, you can see that the compiled module is split into four fused functions.

The first func, fused_transpose_reshape_cast_multiply, corresponds to the tail of the graph: it undoes the layout transformation and performs the final dequantizing multiply.

[04:55:57] /home/hht/tvm/src/ir/transform.cc:507: PrintIR():
#[version = "0.0.5"]
primfn(placeholder_1: handle, T_multiply_1: handle) -> ()
  attr = {"global_symbol": "fused_transpose_reshape_cast_multiply", "tir.noalias": True}
  buffers = {T_multiply: Buffer(T_multiply_2: Pointer(float32), float32, [1, 16, 224, 224], []),
             placeholder: Buffer(placeholder_2: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], [])}
  buffer_map = {placeholder_1: placeholder, T_multiply_1: T_multiply} {
  attr [T_multiply] "realize_scope" = "";
  realize(T_multiply, [0:1, 0:16, 0:224, 0:224], True {
    for (ax0.ax1.fused: int32, 0, 16) "parallel" {
      for (ax2: int32, 0, 224) {
        for (ax3.outer: int32, 0, 14) {
          for (ax3.inner: int32, 0, 16) "vectorized" {
            T_multiply[floordiv(ax0.ax1.fused, 16), floormod(ax0.ax1.fused, 16), ax2, (ax3.inner + (ax3.outer*16))] = (cast(float32, placeholder[0, 0, floormod(floordiv(((((((floordiv(ax0.ax1.fused, 16)*16) + floormod(ax0.ax1.fused, 16))*224) + ax2)*224) + (ax3.inner + (ax3.outer*16))), 224), 224), floormod(((((((floordiv(ax0.ax1.fused, 16)*16) + floormod(ax0.ax1.fused, 16))*224) + ax2)*224) + (ax3.inner + (ax3.outer*16))), 224), 0, floormod(floordiv(floordiv(((((((floordiv(ax0.ax1.fused, 16)*16) + floormod(ax0.ax1.fused, 16))*224) + ax2)*224) + (ax3.inner + (ax3.outer*16))), 224), 224), 16)])*0.0625f32)
          }
        }
      }
    }
  })
}

You can see from realize(T_multiply, [0:1, 0:16, 0:224, 0:224]) that the output is back in the four-dimensional NCHW form it had before the layout transformation.

The second func, fused_copy:

[04:55:57] /home/hht/tvm/src/ir/transform.cc:507: PrintIR():
#[version = "0.0.5"]
primfn(placeholder_1: handle, T_identity_1: handle) -> ()
  attr = {"global_symbol": "fused_copy", "tir.noalias": True}
  buffers = {T_identity: Buffer(T_identity_2: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], []),
             placeholder: Buffer(placeholder_2: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], [])}
  buffer_map = {placeholder_1: placeholder, T_identity_1: T_identity} {
  attr [T_identity] "realize_scope" = "";
  realize(T_identity, [0:1, 0:1, 0:224, 0:224, 0:1, 0:16], True {
    for (ax0.ax1.fused.ax2.fused: int32, 0, 224) "parallel" {
      for (ax3: int32, 0, 224) {
        for (ax5.inner: int32, 0, 16) "vectorized" {
          T_identity[floordiv(ax0.ax1.fused.ax2.fused, 224), 0, floormod(ax0.ax1.fused.ax2.fused, 224), ax3, 0, ax5.inner] = placeholder[floordiv(ax0.ax1.fused.ax2.fused, 224), 0, floormod(ax0.ax1.fused.ax2.fused, 224), ax3, 0, ax5.inner]
        }
      }
    }
  })
}

It can be seen from realize(T_identity, [0:1, 0:1, 0:224, 0:224, 0:1, 0:16]) that, after the layout transformation, the tensor is in the six-dimensional NCHWnc form.

The third func, fused_nn_conv2d_add_right_shift_clip_cast, corresponds to the convolution executed on the VTA hardware:

primfn(placeholder_2: handle, placeholder_3: handle, T_cast_1: handle) -> ()
  attr = {"global_symbol": "fused_nn_conv2d_add_right_shift_clip_cast", "tir.noalias": True}
  buffers = {T_cast: Buffer(T_cast_2: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], []),
             placeholder: Buffer(placeholder_4: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], []),
             placeholder_1: Buffer(placeholder_5: Pointer(int8), int8, [1, 1, 3, 3, 16, 16], [])}
  buffer_map = {placeholder_2: placeholder, placeholder_3: placeholder_1, T_cast_1: T_cast} {
  attr [T_cast] "realize_scope" = "";
  realize(T_cast, [0:1, 0:1, 0:224, 0:224, 0:1, 0:16], True {
    for (ax2.outer: int32, 0, 224) {
      for (ax3.outer: int32, 0, 224) {
        attr [res: Buffer(res_1: Pointer(int32), int32, [1, 1, 224, 224, 1, 16], [])] "realize_scope" = "local.acc_buffer";
        realize(res, [0:1, 0:1, ax2.outer:(ax2.outer + 1), ax3.outer:(ax3.outer + 1), 0:1, 0:16], True {
           {
            attr [[local.acc_buffer: Buffer(local.acc_buffer_1: Pointer(int32), int32, [1, 16], [], elem_offset=local.acc_buffer_elem_offset: int32, scope="local.acc_buffer", align=16, offset_factor=16), res]] "buffer_bind_scope" = @tir.tvm_tuple(0, 1, 0, 1, ax2.outer, 1, ax3.outer, 1, 0, 1, 0, 16, dtype=handle);
            attr [IterVar(vta: int32, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 2;
            attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_uop_scope" = "VTAPushGEMMOp";
            @tir.vta.uop_push(0, 1, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), local.acc_buffer_1, local.acc_buffer_elem_offset, 16, 3, dtype=int32), 0, 0, 0, 0, 0, dtype=int32)
            attr [pad_data: Buffer(pad_data_1: Pointer(int8), int8, [1, 1, 226, 226, 1, 16], [])] "realize_scope" = "local.inp_buffer";
            realize(pad_data, [0:1, 0:1, ax2.outer:(ax2.outer + 3), ax3.outer:(ax3.outer + 3), 0:1, 0:16], True {
              attr [IterVar(i0: int32, (nullptr), "DataPar", "")] "pragma_dma_copy" = 1;
              for (i2: int32, 0, 3) {
                for (i3: int32, 0, 3) {
                  for (i5: int32, 0, 16) {
                    pad_data[0, 0, (i2 + ax2.outer), (i3 + ax3.outer), 0, i5] = @tir.if_then_else((((((i2 + ax2.outer) >= 1) && ((i2 + ax2.outer) < 225)) && ((i3 + ax3.outer) >= 1)) && ((i3 + ax3.outer) < 225)), placeholder[0, 0, ((i2 + ax2.outer) - 1), ((i3 + ax3.outer) - 1), 0, i5], 0i8, dtype=int8)
                  }
                }
              }
              attr [placeholder.local.wgt_buffer: Buffer(placeholder.local.wgt_buffer_1: Pointer(int8), int8, [1, 1, 3, 3, 16, 16], [])] "realize_scope" = "local.wgt_buffer";
              realize(placeholder.local.wgt_buffer, [0:1, 0:1, 0:3, 0:3, 0:16, 0:16], True {
                attr [IterVar(ax0: int32, (nullptr), "DataPar", "")] "pragma_dma_copy" = 1;
                for (ax2: int32, 0, 3) {
                  for (ax3: int32, 0, 3) {
                    for (ax4: int32, 0, 16) {
                      for (ax5: int32, 0, 16) {
                        placeholder.local.wgt_buffer[0, 0, ax2, ax3, ax4, ax5] = placeholder_1[0, 0, ax2, ax3, ax4, ax5]
                      }
                    }
                  }
                }
                for (d_j: int32, 0, 3) {
                  for (d_i: int32, 0, 3) {
                    attr [[local.inp_buffer: Buffer(local.inp_buffer_1: Pointer(int8), int8, [1, 16], [], elem_offset=local.inp_buffer_elem_offset: int32, scope="local.inp_buffer", align=16, offset_factor=16), pad_data]] "buffer_bind_scope" = @tir.tvm_tuple(0, 1, 0, 1, (ax2.outer + d_i), 1, (ax3.outer + d_j), 1, 0, 1, 0, 16, dtype=handle);
                    attr [[local.wgt_buffer: Buffer(local.wgt_buffer_1: Pointer(int8), int8, [16, 16], [], elem_offset=local.wgt_buffer_elem_offset: int32, scope="local.wgt_buffer", align=256, offset_factor=256), placeholder.local.wgt_buffer]] "buffer_bind_scope" = @tir.tvm_tuple(0, 1, 0, 1, d_i, 1, d_j, 1, 0, 16, 0, 16, dtype=handle);
                    attr [[local.acc_buffer, res]] "buffer_bind_scope" = @tir.tvm_tuple(0, 1, 0, 1, ax2.outer, 1, ax3.outer, 1, 0, 1, 0, 16, dtype=handle);
                    attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 2;
                    attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_uop_scope" = "VTAPushGEMMOp";
                    @tir.vta.uop_push(0, 0, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), local.acc_buffer_1, local.acc_buffer_elem_offset, 16, 3, dtype=int32), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), local.inp_buffer_1, local.inp_buffer_elem_offset, 16, 1, dtype=int32), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), local.wgt_buffer_1, local.wgt_buffer_elem_offset, 256, 1, dtype=int32), 0, 0, 0, dtype=int32)
                  }
                }
              })
            })
          }
          attr [T_add: Buffer(T_add_1: Pointer(int32), int32, [1, 1, 224, 224, 1, 16], [])] "realize_scope" = "local.acc_buffer";
          realize(T_add, [0:1, 0:1, ax2.outer:(ax2.outer + 1), ax3.outer:(ax3.outer + 1), 0:1, 0:16], True {
            attr [IterVar(ax0_1: int32, (nullptr), "DataPar", "")] "pragma_alu" = 1;
            for (ax5_1: int32, 0, 16) {
              T_add[0, 0, ax2.outer, ax3.outer, 0, ax5_1] = (res[0, 0, ax2.outer, ax3.outer, 0, ax5_1] + 256)
            }
            attr [T_right_shift: Buffer(T_right_shift_1: Pointer(int32), int32, [1, 1, 224, 224, 1, 16], [])] "realize_scope" = "local.acc_buffer";
            realize(T_right_shift, [0:1, 0:1, ax2.outer:(ax2.outer + 1), ax3.outer:(ax3.outer + 1), 0:1, 0:16], True {
              attr [IterVar(ax0_2: int32, (nullptr), "DataPar", "")] "pragma_alu" = 1;
              for (ax5_2: int32, 0, 16) {
                T_right_shift[0, 0, ax2.outer, ax3.outer, 0, ax5_2] = @tir.shift_right(T_add[0, 0, ax2.outer, ax3.outer, 0, ax5_2], 9, dtype=int32)
              }
              attr [clipA: Buffer(clipA_1: Pointer(int32), int32, [1, 1, 224, 224, 1, 16], [])] "realize_scope" = "local.acc_buffer";
              realize(clipA, [0:1, 0:1, ax2.outer:(ax2.outer + 1), ax3.outer:(ax3.outer + 1), 0:1, 0:16], True {
                attr [IterVar(i0_1: int32, (nullptr), "DataPar", "")] "pragma_alu" = 1;
                for (i5_1: int32, 0, 16) {
                  clipA[0, 0, ax2.outer, ax3.outer, 0, i5_1] = min(T_right_shift[0, 0, ax2.outer, ax3.outer, 0, i5_1], 127)
                }
                attr [clipB: Buffer(clipB_1: Pointer(int32), int32, [1, 1, 224, 224, 1, 16], [])] "realize_scope" = "local.acc_buffer";
                realize(clipB, [0:1, 0:1, ax2.outer:(ax2.outer + 1), ax3.outer:(ax3.outer + 1), 0:1, 0:16], True {
                  attr [IterVar(i0_2: int32, (nullptr), "DataPar", "")] "pragma_alu" = 1;
                  for (i5_2: int32, 0, 16) {
                    clipB[0, 0, ax2.outer, ax3.outer, 0, i5_2] = max(clipA[0, 0, ax2.outer, ax3.outer, 0, i5_2], -127)
                  }
                  attr [IterVar(ax1.inner: int32, (nullptr), "DataPar", "")] "pragma_dma_copy" = 1;
                  for (ax5_3: int32, 0, 16) {
                    T_cast[0, 0, ax2.outer, ax3.outer, 0, ax5_3] = cast(int8, clipB[0, 0, ax2.outer, ax3.outer, 0, ax5_3])
                  }
                })
              })
            })
          })
        })
      }
    }
  })
}

You can see that high-level VTA intrinsics such as tir.vta.uop_push have been inserted; these are lowered to JIT calls into the VTA runtime, and this func is the part that actually runs on the VTA hardware.

The fourth func, fused_multiply_reshape_transpose_round_clip_cast, quantizes the input and performs the memory layout transformation from the four-dimensional NCHW form to the six-dimensional packed form.

[04:55:57] /home/hht/tvm/src/ir/transform.cc:507: PrintIR():
#[version = "0.0.5"]
primfn(placeholder_1: handle, T_cast_1: handle) -> ()
  attr = {"global_symbol": "fused_multiply_reshape_transpose_round_clip_cast", "tir.noalias": True}
  buffers = {T_cast: Buffer(T_cast_2: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], []),
             placeholder: Buffer(placeholder_2: Pointer(float32), float32, [1, 16, 224, 224], [])}
  buffer_map = {placeholder_1: placeholder, T_cast_1: T_cast} {
  attr [T_cast] "realize_scope" = "";
  realize(T_cast, [0:1, 0:1, 0:224, 0:224, 0:1, 0:16], True {
    for (ax0.ax1.fused.ax2.fused: int32, 0, 224) "parallel" {
      for (ax3: int32, 0, 224) {
        for (ax5.inner: int32, 0, 16) "vectorized" {
          T_cast[floordiv(ax0.ax1.fused.ax2.fused, 224), 0, floormod(ax0.ax1.fused.ax2.fused, 224), ax3, 0, ax5.inner] = cast(int8, max(min(@tir.round((placeholder[0, floormod(floordiv(floordiv(((((((((floordiv(ax0.ax1.fused.ax2.fused, 224) + 0) + 0)*16) + ax5.inner)*224) + floormod(ax0.ax1.fused.ax2.fused, 224))*224) + ax3), 224), 224), 16), floormod(floordiv(((((((((floordiv(ax0.ax1.fused.ax2.fused, 224) + 0) + 0)*16) + ax5.inner)*224) + floormod(ax0.ax1.fused.ax2.fused, 224))*224) + ax3), 224), 224), floormod(((((((((floordiv(ax0.ax1.fused.ax2.fused, 224) + 0) + 0)*16) + ax5.inner)*224) + floormod(ax0.ax1.fused.ax2.fused, 224))*224) + ax3), 224)]*16f32), dtype=float32), 127f32), -127f32))
        }
      }
    }
  })
}

/* For debugging purposes the metadata section has been omitted.
 * If you would like to see the full metadata section you can set the 
 * option to `True` when invoking `astext`. 
 */
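The element-wise chain in this fourth fused func (multiply by the input scale, round, clip to ±127, cast to int8) can be mimicked in plain NumPy as a sanity check (a sketch; the scale of 16 is taken from the quantized Relay IR above):

```python
import numpy as np

def quantize_activation(x, scale=16.0):
    # multiply -> round -> clip -> cast, as in the fused func above
    return np.clip(np.round(x * scale), -127, 127).astype(np.int8)

x = np.array([[-9.0, -0.03, 0.0, 0.5, 9.0]], dtype=np.float32)
q = quantize_activation(x)
print(q.tolist())  # [[-127, 0, 0, 8, 127]]
```

Values whose scaled magnitude exceeds 127 saturate, which is why the clip appears before the int8 cast in the IR.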
***************build finished***************