Couldn't find vta codegen file

I can see a lot of codegen source files under src/target/source for different hardware backends, but I couldn't find a corresponding file for VTA. Could you please let me know which source file is responsible for lowering TIR to the VTA hardware backend code?

It can be understood as two stages. VTA uses a JIT runtime, so the backend code is generated at runtime by /vta/runtime/runtime.cc; the runtime simply translates VTA API calls into machine code for the accelerator.

At compile time, VTA transforms the IR into VTA API calls according to /vta/python/vta/transform.py, which simply rewrites the schedule pragmas into the corresponding functions.
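
For intuition, here is a minimal sketch of the kind of schedule annotations that transform.py rewrites, loosely following the official VTA get-started tutorial (the buffer names are my own): env.dma_copy pragmas are lowered to load/store commands, and env.alu pragmas to vector ALU micro-ops.

import tvm
import vta
from tvm import te

env = vta.get_env()

# Data is already in VTA's packed layout: (1, n_blocks, BATCH, BLOCK_OUT).
shape = (1, 64, env.BATCH, env.BLOCK_OUT)
A = te.placeholder(shape, name="A", dtype=env.acc_dtype)
B = te.placeholder(shape, name="B", dtype=env.acc_dtype)
A_buf = te.compute(shape, lambda *i: A(*i), name="A_buf")                    # DRAM -> SRAM copy
B_buf = te.compute(shape, lambda *i: B(*i), name="B_buf")
C_buf = te.compute(shape, lambda *i: A_buf(*i) + B_buf(*i), name="C_buf")    # element-wise add on the ALU
C = te.compute(shape, lambda *i: C_buf(*i).astype(env.inp_dtype), name="C")  # SRAM -> DRAM copy

s = te.create_schedule(C.op)
s[A_buf].set_scope(env.acc_scope)                    # place the buffers in on-chip SRAM
s[B_buf].set_scope(env.acc_scope)
s[C_buf].set_scope(env.acc_scope)
s[A_buf].pragma(s[A_buf].op.axis[0], env.dma_copy)   # lowered by transform.py to a DMA load
s[B_buf].pragma(s[B_buf].op.axis[0], env.dma_copy)
s[C_buf].pragma(s[C_buf].op.axis[0], env.alu)        # lowered to a VTA ALU micro-op
s[C].pragma(s[C].op.axis[0], env.dma_copy)           # lowered to a DMA store

# vta.lower applies the pragma-lowering passes from /vta/python/vta/transform.py.
print(vta.lower(s, [A, B, C], simple_mode=True))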

Below is the compile-time transformation process of a single-layer convolutional network.

The first step is to create a network model containing only a convolution and print its Relay IR.
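
A minimal sketch of how such a single-conv2d module can be constructed and printed with the Relay API (the variable names are my own; the shapes match the IR below):

import tvm
from tvm import relay

data = relay.var("data", shape=(1, 16, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(16, 16, 3, 3), dtype="float32")
conv = relay.nn.conv2d(data, weight, padding=(1, 1, 1, 1), channels=16, kernel_size=(3, 3))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))
mod = relay.transform.InferType()(mod)  # fills in the /* ty=... */ annotations seen below
print(mod)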

def @main(%data: Tensor[(1, 16, 224, 224), float32], %weight: Tensor[(16, 16, 3, 3), float32]) -> Tensor[(1, 16, 224, 224), float32] {
  nn.conv2d(%data, %weight, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3]) /* ty=Tensor[(1, 16, 224, 224), float32] */
}

The fact that there is only nn.conv2d means it is a single-layer network, which matches our goal of starting from a simple network. Next, we deploy this layer onto the VTA.

The second step is to quantize the network and print the Relay IR.
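
A hedged sketch of this step using the standard relay.quantize API; the global_scale value and the random weight values are placeholders, since the exact qconfig used for the original run is not shown here:

import numpy as np
import tvm
from tvm import relay

# Bind the weight as a constant parameter so quantization can fold it
# (it becomes meta[relay.Constant][0] in the IR below).
params = {"weight": tvm.nd.array(np.random.randn(16, 16, 3, 3).astype("float32"))}

with relay.quantize.qconfig(global_scale=8.0, skip_conv_layers=[]):
    mod = relay.quantize.quantize(mod, params=params)
print(mod)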

def @main(%data: Tensor[(1, 16, 224, 224), float32]) -> Tensor[(1, 16, 224, 224), float32] {
  %0 = multiply(%data, 16f /* ty=float32 */) /* ty=Tensor[(1, 16, 224, 224), float32] */;
  %1 = round(%0) /* ty=Tensor[(1, 16, 224, 224), float32] */;
  %2 = clip(%1, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 16, 224, 224), float32] */;
  %3 = cast(%2, dtype="int8") /* ty=Tensor[(1, 16, 224, 224), int8] */;
  %4 = nn.conv2d(%3, meta[relay.Constant][0] /* ty=Tensor[(16, 16, 3, 3), int8] */, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3], out_dtype="int32") /* ty=Tensor[(1, 16, 224, 224), int32] */;
  %5 = add(%4, 256 /* ty=int32 */) /* ty=Tensor[(1, 16, 224, 224), int32] */;
  %6 = right_shift(%5, 9 /* ty=int32 */) /* ty=Tensor[(1, 16, 224, 224), int32] */;
  %7 = clip(%6, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 16, 224, 224), int32] */;
  %8 = cast(%7, dtype="int8") /* ty=Tensor[(1, 16, 224, 224), int8] */;
  %9 = annotation.stop_fusion(%8) /* ty=Tensor[(1, 16, 224, 224), int8] */;
  %10 = cast(%9, dtype="float32") /* ty=Tensor[(1, 16, 224, 224), float32] */;
  multiply(%10, 0.0625f /* ty=float32 */) /* ty=Tensor[(1, 16, 224, 224), float32] */
}

You can see that quantization adds operators such as multiply, round, clip, cast, add, and right_shift to the network.

The third step is to change the memory layout from NCHW to the packed NCHW1n16c layout and print the Relay IR.
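
This packing is done by VTA's graph_pack helper. A rough sketch, assuming this single-conv network; the start_name/stop_name values are only guesses at where packing should begin and end:

import vta
from vta.top import graph_pack

env = vta.get_env()

# env.BATCH = 1 and env.BLOCK_OUT = 16 give the "1n16c" part of NCHW1n16c.
relay_prog = graph_pack(
    mod["main"],
    env.BATCH,
    env.BLOCK_OUT,
    env.WGT_WIDTH,
    start_name="cast",   # assumed: start packing at the int8 cast on the input side
    stop_name="cast",    # assumed: stop at the cast back on the output side
)
print(relay_prog)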

fn (%data: Tensor[(1, 16, 224, 224), float32]) -> Tensor[(1, 16, 224, 224), float32] {
  %0 = multiply(%data, 16f /* ty=float32 */) /* ty=Tensor[(1, 16, 224, 224), float32] */;
  %1 = reshape(%0, newshape=[1, 1, 1, 16, 224, 224]) /* ty=Tensor[(1, 1, 1, 16, 224, 224), float32] */;
  %2 = transpose(%1, axes=[0, 2, 4, 5, 1, 3]) /* ty=Tensor[(1, 1, 224, 224, 1, 16), float32] */;
  %3 = round(%2) /* ty=Tensor[(1, 1, 224, 224, 1, 16), float32] */;
  %4 = clip(%3, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 1, 224, 224, 1, 16), float32] */;
  %5 = cast(%4, dtype="int8") /* ty=Tensor[(1, 1, 224, 224, 1, 16), int8] */;
  %6 = reshape(meta[relay.Constant][0] /* ty=Tensor[(16, 16, 3, 3), int8] */, newshape=[1, 16, 1, 16, 3, 3]) /* ty=Tensor[(1, 16, 1, 16, 3, 3), int8] */;
  %7 = transpose(%6, axes=[0, 2, 4, 5, 1, 3]) /* ty=Tensor[(1, 1, 3, 3, 16, 16), int8] */;
  %8 = nn.conv2d(%5, %7, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3], data_layout="NCHW1n16c", kernel_layout="OIHW16o16i", out_dtype="int32") /* ty=Tensor[(1, 1, 224, 224, 1, 16), int32] */;
  %9 = add(%8, 256 /* ty=int32 */) /* ty=Tensor[(1, 1, 224, 224, 1, 16), int32] */;
  %10 = right_shift(%9, 9 /* ty=int32 */) /* ty=Tensor[(1, 1, 224, 224, 1, 16), int32] */;
  %11 = clip(%10, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 1, 224, 224, 1, 16), int32] */;
  %12 = cast(%11, dtype="int8") /* ty=Tensor[(1, 1, 224, 224, 1, 16), int8] */;
  %13 = copy(%12) /* ty=Tensor[(1, 1, 224, 224, 1, 16), int8] */;
  %14 = annotation.stop_fusion(%13) /* ty=Tensor[(1, 1, 224, 224, 1, 16), int8] */;
  %15 = transpose(%14, axes=[0, 4, 1, 5, 2, 3]) /* ty=Tensor[(1, 1, 1, 16, 224, 224), int8] */;
  %16 = reshape(%15, newshape=[1, 16, 224, 224]) /* ty=Tensor[(1, 16, 224, 224), int8] */;
  %17 = cast(%16, dtype="float32") /* ty=Tensor[(1, 16, 224, 224), float32] */;
  multiply(%17, 0.0625f /* ty=float32 */) /* ty=Tensor[(1, 16, 224, 224), float32] */
}

You can see that the memory layout transformation adds operators such as reshape, transpose, and copy to the network.

The preparation of the Relay IR is now complete, so we can start compiling.

The fourth step is to add the PrintIR pass and print the IR transformations during compilation; the build step is sketched below.
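
The build itself roughly follows the VTA deploy tutorials. How exactly the PrintIR pass was registered is not shown here and depends on the TVM version, so it is only indicated by a comment:

import vta
import tvm
from tvm import relay

env = vta.get_env()
target = env.target  # ext_dev -keys=vta,cpu -device=vta ...

with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
    # (a PrintIR pass is additionally registered here so that the IR is dumped
    #  after each transformation during relay.build)
    graph, lib, params = relay.build(
        tvm.IRModule.from_expr(relay_prog),
        target=target,
        params=params,
        target_host=env.target_host,
    )

During the build you can see the following warning: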

Cannot find config for target=ext_dev -keys=vta,cpu -device=vta -model=sim_1x16_i8w8a32_15_15_18_17, workload=('conv2d_packed.vta', ('TENSOR', (1, 1, 224, 224, 1, 16), 'int8'), ('TENSOR', (1, 1, 3, 3, 16, 16), 'int8'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW1n16c', 'int32'). A fallback configuration is used, which may bring great performance regression.

This warning reminds us that there is no tuned schedule configuration for the workload conv2d_packed.vta, because no corresponding auto-tuning record exists in TopHub. In fact, the real reason is that I did not add the code that looks up tuning parameters from TopHub; since that is not essential here and our purpose is only to analyze the VTA compilation process, the fallback configuration is fine.
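
For reference, the usual way to pick up pre-tuned TopHub schedules (when they exist) would be to wrap the build in an AutoTVM TopHub context, which I did not do here:

from tvm import autotvm

# Wrapping the build in a TopHub context makes AutoTVM look up pre-tuned
# schedule records for the target before falling back:
with autotvm.tophub.context(env.target):
    with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
        graph, lib, params = relay.build(
            tvm.IRModule.from_expr(relay_prog),
            target=env.target,
            params=params,
            target_host=env.target_host,
        )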

By adding the PrintIR pass, you can see that the compiled module is split into four fused functions.

The first func is fused_transpose_reshape_cast_multiply: on the output side, the layout transformation (transpose and reshape) is done before the final multiply, so they are fused together into one function:

[04:55:57] /home/hht/tvm/src/ir/transform.cc:507: PrintIR():
#[version = "0.0.5"]
primfn(placeholder_1: handle, T_multiply_1: handle) -> ()
  attr = {"global_symbol": "fused_transpose_reshape_cast_multiply", "tir.noalias": True}
  buffers = {T_multiply: Buffer(T_multiply_2: Pointer(float32), float32, [1, 16, 224, 224], []),
             placeholder: Buffer(placeholder_2: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], [])}
  buffer_map = {placeholder_1: placeholder, T_multiply_1: T_multiply} {
  attr [T_multiply] "realize_scope" = "";
  realize(T_multiply, [0:1, 0:16, 0:224, 0:224], True {
    for (ax0.ax1.fused: int32, 0, 16) "parallel" {
      for (ax2: int32, 0, 224) {
        for (ax3.outer: int32, 0, 14) {
          for (ax3.inner: int32, 0, 16) "vectorized" {
            T_multiply[floordiv(ax0.ax1.fused, 16), floormod(ax0.ax1.fused, 16), ax2, (ax3.inner + (ax3.outer*16))] = (cast(float32, placeholder[0, 0, floormod(floordiv(((((((floordiv(ax0.ax1.fused, 16)*16) + floormod(ax0.ax1.fused, 16))*224) + ax2)*224) + (ax3.inner + (ax3.outer*16))), 224), 224), floormod(((((((floordiv(ax0.ax1.fused, 16)*16) + floormod(ax0.ax1.fused, 16))*224) + ax2)*224) + (ax3.inner + (ax3.outer*16))), 224), 0, floormod(floordiv(floordiv(((((((floordiv(ax0.ax1.fused, 16)*16) + floormod(ax0.ax1.fused, 16))*224) + ax2)*224) + (ax3.inner + (ax3.outer*16))), 224), 224), 16)])*0.0625f32)
          }
        }
      }
    }
  })
}

You can see from realize(T_multiply, [0:1, 0:16, 0:224, 0:224]) that the output buffer is back in the four-dimensional NCHW layout used before the packing transformation.

The second func is fused_copy:

[04:55:57] /home/hht/tvm/src/ir/transform.cc:507: PrintIR():
#[version = "0.0.5"]
primfn(placeholder_1: handle, T_identity_1: handle) -> ()
  attr = {"global_symbol": "fused_copy", "tir.noalias": True}
  buffers = {T_identity: Buffer(T_identity_2: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], []),
             placeholder: Buffer(placeholder_2: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], [])}
  buffer_map = {placeholder_1: placeholder, T_identity_1: T_identity} {
  attr [T_identity] "realize_scope" = "";
  realize(T_identity, [0:1, 0:1, 0:224, 0:224, 0:1, 0:16], True {
    for (ax0.ax1.fused.ax2.fused: int32, 0, 224) "parallel" {
      for (ax3: int32, 0, 224) {
        for (ax5.inner: int32, 0, 16) "vectorized" {
          T_identity[floordiv(ax0.ax1.fused.ax2.fused, 224), 0, floormod(ax0.ax1.fused.ax2.fused, 224), ax3, 0, ax5.inner] = placeholder[floordiv(ax0.ax1.fused.ax2.fused, 224), 0, floormod(ax0.ax1.fused.ax2.fused, 224), ax3, 0, ax5.inner]
        }
      }
    }
  })
}

You can see from realize(T_identity, [0:1, 0:1, 0:224, 0:224, 0:1, 0:16]) that, after the layout transformation, the tensor is in the six-dimensional packed NCHWnc layout.

The third func, fused_nn_conv2d_add_right_shift_clip_cast, corresponds to the convolution that actually runs on the VTA hardware:

primfn(placeholder_2: handle, placeholder_3: handle, T_cast_1: handle) -> ()
  attr = {"global_symbol": "fused_nn_conv2d_add_right_shift_clip_cast", "tir.noalias": True}
  buffers = {T_cast: Buffer(T_cast_2: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], []),
             placeholder: Buffer(placeholder_4: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], []),
             placeholder_1: Buffer(placeholder_5: Pointer(int8), int8, [1, 1, 3, 3, 16, 16], [])}
  buffer_map = {placeholder_2: placeholder, placeholder_3: placeholder_1, T_cast_1: T_cast} {
  attr [T_cast] "realize_scope" = "";
  realize(T_cast, [0:1, 0:1, 0:224, 0:224, 0:1, 0:16], True {
    for (ax2.outer: int32, 0, 224) {
      for (ax3.outer: int32, 0, 224) {
        attr [res: Buffer(res_1: Pointer(int32), int32, [1, 1, 224, 224, 1, 16], [])] "realize_scope" = "local.acc_buffer";
        realize(res, [0:1, 0:1, ax2.outer:(ax2.outer + 1), ax3.outer:(ax3.outer + 1), 0:1, 0:16], True {
           {
            attr [[local.acc_buffer: Buffer(local.acc_buffer_1: Pointer(int32), int32, [1, 16], [], elem_offset=local.acc_buffer_elem_offset: int32, scope="local.acc_buffer", align=16, offset_factor=16), res]] "buffer_bind_scope" = @tir.tvm_tuple(0, 1, 0, 1, ax2.outer, 1, ax3.outer, 1, 0, 1, 0, 16, dtype=handle);
            attr [IterVar(vta: int32, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 2;
            attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_uop_scope" = "VTAPushGEMMOp";
            @tir.vta.uop_push(0, 1, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), local.acc_buffer_1, local.acc_buffer_elem_offset, 16, 3, dtype=int32), 0, 0, 0, 0, 0, dtype=int32)
            attr [pad_data: Buffer(pad_data_1: Pointer(int8), int8, [1, 1, 226, 226, 1, 16], [])] "realize_scope" = "local.inp_buffer";
            realize(pad_data, [0:1, 0:1, ax2.outer:(ax2.outer + 3), ax3.outer:(ax3.outer + 3), 0:1, 0:16], True {
              attr [IterVar(i0: int32, (nullptr), "DataPar", "")] "pragma_dma_copy" = 1;
              for (i2: int32, 0, 3) {
                for (i3: int32, 0, 3) {
                  for (i5: int32, 0, 16) {
                    pad_data[0, 0, (i2 + ax2.outer), (i3 + ax3.outer), 0, i5] = @tir.if_then_else((((((i2 + ax2.outer) >= 1) && ((i2 + ax2.outer) < 225)) && ((i3 + ax3.outer) >= 1)) && ((i3 + ax3.outer) < 225)), placeholder[0, 0, ((i2 + ax2.outer) - 1), ((i3 + ax3.outer) - 1), 0, i5], 0i8, dtype=int8)
                  }
                }
              }
              attr [placeholder.local.wgt_buffer: Buffer(placeholder.local.wgt_buffer_1: Pointer(int8), int8, [1, 1, 3, 3, 16, 16], [])] "realize_scope" = "local.wgt_buffer";
              realize(placeholder.local.wgt_buffer, [0:1, 0:1, 0:3, 0:3, 0:16, 0:16], True {
                attr [IterVar(ax0: int32, (nullptr), "DataPar", "")] "pragma_dma_copy" = 1;
                for (ax2: int32, 0, 3) {
                  for (ax3: int32, 0, 3) {
                    for (ax4: int32, 0, 16) {
                      for (ax5: int32, 0, 16) {
                        placeholder.local.wgt_buffer[0, 0, ax2, ax3, ax4, ax5] = placeholder_1[0, 0, ax2, ax3, ax4, ax5]
                      }
                    }
                  }
                }
                for (d_j: int32, 0, 3) {
                  for (d_i: int32, 0, 3) {
                    attr [[local.inp_buffer: Buffer(local.inp_buffer_1: Pointer(int8), int8, [1, 16], [], elem_offset=local.inp_buffer_elem_offset: int32, scope="local.inp_buffer", align=16, offset_factor=16), pad_data]] "buffer_bind_scope" = @tir.tvm_tuple(0, 1, 0, 1, (ax2.outer + d_i), 1, (ax3.outer + d_j), 1, 0, 1, 0, 16, dtype=handle);
                    attr [[local.wgt_buffer: Buffer(local.wgt_buffer_1: Pointer(int8), int8, [16, 16], [], elem_offset=local.wgt_buffer_elem_offset: int32, scope="local.wgt_buffer", align=256, offset_factor=256), placeholder.local.wgt_buffer]] "buffer_bind_scope" = @tir.tvm_tuple(0, 1, 0, 1, d_i, 1, d_j, 1, 0, 16, 0, 16, dtype=handle);
                    attr [[local.acc_buffer, res]] "buffer_bind_scope" = @tir.tvm_tuple(0, 1, 0, 1, ax2.outer, 1, ax3.outer, 1, 0, 1, 0, 16, dtype=handle);
                    attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_scope" = 2;
                    attr [IterVar(vta, (nullptr), "ThreadIndex", "vta")] "coproc_uop_scope" = "VTAPushGEMMOp";
                    @tir.vta.uop_push(0, 0, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), local.acc_buffer_1, local.acc_buffer_elem_offset, 16, 3, dtype=int32), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), local.inp_buffer_1, local.inp_buffer_elem_offset, 16, 1, dtype=int32), @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), local.wgt_buffer_1, local.wgt_buffer_elem_offset, 256, 1, dtype=int32), 0, 0, 0, dtype=int32)
                  }
                }
              })
            })
          }
          attr [T_add: Buffer(T_add_1: Pointer(int32), int32, [1, 1, 224, 224, 1, 16], [])] "realize_scope" = "local.acc_buffer";
          realize(T_add, [0:1, 0:1, ax2.outer:(ax2.outer + 1), ax3.outer:(ax3.outer + 1), 0:1, 0:16], True {
            attr [IterVar(ax0_1: int32, (nullptr), "DataPar", "")] "pragma_alu" = 1;
            for (ax5_1: int32, 0, 16) {
              T_add[0, 0, ax2.outer, ax3.outer, 0, ax5_1] = (res[0, 0, ax2.outer, ax3.outer, 0, ax5_1] + 256)
            }
            attr [T_right_shift: Buffer(T_right_shift_1: Pointer(int32), int32, [1, 1, 224, 224, 1, 16], [])] "realize_scope" = "local.acc_buffer";
            realize(T_right_shift, [0:1, 0:1, ax2.outer:(ax2.outer + 1), ax3.outer:(ax3.outer + 1), 0:1, 0:16], True {
              attr [IterVar(ax0_2: int32, (nullptr), "DataPar", "")] "pragma_alu" = 1;
              for (ax5_2: int32, 0, 16) {
                T_right_shift[0, 0, ax2.outer, ax3.outer, 0, ax5_2] = @tir.shift_right(T_add[0, 0, ax2.outer, ax3.outer, 0, ax5_2], 9, dtype=int32)
              }
              attr [clipA: Buffer(clipA_1: Pointer(int32), int32, [1, 1, 224, 224, 1, 16], [])] "realize_scope" = "local.acc_buffer";
              realize(clipA, [0:1, 0:1, ax2.outer:(ax2.outer + 1), ax3.outer:(ax3.outer + 1), 0:1, 0:16], True {
                attr [IterVar(i0_1: int32, (nullptr), "DataPar", "")] "pragma_alu" = 1;
                for (i5_1: int32, 0, 16) {
                  clipA[0, 0, ax2.outer, ax3.outer, 0, i5_1] = min(T_right_shift[0, 0, ax2.outer, ax3.outer, 0, i5_1], 127)
                }
                attr [clipB: Buffer(clipB_1: Pointer(int32), int32, [1, 1, 224, 224, 1, 16], [])] "realize_scope" = "local.acc_buffer";
                realize(clipB, [0:1, 0:1, ax2.outer:(ax2.outer + 1), ax3.outer:(ax3.outer + 1), 0:1, 0:16], True {
                  attr [IterVar(i0_2: int32, (nullptr), "DataPar", "")] "pragma_alu" = 1;
                  for (i5_2: int32, 0, 16) {
                    clipB[0, 0, ax2.outer, ax3.outer, 0, i5_2] = max(clipA[0, 0, ax2.outer, ax3.outer, 0, i5_2], -127)
                  }
                  attr [IterVar(ax1.inner: int32, (nullptr), "DataPar", "")] "pragma_dma_copy" = 1;
                  for (ax5_3: int32, 0, 16) {
                    T_cast[0, 0, ax2.outer, ax3.outer, 0, ax5_3] = cast(int8, clipB[0, 0, ax2.outer, ax3.outer, 0, ax5_3])
                  }
                })
              })
            })
          })
        })
      }
    }
  })
}

You can see that high-level VTA operations such as tir.vta.uop_push have been inserted; at runtime these are mapped to the corresponding VTA JIT runtime calls (VTAUopPush in /vta/runtime/runtime.cc). This is the part that actually runs on the VTA hardware.

The fourth func, fused_multiply_reshape_transpose_round_clip_cast, is the memory layout transformation on the input side, going from the four-dimensional NCHW layout to the six-dimensional packed layout.

[04:55:57] /home/hht/tvm/src/ir/transform.cc:507: PrintIR():
#[version = "0.0.5"]
primfn(placeholder_1: handle, T_cast_1: handle) -> ()
  attr = {"global_symbol": "fused_multiply_reshape_transpose_round_clip_cast", "tir.noalias": True}
  buffers = {T_cast: Buffer(T_cast_2: Pointer(int8), int8, [1, 1, 224, 224, 1, 16], []),
             placeholder: Buffer(placeholder_2: Pointer(float32), float32, [1, 16, 224, 224], [])}
  buffer_map = {placeholder_1: placeholder, T_cast_1: T_cast} {
  attr [T_cast] "realize_scope" = "";
  realize(T_cast, [0:1, 0:1, 0:224, 0:224, 0:1, 0:16], True {
    for (ax0.ax1.fused.ax2.fused: int32, 0, 224) "parallel" {
      for (ax3: int32, 0, 224) {
        for (ax5.inner: int32, 0, 16) "vectorized" {
          T_cast[floordiv(ax0.ax1.fused.ax2.fused, 224), 0, floormod(ax0.ax1.fused.ax2.fused, 224), ax3, 0, ax5.inner] = cast(int8, max(min(@tir.round((placeholder[0, floormod(floordiv(floordiv(((((((((floordiv(ax0.ax1.fused.ax2.fused, 224) + 0) + 0)*16) + ax5.inner)*224) + floormod(ax0.ax1.fused.ax2.fused, 224))*224) + ax3), 224), 224), 16), floormod(floordiv(((((((((floordiv(ax0.ax1.fused.ax2.fused, 224) + 0) + 0)*16) + ax5.inner)*224) + floormod(ax0.ax1.fused.ax2.fused, 224))*224) + ax3), 224), 224), floormod(((((((((floordiv(ax0.ax1.fused.ax2.fused, 224) + 0) + 0)*16) + ax5.inner)*224) + floormod(ax0.ax1.fused.ax2.fused, 224))*224) + ax3), 224)]*16f32), dtype=float32), 127f32), -127f32))
        }
      }
    }
  })
}

/* For debugging purposes the metadata section has been omitted.
 * If you would like to see the full metadata section you can set the 
 * option to `True` when invoking `astext`. 
 */
***************build finished***************

The above is my weekly report, translated with Google Translate. I hope it is useful for you.

Thank you very much for such a detailed answer. There is one more file inside python/vta, intrin.py, which seems to emit the code for GEMM. I am a little confused about the respective functionality of intrin.py and transform.py that you mentioned. Could you please shed some light on this?