VTA example: Matrix Multiply

There is a piece of code in this example that I can't understand, and I need some help:

s[C_buf].reorder(
    ko,
    s[C_buf].op.axis[0],
    s[C_buf].op.axis[1],
    s[C_buf].op.axis[2],
    s[C_buf].op.axis[3],
    ki)
s[C_buf].tensorize(s[C_buf].op.axis[2], env.gemm)

Two questions:

  1. What is reorder used for?
  2. What does tensorize do?

Reorder is used to permute the loop axes of a loop nest.
You may remember the popular CPU matrix multiply example used to introduce locality:

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
     for (k = 0; k < N; k++)
       // loop body

The observation is that when optimizing for locality, it may make sense to reorder the loop nest, perhaps into something like:

for (i = 0; i < N; i++)
  for (k = 0; k < N; k++)
     for (j = 0; j < N; j++)
       // loop body

The reorder function applies this transformation: the loops are nested in the order in which the axes are passed as arguments, outermost first.
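To make the loop permutation concrete, here is a minimal plain-Python sketch (not TVM code). The function names and the size N = 4 are made up for illustration. It only shows that swapping the j and k loops changes the traversal order, not the result:

```python
N = 4  # hypothetical matrix size, chosen only for illustration


def matmul_ijk(a, b):
    """Matrix multiply with the original i, j, k loop order."""
    c = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_ikj(a, b):
    """Same computation with the j and k loops swapped (i, k, j)."""
    c = [[0] * N for _ in range(N)]
    for i in range(N):
        for k in range(N):
            for j in range(N):
                c[i][j] += a[i][k] * b[k][j]
    return c


# Small integer inputs so the two results can be compared exactly.
a = [[i + j for j in range(N)] for i in range(N)]
b = [[i * j + 1 for j in range(N)] for i in range(N)]
same = matmul_ijk(a, b) == matmul_ikj(a, b)
```

In the i, k, j version the innermost loop walks along a row of b and a row of c, which is the kind of access pattern the locality argument is about.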

Tensorize is analogous to vectorization but for more general dense data shapes.
For example, while vectorization may do something like change

for (i = 0; i < 64; i++)
   C[i] = A[i] + B[i];

to

for (i = 0; i < 64; i += 8)
    _my_8wide_vector_add(i, A, B, C); //operate on multiple elements at once
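The same idea as a small plain-Python sketch, in case it helps. The helper vector_add_8wide is a hypothetical stand-in for a real 8-lane vector instruction, so this only models the loop structure, not the hardware speedup:

```python
LANES = 8  # assumed vector width, matching the 8-wide example above


def vector_add_8wide(i, a, b, c):
    """Hypothetical stand-in for one 8-lane vector add on elements i..i+7."""
    c[i:i + LANES] = [x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES])]


def add_scalar(a, b):
    """Original loop: one element per iteration."""
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c


def add_vectorized(a, b):
    """Vectorized loop: one 8-element chunk per iteration."""
    assert len(a) % LANES == 0, "length must be a multiple of the vector width"
    c = [0] * len(a)
    for i in range(0, len(a), LANES):
        vector_add_8wide(i, a, b, c)
    return c


a = list(range(64))
b = [2 * x for x in range(64)]
```

The loop now runs 64 / 8 = 8 times instead of 64, and each iteration hands a whole chunk to the helper.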

Tensorize can be used for a transformation like:

for (i = 0; i < 4; i++)
    for (ii = 0 ; ii < 4; ii++) // 4x4 outer product outer loop
      for (jj = 0; jj < 4; jj++) // 4x4 outer product inner loop
        C[ii][jj] += A[i][ii] * B[jj][i];

to

for (i = 0; i < 4; i++)
    _my_4x4outer_product_function(A, B, C, i);
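Here is the same transformation as a plain-Python sketch, not TVM code. The helper outer_product_4x4 is hypothetical and plays the role of the hardware intrinsic. It uses the same indexing as the loops above:

```python
N = 4  # tile size from the example above


def outer_product_4x4(a, b, c, i):
    """Hypothetical intrinsic: accumulate one 4x4 outer product into c."""
    for ii in range(N):
        for jj in range(N):
            c[ii][jj] += a[i][ii] * b[jj][i]


def accumulate_loops(a, b):
    """Original form: all three loops are written out explicitly."""
    c = [[0] * N for _ in range(N)]
    for i in range(N):
        for ii in range(N):
            for jj in range(N):
                c[ii][jj] += a[i][ii] * b[jj][i]
    return c


def accumulate_tensorized(a, b):
    """Tensorized form: the two inner loops become a single call."""
    c = [[0] * N for _ in range(N)]
    for i in range(N):
        outer_product_4x4(a, b, c, i)
    return c


a = [[r + 2 * col for col in range(N)] for r in range(N)]
b = [[3 * r - col for col in range(N)] for r in range(N)]
```

Only the outer i loop stays visible. In the VTA snippet from the question, env.gemm plays the role of the helper, and the axis passed to tensorize marks where the replaced inner loops begin.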

Thank you very, very much.

I have another question, which may not be easy to answer. I really don't know how to approach it, so I would be very happy if you could give me a few tips.

TVM has many complicated data structures, and I don't know how to read them (I have been reading the code for about two months, but …), so I'm very confused now …

In file vta/tutorials/matrix_multiply_opt.py line 245, there is:

s[res_gemm].reorder(ic_out, b_inn, oc_inn, ic_inn, b_tns, oc_tns, ic_tns)

Does this code mean that s[res_gemm] has 7 dimensions, one per axis passed to reorder? If so, then when I print s[res_gemm].op.axis, I get:

[iter_var(bo, Range(min=0, extent=1)), iter_var(co, Range(min=0, extent=64)), iter_var(bi, Range(min=0, extent=1)), iter_var(ci, Range(min=0, extent=16))]

It still has 4 dimensions. Why?

Thank you very much~

The axes marked as _tns are converted into a single VTAUop.

It’s kind of like how, when you’re using a GPU, you parallelize the outer loops over all of the CUDA cores, so those loops go away. In VTA, the inner loops, which would otherwise perform e.g. the multiplications element by element, are converted into calls to the FPGA GEMM core, which performs those operations all at once.
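On the 4 versus 7 mismatch: in TVM, op.axis lists only the output (spatial) axes of a compute op. Reduction axes are kept separately in op.reduce_axis, and split creates new loop variables without adding tensor dimensions. My reading, which is an assumption about the tutorial rather than something stated in this thread, is that the ic_* axes come from the reduction. The plain-Python sketch below (not TVM code; all sizes and names are hypothetical) shows a 7-deep loop nest that still writes a 4-dimensional output:

```python
import itertools

BO, CO, BI, CI = 1, 3, 1, 4  # extents of the four output (spatial) axes
KO, KI = 4, 4                # extents of the two reduction axes
FACTOR = 2                   # split factor for the outer reduction axis


def make_operands():
    """Small integer inputs so results can be compared exactly."""
    data = [[[[(bo + 2 * ko + 3 * bi + ki) % 7 for ki in range(KI)]
              for bi in range(BI)] for ko in range(KO)] for bo in range(BO)]
    weight = [[[[(co + ko + 2 * ci + 3 * ki) % 5 for ki in range(KI)]
                for ci in range(CI)] for ko in range(KO)] for co in range(CO)]
    return data, weight


def zeros():
    """4-D output indexed as out[bo][co][bi][ci]."""
    return [[[[0] * CI for _ in range(BI)] for _ in range(CO)]
            for _ in range(BO)]


def gemm_reference(data, weight):
    """Unscheduled form: 4 spatial loops plus 2 reduction loops."""
    out = zeros()
    for bo, co, bi, ci, ko, ki in itertools.product(
            range(BO), range(CO), range(BI), range(CI), range(KO), range(KI)):
        out[bo][co][bi][ci] += data[bo][ko][bi][ki] * weight[co][ko][ci][ki]
    return out


def gemm_split_reordered(data, weight):
    """ko is split into ko_out/ko_inn and the loops are reordered: 7 loops."""
    out = zeros()
    for ko_out in range(KO // FACTOR):
        for bo in range(BO):
            for co in range(CO):
                for ko_inn in range(FACTOR):
                    ko = ko_out * FACTOR + ko_inn  # rebuild the original index
                    for bi in range(BI):
                        for ci in range(CI):
                            for ki in range(KI):
                                out[bo][co][bi][ci] += (
                                    data[bo][ko][bi][ki] * weight[co][ko][ci][ki])
    return out


data, weight = make_operands()
```

The three innermost loops (bi, ci, ki) are the ones that would correspond to the _tns axes, i.e. the part replaced by a single GEMM call after tensorize.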


Thank you very much~