Autoscheduler error for NHWC + ARM back-end

Hi all,

I am trying to use the Autoscheduler (aka Ansor) for the ARM back-end (so, sorry in advance for newbie questions). While generating the schedule for quantized+NHWC (i.e., using the conv2d_gemm.py computation), I get this error:

  Check failed: found_attach || stage_attach.size() == 0 == false: Invalid Schedule, cannot find the producer compute(weight_flatten, body=[placeholder[floordiv(x, 16), 0, floormod(x, 16), y]], axis=[iter_var(x, range(min=0, ext=16)), iter_var(y, range(min=0, ext=96))], reduce_axis=[], tag=, attrs={}) along the loop nest specified by compute_at of consumer compute(weight_block_reshape, body=[weight_flatten[(w + (16*y)), (z + (4*x))]], axis=[iter_var(x, range(min=0, ext=24)), iter_var(y, range(min=0, ext=1)), iter_var(z, range(min=0, ext=4)), iter_var(w, range(min=0, ext=16))], reduce_axis=[], tag=, attrs={})

From then onward the fail_ct keeps growing; I guess this is because an invalid program is being passed to the evolutionary search. Is there a reason for this? Is quantized+NHWC not yet supported for ARM?

Thanks,

cc @FrozenGene, @comaniac, @merrymercy

Please first try fp32 on ARM; it should work.

For quantized, I am not sure whether our upstreamed code supports it or not; the code has not been uploaded completely. For Ansor with conv2d_gemm, I haven't verified this. Previously I used the normal quantized conv2d computation and let Ansor generate the schedule.

Hi @FrozenGene,

Thanks for your reply! I haven't tried fp32 on ARM yet. Quantized NCHW works (i.e., no errors are produced). When I switch to NHWC, the above error appears.

The strange thing is that I am not doing anything fancy: I only want to call the bare (unoptimized) computation with both transforms active (no tensorization, no pre-transform of the weights).

The error appears to be in the weight_flatten computation (i.e., when I try to flatten the weights into a matrix).

Could you provide a script to reproduce the error or print the task.compute_dag for the failed task?

Regarding performance, the CPU backend has not been finished and tested yet. It lacks a layout rewrite Relay pass, which is critical for performance. I will upstream it this week and then do more thorough testing of the CPU backends.

Hi @merrymercy,

Thanks for your reply. I am not looking at performance yet, only at functionality.

What I am doing is adding autoscheduler support to tvmc, so that eventually I can simply show you a single-line tvmc call.

As for printing the compute_dag, could you tell me where a good place to put a tvm::Dump would be?

Thanks,

You can just paste the output of these lines.
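
For example, something along these lines (a minimal sketch assuming the tasks are extracted with auto_scheduler.extract_tasks; adapt the names to your own tuning script):

  # Sketch: print the ComputeDAG of every extracted task.
  # Assumes `mod`, `params` and `target` are already defined in your script.
  from tvm import auto_scheduler

  tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
  for idx, task in enumerate(tasks):
      print("========== Task %d ==========" % idx)
      print(task.compute_dag)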

and tell me the failed task id.

Hi @merrymercy, I found out that NCHW + quantized (on ARM) also gives weird errors. I am not tuning a whole network, but a single conv2d operation that I extracted from Inception v3 and wrapped in a tflite file (which works fine with the Autotuner).

This is the error I am getting:

Check failed: found_attach || stage_attach.size() == 0 == false: Invalid Schedule, cannot find the producer compute(PadInput, body=[tir.if_then_else(((((i2 >= 1) && (i2 < 74)) && (i3 >= 1)) && (i3 < 74)), placeholder[i0, i1, (i2 - 1), (i3 - 1)], (int16)0)], axis=[iter_var(i0, range(min=0, ext=1)), iter_var(i1, range(min=0, ext=80)), iter_var(i2, range(min=0, ext=75)), iter_var(i3, range(min=0, ext=75))], reduce_axis=[], tag=injective,pad, attrs={}) along the loop nest specified by compute_at of consumer compute(data_vec, body=[PadInput[n, ci, (h + vh), (w + vw)]], axis=[iter_var(n, range(min=0, ext=1)), iter_var(h, range(min=0, ext=73)), iter_var(w, range(min=0, ext=73)), iter_var(ci, range(min=0, ext=80)), iter_var(vh, range(min=0, ext=3)), iter_var(vw, range(min=0, ext=3))], reduce_axis=[], tag=, attrs={})

And this is the DAG, obtained as you suggested:

compile_engine_const() = 85
placeholder = PLACEHOLDER [1, 80, 73, 73]
PadInput(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 74)) && (i3 >= 1)) && (i3 < 74)), placeholder[i0, i1, (i2 - 1), (i3 - 1)], (int16)0)
data_vec(n, h, w, ci, vh, vw) = PadInput[n, ci, (h + vh), (w + vw)]
placeholder = PLACEHOLDER [192, 80, 3, 3]
kernel_vec(co, ci, kh, kw, vc) = placeholder[((co*16) + vc), ci, kh, kw]
conv(n, co, h, w, vh, vw, vc) += (int32(data_vec[n, h, w, ci, (vh + kh), (vw + kw)])*int32(kernel_vec[co, ci, kh, kw, vc]))
output_unpack(n, co, h, w) = conv[n, floordiv(co, 16), h, w, 0, 0, floormod(co, 16)]
placeholder = PLACEHOLDER [1, 192, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (output_unpack[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, 0, 0])
T_cast(ax0, ax1, ax2, ax3) = T_add[ax0, ax1, ax2, ax3]
compute(i0, i1, i2, i3) = tir.q_multiply_shift(T_cast[i0, i1, i2, i3], 1437270242, 31, -8)
T_add(ax0, ax1, ax2, ax3) = (compile_engine_const[] + compute[ax0, ax1, ax2, ax3])
compute(i0, i1, i2, i3) = max(min(T_add[i0, i1, i2, i3], 255), 0)
T_cast(ax0, ax1, ax2, ax3) = uint8(compute[ax0, ax1, ax2, ax3])

Side question: doesn’t every state have a different dag?

Thanks,

@giuseros

It seems that you are using an autotvm template to provide the compute definition for the autoscheduler. So you think "every state has a different dag", which is true for an autotvm template, because we embed the tile sizes in the compute definition.

But in the autoscheduler this is not true: all states share the same DAG and differ only in their schedules. The autoscheduler infers a good layout from the schedule and does the layout rewrite automatically. Although the autoscheduler is expected to work correctly for any compute definition, it is not recommended to directly reuse the packed compute definition from an autotvm template. Instead, we should provide an unpacked compute definition (which is simpler) and let the autoscheduler infer a good layout.

As for your problem, I cannot figure it out from the DAG alone, so I need your script to help you debug.

Where did you get this error? Is it during SampleInitPopulation? Is it during GA mutation? A full traceback with contextual output would be helpful.

Thanks for the detailed (and interesting) answer.

So, right: the sketch defines the dag, and then different states share the same (initial) dag. But different sketches have different dags. Is this correct?

A very interesting thing you said: "The autoscheduler infers a good layout from the schedule and does the layout rewrite automatically". Let's say I have a GEMM computation and I want to pack matrix A into a more friendly data layout (see this RFC, for instance). Can I inject my data packing into the autoscheduler?

As for debugging further, once I have the autoscheduler in tvmc I will send you the tvmc command so that we can start from common ground. Meanwhile, I will try to add some printfs to the code to see where the error shows up.

Thanks,

@giuseros

The DAG here only refers to the compute description defined by tvm.te.compute. It does not contain anything related to implementation or optimization. So all sketches have the same DAG, because they all come from a plain compute definition.
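
For example, a minimal GEMM written directly with te.compute (an illustrative sketch only, not the exact definition from your conv2d task):

  # Sketch: an unpacked compute definition of the kind the auto-scheduler starts from.
  # Every sketch/state generated for this task shares this same DAG.
  from tvm import te

  M, N, K = 1024, 1024, 1024
  A = te.placeholder((M, K), name="A")
  B = te.placeholder((K, N), name="B")
  k = te.reduce_axis((0, K), name="k")
  C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")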

Sorry, but I am a bit confused. From the original Ansor paper:

" Note that new nodes(e.g., caching nodes, layout transform nodes) may also be introduced to the DAG during the sketch generation"

And indeed, looking at the file sketch_policy.cc, in the definition of the StateNode class:

  /*! \brief The up-to-date ComputeDAG of this state. The default value is an empty NullOpt,
   * meaning the dag of this state is the same as the original ComputeDAG in the SearchTask.
   * Otherwise, the stored value is the up-to-date ComputeDAG for this state, meaning some steps
   * (e.g., CacheReadStep/CacheWriteStep) have modified the ComputeDAG.
   */
  Optional<ObjectRef> current_compute_dag;

So, the dag of a given state (during the sketch generation) might be different from the original dag (stored in the search task). What am I missing?

Thanks,

I think this is just a terminology issue. Lianmin is correct in terms of the implementation: the ComputeDAG is basically read-only during the tuning process, but its schedule may change the number of stages. For example:

ComputeDAG: in -> compute -> out
TransformSteps: CacheRead(in)
State: in -> local_in -> compute -> out

It’s like what we did in TOPI. All TOPI schedules share the same TOPI compute, and a schedule can transform the lowered program by introducing more stages.
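
A minimal TE sketch of the same idea (illustrative only; the auto-scheduler applies such transform steps internally rather than through a handwritten schedule):

  # Sketch: the compute stays the same, but the schedule introduces a new stage.
  import tvm
  from tvm import te

  n = 128
  inp = te.placeholder((n,), name="in")
  out = te.compute((n,), lambda i: inp[i] * 2, name="out")

  s = te.create_schedule(out.op)
  local_in = s.cache_read(inp, "local", [out])   # adds an "in.local" cache stage
  print(tvm.lower(s, [inp, out], simple_mode=True))  # stages: in -> in.local -> out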

On the other hand, if you want to change the compute (i.e., you need a different TE compute function, such as NCHW to NCHWc), then you need the layout rewrite pass Lianmin mentioned. Furthermore, if the compute change cannot be supported by layout rewrite (e.g., direct to Winograd), then you need to provide the right TE compute yourself, or make sure the Relay op strategy selects the right TOPI compute if you're tuning a Relay program.

We messed up some terms.

In post 5, I asked you to print the "ComputeDAG". There, I only referred to the initial compute definition, i.e., task.compute_dag. This DAG is shared by all sketches and states. It gives me the information to reproduce your bug, because I can reconstruct the search task from this DAG and run the search myself.

In post 8, you asked "doesn't every state have a different dag?". That refers to the current_compute_dag in each state. I typically just call this a "state", because it is a field of LoopState.

To avoid confusion in the future, we can call task.compute_dag the "task DAG" and state.current_compute_dag the "state DAG".

Thanks @merrymercy, @comaniac for the detailed explanations; everything is clear now. Also, sorry for the naming confusion, I am still new to this part of the codebase :slight_smile: