Hi @merrymercy,
I found that NCHW + quantized (on ARM) also gives weird errors. I am not tuning a whole network, just a single conv2d operation that I extracted from Inception v3 and wrapped in a TFLite file (which works fine with the Autotuner).
This is the error I am getting:
Check failed: found_attach || stage_attach.size() == 0 == false: Invalid Schedule, cannot find the producer compute(PadInput, body=[tir.if_then_else(((((i2 >= 1) && (i2 < 74)) && (i3 >= 1)) && (i3 < 74)), placeholder[i0, i1, (i2 - 1), (i3 - 1)], (int16)0)], axis=[iter_var(i0, range(min=0, ext=1)), iter_var(i1, range(min=0, ext=80)), iter_var(i2, range(min=0, ext=75)), iter_var(i3, range(min=0, ext=75))], reduce_axis=[], tag=injective,pad, attrs={}) along the loop nest specified by compute_at of consumer compute(data_vec, body=[PadInput[n, ci, (h + vh), (w + vw)]], axis=[iter_var(n, range(min=0, ext=1)), iter_var(h, range(min=0, ext=73)), iter_var(w, range(min=0, ext=73)), iter_var(ci, range(min=0, ext=80)), iter_var(vh, range(min=0, ext=3)), iter_var(vw, range(min=0, ext=3))], reduce_axis=[], tag=, attrs={})
And this is the DAG, obtained the way you told me:
compile_engine_const() = 85
placeholder = PLACEHOLDER [1, 80, 73, 73]
PadInput(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 74)) && (i3 >= 1)) && (i3 < 74)), placeholder[i0, i1, (i2 - 1), (i3 - 1)], (int16)0)
data_vec(n, h, w, ci, vh, vw) = PadInput[n, ci, (h + vh), (w + vw)]
placeholder = PLACEHOLDER [192, 80, 3, 3]
kernel_vec(co, ci, kh, kw, vc) = placeholder[((co*16) + vc), ci, kh, kw]
conv(n, co, h, w, vh, vw, vc) += (int32(data_vec[n, h, w, ci, (vh + kh), (vw + kw)])*int32(kernel_vec[co, ci, kh, kw, vc]))
output_unpack(n, co, h, w) = conv[n, floordiv(co, 16), h, w, 0, 0, floormod(co, 16)]
placeholder = PLACEHOLDER [1, 192, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (output_unpack[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, 0, 0])
T_cast(ax0, ax1, ax2, ax3) = T_add[ax0, ax1, ax2, ax3]
compute(i0, i1, i2, i3) = tir.q_multiply_shift(T_cast[i0, i1, i2, i3], 1437270242, 31, -8)
T_add(ax0, ax1, ax2, ax3) = (compile_engine_const[] + compute[ax0, ax1, ax2, ax3])
compute(i0, i1, i2, i3) = max(min(T_add[i0, i1, i2, i3], 255), 0)
T_cast(ax0, ax1, ax2, ax3) = uint8(compute[ax0, ax1, ax2, ax3])
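
To make the two stages named in the error concrete, here is a rough NumPy sketch of what PadInput and data_vec compute, as I read them from the DAG above. The random input values and the Python variable names are assumptions for illustration only. This is not TVM code and it does not reproduce the scheduling failure; it only shows the producer/consumer relation and the shapes involved.

```python
import numpy as np

# Input placeholder with the shape from the DAG: [1, 80, 73, 73], int16.
# The values are random stand-ins (assumption, not real model data).
rng = np.random.default_rng(0)
data = rng.integers(-128, 128, size=(1, 80, 73, 73), dtype=np.int16)

# PadInput: zero-pad H and W by 1 on each side -> [1, 80, 75, 75].
pad_input = np.pad(data, ((0, 0), (0, 0), (1, 1), (1, 1)), constant_values=0)

# data_vec[n, h, w, ci, vh, vw] = PadInput[n, ci, h + vh, w + vw]
# with h, w in [0, 73) and vh, vw in [0, 3).
data_vec = np.empty((1, 73, 73, 80, 3, 3), dtype=np.int16)
for vh in range(3):
    for vw in range(3):
        # Shifted 73x73 view of the padded input, moved from NCHW to NHWC.
        window = pad_input[:, :, vh:vh + 73, vw:vw + 73]
        data_vec[:, :, :, :, vh, vw] = window.transpose(0, 2, 3, 1)

print(pad_input.shape, data_vec.shape)
```

So data_vec is the consumer that reads PadInput directly, which is the pair the "cannot find the producer" check complains about.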
Side question: doesn’t every state have a different DAG?
Thanks,