[Autoscheduler] Why no winograd for NCHW layout?

@merrymercy @comaniac

Recently I’ve been tuning maskrcnn workloads exclusively on the NCHW layout. This week I tried NHWC tuning, and I was happy to find that the NHWC result is better (45 ms, vs 51 ms on NCHW).

Looking at the profiling result, I realized that winograd is only enabled for NHWC layout. Why is that? Simply due to lack of time?

Hi @masahi, I implemented winograd for NCHW during Ansor development too. But because we mainly optimized for the NHWC layout (we found it performs better on both CPU and GPU), we didn't enable it for NCHW when we upstreamed. For the same reason we mainly optimize the NHWC layout on other hardware (like Mali / ARM…) as well, not just for winograd.

I still want to see winograd enabled for NCHW too, since NHWC has many issues due to a lack of focus from the community.

For example, I had to spend some time supporting roi_align on NHWC. Although faster rcnn works great on NHWC after my roi_align fix, maskrcnn does not seem to work on NHWC at all, due to its dynamic-batch conv2d and conv2d_transpose. Trying to compile a dynamic-batch NHWC conv2d gives an obscure error (something related to shared memory). Worse, there is no NHWC conv2d_transpose operator.
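For reference, here is a minimal sketch of the kind of graph that breaks (the shapes are made up for illustration; the important part is the `relay.Any()` batch dimension):

```python
import tvm
from tvm import relay

# Hypothetical repro: a dynamic-batch NHWC conv2d. With a static batch
# this compiles fine; with relay.Any() as the batch it does not.
data = relay.var("data", shape=(relay.Any(), 14, 14, 256), dtype="float32")
weight = relay.var("weight", shape=(3, 3, 256, 256), dtype="float32")
out = relay.nn.conv2d(
    data,
    weight,
    kernel_size=(3, 3),
    padding=(1, 1),
    data_layout="NHWC",
    kernel_layout="HWIO",
)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

# Dynamic shapes have to go through the VM compiler; this is where the
# obscure shared-memory related error shows up for me.
exe = relay.vm.compile(mod, target="cuda")
```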

So I think unusual workloads like that are where NHWC easily breaks. Other issues include too many layout_transform ops, obscure runtime errors from shape funcs that assume NCHW, etc.
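(For context, the layout_transforms come from converting an imported NCHW model with the ConvertLayout pass; here is a toy example standing in for a real frontend import:)

```python
import tvm
from tvm import relay

# Toy NCHW model standing in for a frontend import,
# e.g. mod, params = relay.frontend.from_pytorch(...).
data = relay.var("data", shape=(1, 3, 224, 224))
weight = relay.var("weight", shape=(16, 3, 3, 3))
conv = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

# Convert conv2d to NHWC. Any op without an NHWC implementation gets
# sandwiched between layout_transform ops -- on a model with many such
# ops this is where "too many layout_transform" comes from.
desired_layouts = {"nn.conv2d": ["NHWC", "default"]}
seq = tvm.transform.Sequential(
    [
        relay.transform.RemoveUnusedFunctions(),
        relay.transform.ConvertLayout(desired_layouts),
    ]
)
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
print(mod)  # inspect where layout_transform ops were inserted
```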

I understand your concern. We should indeed support NCHW, since MXNet / ONNX models are naturally in NCHW layout. However, there are many things to handle besides winograd, because we currently only enable NHWC in our op strategy. I think it is a worthwhile task (and we should discuss how to handle NCHWc with the graph tuner too). At the same time, our topi should provide more support for the NHWC layout (for example, TensorCore prefers NHWC, and more and more frameworks provide channels-last support, including PyTorch: (beta) Channels Last Memory Format in PyTorch — PyTorch Tutorials 1.7.1 documentation).
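A quick illustration of that channels-last API (my own example, not from the tutorial):

```python
import torch

# PyTorch (beta) channels-last: the tensor keeps its NCHW logical shape,
# but the underlying memory is reordered to NHWC.
x = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
conv = torch.nn.Conv2d(3, 64, kernel_size=3).to(memory_format=torch.channels_last)
y = conv(x)  # the output stays in channels-last format
```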

Yes, the reason is lack of time. You can mimic the NHWC winograd implementation and port it to NCHW. Basically, you need to write the compute definition and register it with OpStrategy.
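To make that concrete, here is a rough sketch of what the compute definition could look like, adapted from topi.nn.conv2d_winograd_nhwc in python/tvm/topi/nn/conv2d.py (stride 1, no dilation, square kernel only; the name conv2d_winograd_nchw and its exact signature are hypothetical, not existing TVM code):

```python
import tvm
from tvm import te
from tvm.topi.nn import pad
from tvm.topi.nn.winograd_util import winograd_transform_matrices
from tvm.topi.utils import get_const_tuple


def conv2d_winograd_nchw(data, weight, padding=1, tile_size=4, out_dtype="float32"):
    """Hypothetical NCHW winograd compute (stride 1, square kernel, no dilation)."""
    N, CI, H, W = get_const_tuple(data.shape)
    CO, _, KH, KW = get_const_tuple(weight.shape)
    assert KH == KW, "winograd requires a square kernel"
    m, r = tile_size, KH
    alpha = m + r - 1  # transformed tile size
    A, B, G = winograd_transform_matrices(m, r, out_dtype)

    H_out = H + 2 * padding - KH + 1
    W_out = W + 2 * padding - KW + 1
    nH = (H_out + m - 1) // m
    nW = (W_out + m - 1) // m
    P = N * nH * nW  # total number of input tiles

    # Pad so (alpha x alpha) tiles with stride m cover the whole input.
    data_pad = pad(
        data,
        (0, 0, padding, padding),
        (0, 0, (nH - 1) * m + alpha - H - padding, (nW - 1) * m + alpha - W - padding),
        name="data_pad",
    )

    idxdiv, idxmod = tvm.tir.indexdiv, tvm.tir.indexmod

    # Gather overlapping (alpha x alpha) input tiles.
    input_tile = te.compute(
        (CI, P, alpha, alpha),
        lambda ci, p, eps, nu: data_pad[
            idxdiv(p, nH * nW),
            ci,
            idxmod(idxdiv(p, nW), nH) * m + eps,
            idxmod(p, nW) * m + nu,
        ],
        name="input_tile",
    )

    # Kernel transform: U = G g G^T.
    r_kh = te.reduce_axis((0, KH), "r_kh")
    r_kw = te.reduce_axis((0, KW), "r_kw")
    kernel_pack = te.compute(
        (alpha, alpha, CI, CO),
        lambda eps, nu, ci, co: te.sum(
            weight[co, ci, r_kh, r_kw] * G[eps, r_kh] * G[nu, r_kw],
            axis=[r_kh, r_kw],
        ),
        name="kernel_pack",
    )

    # Input transform: V = B^T d B.
    r_a = te.reduce_axis((0, alpha), "r_a")
    r_b = te.reduce_axis((0, alpha), "r_b")
    data_pack = te.compute(
        (alpha, alpha, CI, P),
        lambda eps, nu, ci, p: te.sum(
            input_tile[ci, p, r_a, r_b] * B[r_a, eps] * B[r_b, nu],
            axis=[r_a, r_b],
        ),
        name="data_pack",
    )

    # Batched GEMM in the transformed domain.
    ci = te.reduce_axis((0, CI), "ci")
    bgemm = te.compute(
        (alpha, alpha, CO, P),
        lambda eps, nu, co, p: te.sum(
            data_pack[eps, nu, ci, p] * kernel_pack[eps, nu, ci, co], axis=[ci]
        ),
        name="bgemm",
    )

    # Inverse transform: Y = A^T M A.
    r_a2 = te.reduce_axis((0, alpha), "r_a2")
    r_b2 = te.reduce_axis((0, alpha), "r_b2")
    inverse = te.compute(
        (CO, P, m, m),
        lambda co, p, vh, vw: te.sum(
            bgemm[r_a2, r_b2, co, p] * A[r_a2, vh] * A[r_b2, vw],
            axis=[r_a2, r_b2],
        ),
        name="inverse",
    )

    # Scatter the (m x m) output tiles back into NCHW.
    return te.compute(
        (N, CO, H_out, W_out),
        lambda n, co, h, w: inverse[
            co,
            n * nH * nW + idxdiv(h, m) * nW + idxdiv(w, m),
            idxmod(h, m),
            idxmod(w, m),
        ],
        name="conv2d_winograd_nchw",
    )
```

Registering it would then mirror the existing NHWC winograd entry in the CUDA op strategy: wrap the compute with wrap_compute_conv2d, pair it with naive_schedule (the auto-scheduler generates the real schedule itself), and add it under the NCHW branch with a high plevel so it gets picked.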

I don't think any of the original Ansor developers will have time to do this.

No problem, maybe I'll do it when I have time.