Hi, I noticed that the FoldScaleAxis pass is not enabled for CUDA Winograd with the weight transform precomputed, nor for x86 NCHWc convolution. As a result, the “broadcast_mul” op from batch norm doesn’t go away after compiling. Is this intended? @merrymercy @yzhliu
It needs to happen before simplify_inference, which is able to handle the modified shape. Otherwise, once batch_norm is dissolved, it is no longer easy to target batch norm, and layout transforms get introduced for the dissolved broadcast_add, broadcast_mul, etc. It can still work, but performance is not as good.
I think AlterOpLayout can happen before FoldScaleAxis, but does FoldScaleAxis depend on SimplifyInference?
Yes, FoldScaleAxis looks for the broadcast_mul op, which is unpacked from batch norm. So FoldScaleAxis should happen after SimplifyInference.
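To make the dependency concrete, here is a rough sketch of the arithmetic in plain Python. It is not TVM code: the helper names (`conv1x1`, `batch_norm_to_scale_shift`, `unfolded`, `folded`) and the tiny 1x1 convolution are assumptions made up for illustration. It only shows that inference-time batch norm reduces to a per-channel multiply and add, and that the multiply can then be absorbed into the convolution weights.

```python
import math

def conv1x1(x, w):
    """Pointwise conv. x: [C_in][pixels], w: [C_out][C_in] (hypothetical layout)."""
    return [[sum(w[o][i] * x[i][p] for i in range(len(x)))
             for p in range(len(x[0]))]
            for o in range(len(w))]

def batch_norm_to_scale_shift(gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm rewritten as y = x * scale + shift per channel."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    shift = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return scale, shift

def unfolded(x, w, scale, shift):
    """conv -> multiply -> add: the multiply survives as a separate op."""
    y = conv1x1(x, w)
    return [[v * scale[o] + shift[o] for v in row] for o, row in enumerate(y)]

def folded(x, w, scale, shift):
    """Scale folded into the weights' output-channel axis; only the add remains."""
    w_scaled = [[wv * scale[o] for wv in row] for o, row in enumerate(w)]
    y = conv1x1(x, w_scaled)
    return [[v + shift[o] for v in row] for o, row in enumerate(y)]

# Made-up data: 2 input channels, 3 pixels, 3 output channels.
x = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
w = [[0.2, -0.3], [1.5, 0.7], [-0.4, 0.9]]
scale, shift = batch_norm_to_scale_shift(
    gamma=[1.0, 0.5, 2.0], beta=[0.1, 0.0, -0.2],
    mean=[0.3, -0.1, 0.0], var=[1.0, 4.0, 0.25])

a = unfolded(x, w, scale, shift)
b = folded(x, w, scale, shift)
max_diff = max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
print("max difference:", max_diff)
```

The two results should agree up to floating-point rounding. The point for this thread is that the multiply only exists as a separate op once batch norm has been unpacked, which is why the folding step has to come after that.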
This issue seems more complicated than I thought … And I don’t know how much enabling FoldScaleAxis would help improve performance. Probably the difference would be minimal.