[BUG] Performance drop with batch and opt_level=3

I’m facing the same issue and I can reproduce it with a retinaface model (this model).

When batch size is 1, outputs are correct.

When batch size is 2 or larger and opt_level is 1 or 2, outputs are correct.

When batch size is 2 or larger and opt_level is 3, some outputs differ from those of the original model.

Is this a bug? Any idea why this happens? @tqchen
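
To narrow down which items in the batch diverge, a per-item comparison against the original model's outputs may help. The following is only a minimal sketch, not taken from the setup described above: `mismatched_batch_items` is a hypothetical helper, the tolerances are arbitrary, and the sample values are made up for illustration. It assumes each model output has been flattened to one list of floats per batch item (for a NumPy array, something like `out.reshape(batch, -1).tolist()`).

```python
import math


def mismatched_batch_items(reference, candidate, rel_tol=1e-4, abs_tol=1e-5):
    """Return (batch_index, max_abs_diff) for every batch item that differs.

    `reference` and `candidate` are sequences with one flat sequence of
    floats per batch item. The tolerances are illustrative assumptions.
    """
    if len(reference) != len(candidate):
        raise ValueError("batch sizes differ")

    mismatches = []
    for index, (ref_item, cand_item) in enumerate(zip(reference, candidate)):
        if len(ref_item) != len(cand_item):
            raise ValueError(f"item {index} has a different number of values")
        pairs = list(zip(ref_item, cand_item))
        # An item matches only if every value is within tolerance.
        if not all(math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
                   for r, c in pairs):
            mismatches.append((index, max(abs(r - c) for r, c in pairs)))
    return mismatches


# Made-up values: two batch items, the second one diverges.
reference = [[0.10, 0.90, 0.30], [0.20, 0.80, 0.40]]
opt_level_3 = [[0.10, 0.90, 0.30], [0.20, 0.55, 0.40]]

print(mismatched_batch_items(reference, reference))
print(mismatched_batch_items(reference, opt_level_3))
```

Running the same check on the outputs built with opt_level 2 and with opt_level 3 would show whether the mismatch is limited to particular batch indices (for example, every item after the first), which could help whoever looks into this.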