Since you didn’t post the string that describes your target (e.g. llvm -mcpu=xxx plus other flags), I can only offer a wild guess. In some circumstances I have found that forcing a particular layout (e.g. NHWC vs NCHW) or data type (e.g. int8 vs int16) gives you a worse schedule, which results in longer latencies. So could it be that choosing a particular optimization level forces the compiler to generate a worse schedule?
Choosing metal is more restrictive and thus guides the scheduling at compile time, which gives better latencies. Again, it’s a wild guess and I could be completely wrong.
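One way to check a guess like this is to time each configuration separately and compare medians. Below is a minimal sketch in plain Python. It does not call any compiler API; the helper names (median_latency_ms, compare_configs) and the configuration labels are hypothetical, and the lambdas are stand-in workloads. In practice each callable would run one inference on the module you built for that target string, layout, or optimization level.

```python
import statistics
import time
from typing import Callable, Dict


def median_latency_ms(run: Callable[[], object], warmup: int = 3, repeats: int = 20) -> float:
    """Return the median wall-clock latency of run() in milliseconds."""
    # Warm-up calls are executed but not timed (caches, lazy initialization).
    for _ in range(warmup):
        run()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def compare_configs(runners: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Time each named configuration and return {name: median latency in ms}."""
    return {name: median_latency_ms(run) for name, run in runners.items()}


if __name__ == "__main__":
    # Hypothetical labels with stand-in workloads; replace each lambda with a
    # call that runs your compiled module for that configuration.
    runners = {
        "llvm, NCHW": lambda: sum(range(20_000)),
        "llvm, NHWC": lambda: sum(range(40_000)),
    }
    results = compare_configs(runners)
    # Print fastest configuration first.
    for name, ms in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name}: {ms:.3f} ms")
```

Changing one setting at a time (only the layout, only the data type, or only the optimization level) makes it easier to see which one is responsible for the slowdown.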