[MacOS][M1] Metal much slower than OpenCL

I benchmarked my model with the Metal target, which is the recommended backend on macOS, but the result is very poor, as shown below.

[screenshot: Metal benchmark result]

This troubles me a lot, because the result on CPU is faster than it, which is about 50 ms.

Out of ideas, I tried changing the target from Metal to OpenCL, and the result looks much more normal.


The only difference in my code is the target setting.
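For readers who want to reproduce the comparison: the original code was not posted, so the following is only a minimal sketch of what "changing the target setting" could look like with the Relay build flow. The helper names `target_config` and `build_and_time`, the host triple string, and the `number`/`repeat` values are all assumptions for illustration, not the poster's actual code.

```python
# Sketch only: the original code was not posted; all names here are illustrative.
HOST_TARGET = "llvm -mtriple=arm64-apple-macos"  # assumed host triple for Apple Silicon


def target_config(backend):
    """Return (device target kind, host target string) for a GPU backend."""
    if backend not in ("metal", "opencl"):
        raise ValueError(f"unsupported backend: {backend!r}")
    return backend, HOST_TARGET


def build_and_time(mod, params, backend, input_name, input_data):
    """Build a Relay module for `backend` and return the mean run time in ms."""
    # TVM is imported lazily so target_config() can be used without it installed.
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    kind, host = target_config(backend)
    # The target and the matching device are the only things that change
    # between the Metal run and the OpenCL run.
    target = tvm.target.Target(kind, host=host)
    dev = tvm.device(kind, 0)

    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)

    module = graph_executor.GraphModule(lib["default"](dev))
    module.set_input(input_name, input_data)

    # Time the "run" function; results holds one value in seconds per repeat.
    timer = module.module.time_evaluator("run", dev, number=10, repeat=3)
    results = timer().results
    return 1000.0 * sum(results) / len(results)
```

Calling `build_and_time(mod, params, "metal", ...)` and then the same with `"opencl"` would give two numbers that differ only in the backend, which is the comparison described above.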

In my opinion, either Metal needs extra settings or Metal on M1 currently has bugs.

Does anyone know what causes this problem?

I am testing on the M1 Pro chip, and I see the same issue with Metal that @vincentily mentioned.

Is there anyone who can figure out this problem?

Have you tried AutoScheduler with the Metal target? The default schedules are not good for this target, so you need to tune.
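In case it helps, here is a rough sketch of what tuning with AutoScheduler and then rebuilding could look like, modeled on the general network-tuning flow for GPU targets. The function name `tune_and_build`, the log file name, and the trial count are placeholders, and I have not verified that the default measurement setup and hardware parameters work well for Metal on M1, so treat this as a starting point under those assumptions rather than a known-good recipe.

```python
# Sketch only: names, log file, and trial count are illustrative assumptions.
USE_AUTO_SCHEDULER = {"relay.backend.use_auto_scheduler": True}


def tune_and_build(mod, params, target, log_file="metal_tuning.json", num_trials=200):
    """Tune a Relay module with AutoScheduler, then build it with the best records."""
    # TVM is imported lazily so this file can be loaded without it installed.
    import tvm
    from tvm import auto_scheduler, relay

    # Split the network into tuning tasks (one per distinct fused subgraph).
    tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

    # Measure candidates in a local RPC process, as is usual for GPU targets.
    # Assumption: this measurement setup is usable with the Metal device.
    measure_ctx = auto_scheduler.LocalRPCMeasureContext(
        repeat=1, min_repeat_ms=300, timeout=10
    )
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=num_trials,  # small value for a quick trial run
        runner=measure_ctx.runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    tuner.tune(tune_option)

    # Rebuild the model using the best schedules found during tuning.
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(opt_level=3, config=USE_AUTO_SCHEDULER):
            return relay.build(mod, target=target, params=params)
```

The returned library can then be run and timed the same way as an untuned build, so the tuned Metal number can be compared directly against the CPU and OpenCL results.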

OK, I will give it a try, thanks.

Hi, friend! Have you solved this problem? Is Metal faster than CPU after tuning?