I tried to benchmark my model using Metal, which is recommended for macOS, but the result is very bad, as shown below.

This puzzles me a lot, because running on the CPU is faster, at about 50 ms.
Not knowing what else to try, I switched from Metal to OpenCL, and the result looks much more normal.

The only difference in my code is the target setting.
In my opinion, either Metal needs extra settings or Metal on M1 currently has bugs.
Does anyone know what causes this problem?
