Auto-tuning a float16 Relax module with MetaSchedule does not achieve good performance

I used the ToMixedPrecision pass to convert a Stable Diffusion model to float16, then auto-tuned the float16 model with MetaSchedule. After 50000 iterations, the total latency is 1.9e7 us (about 19 s), which is far too slow: my float32 model reaches a total latency of 400 ms. This was tested on my NVIDIA Titan Xp machine, using CUDA.
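
To put the two numbers on the same scale, here is a small sketch that converts the reported float16 latency from microseconds to milliseconds and compares it with the float32 figure. The variable and function names are mine, chosen for illustration; only the two latency values come from the measurements above.

```python
# Latencies as reported in this post.
FP16_LATENCY_US = 1.9e7   # float16 total latency, in microseconds
FP32_LATENCY_MS = 400.0   # float32 total latency, in milliseconds


def us_to_ms(microseconds: float) -> float:
    """Convert microseconds to milliseconds."""
    return microseconds / 1000.0


fp16_latency_ms = us_to_ms(FP16_LATENCY_US)
slowdown = fp16_latency_ms / FP32_LATENCY_MS

print(f"float16: {fp16_latency_ms:.0f} ms")
print(f"float32: {FP32_LATENCY_MS:.0f} ms")
print(f"float16 is {slowdown:.1f}x slower")
```

So the tuned float16 model is roughly 47.5 times slower than the float32 one, not just marginally worse.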

My video card is too old; with an RTX card it works fine.
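
One plausible explanation, which I have not verified against the tuning logs, is that the Titan Xp is a Pascal card with CUDA compute capability 6.1 and no Tensor Cores, while RTX cards are Turing or newer (7.5 and up) and do have them. The sketch below encodes that rule of thumb. The helper name and the choice of an RTX 2080 as the comparison card are illustrative assumptions, not something taken from the runs above.

```python
from typing import Tuple

# Tensor Cores were introduced with Volta, compute capability 7.0.
TENSOR_CORE_MIN_MAJOR = 7


def has_tensor_cores(compute_capability: Tuple[int, int]) -> bool:
    """Return True if a GPU of this compute capability has Tensor Cores."""
    major, _minor = compute_capability
    return major >= TENSOR_CORE_MIN_MAJOR


TITAN_XP = (6, 1)   # Pascal
RTX_2080 = (7, 5)   # Turing; example RTX card chosen for illustration

for name, cc in [("Titan Xp", TITAN_XP), ("RTX 2080", RTX_2080)]:
    print(f"{name} (sm_{cc[0]}{cc[1]}): tensor cores = {has_tensor_cores(cc)}")
```

If this is the cause, float16 on the Titan Xp has no fast hardware path to tune toward, which would fit the result that the same workflow performs well on an RTX card.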