FP16 result loss compared with FP32

I noticed that the atol/rtol values range from 1e-3 to 0.05 in test_to_mixed_precision.py. For example, for a single conv, the comparison is as follows:

verify_mixed_precision_output_close(mod, mod_params, atol=0.01, rtol=1e-3)

And if I change atol to 1e-3, I get an error like this:

Mismatched elements: 496 / 5120 (9.69%)
Max absolute difference: 0.01308918
Max relative difference: 0.59108096
 x: array([[[[-1.170484, -0.899734,  0.29025 , ..., -1.748149, -2.216193,
           1.223453],
         [-1.448114,  1.222129, -1.340144, ..., -1.264967, -0.286925,...
 y: array([[[[-1.171  , -0.9    ,  0.29   , ..., -1.75   , -2.215  ,
           1.224  ],
         [-1.448  ,  1.224  , -1.337  , ..., -1.265  , -0.2886 ,...

So my question is: how can we decide what atol/rtol range is acceptable?
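One way to get intuition for why rtol near 1e-3 sits at the edge of what fp16 can deliver is to look at fp16's machine epsilon and at how rounding error accumulates over a reduction. Below is a standalone NumPy sketch; the length n=512 is an arbitrary stand-in for a conv's reduction size (C_in * kH * kW), not a value taken from the test:

```python
import numpy as np

# float16 keeps a 10-bit mantissa, so its machine epsilon is 2**-10 ~ 1e-3.
# That alone suggests rtol much below 1e-3 cannot reliably pass for fp16.
eps16 = float(np.finfo(np.float16).eps)

# Hypothetical illustration (not the TVM test itself): accumulate a dot
# product of length n -- roughly a conv's reduction axis -- entirely in
# fp16 and compare against the fp32 reference. Error grows with n.
rng = np.random.default_rng(0)
n = 512
a = rng.random(n).astype(np.float32)
b = rng.random(n).astype(np.float32)

ref = float(np.dot(a, b))  # fp32 reference

acc = np.float16(0.0)
for x, y in zip(a.astype(np.float16), b.astype(np.float16)):
    acc = np.float16(acc + np.float16(x * y))  # every step rounds to fp16

rel_err = abs(float(acc) - ref) / abs(ref)
print(f"eps(fp16) = {eps16}, relative error after {n} accumulations = {rel_err:.4f}")
```

The relative error after a long fp16 reduction typically lands well above eps itself, which is consistent with the ~0.59 max relative difference in the mismatch report above: tight per-element tolerances simply do not match how fp16 error compounds.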

This is indeed a tricky issue. Generally I don't think there is a good way to compare fp16 vs fp32 accuracy with atol/rtol, especially for end-to-end models. To evaluate FP16 accuracy, I don't have a better suggestion than evaluating an accuracy metric on a real dataset (just as with int8).

Yes, I fully agree with using real datasets to verify accuracy. The original intent of my question was how to bound the gap between different data types in unit tests.