How can I judge whether a CUDA kernel is good enough?

For example, I'm optimizing GEMV on a GPU. How can I judge whether the kernel is efficient enough?

I found the roofline model, and I understand that a kernel's arithmetic intensity (FLOPs per byte of memory traffic) limits the FLOPS I can achieve.
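
To make my understanding concrete, here is a minimal sketch of the roofline calculation as I understand it for GEMV (y = A x, with an m-by-n matrix). The function names are my own, not from any library, and I am assuming the ideal case where every element of A, x, and y moves through DRAM exactly once, so please treat it as an illustration and not a definitive model.

```cpp
#include <algorithm>

// Attainable FLOP/s under the roofline model:
// the smaller of the compute peak and (arithmetic intensity * memory bandwidth).
// peak_flops is in FLOP/s, peak_bw in bytes/s, intensity in FLOP/byte.
double roofline_flops(double intensity, double peak_flops, double peak_bw) {
    return std::min(peak_flops, intensity * peak_bw);
}

// Arithmetic intensity of y = A * x under an assumed minimum-traffic model:
//   FLOPs: one multiply and one add per matrix element -> 2 * m * n
//   bytes: read A (m * n), read x (n), write y (m), each element once
double gemv_intensity(double m, double n, double elem_bytes) {
    double flops = 2.0 * m * n;
    double bytes = (m * n + n + m) * elem_bytes;
    return flops / bytes;
}
```

For a 4096 x 4096 single-precision matrix, `gemv_intensity(4096, 4096, 4)` comes out just under 0.5 FLOP/byte. With a hypothetical memory bandwidth of 1 TB/s, `roofline_flops` would then cap the kernel at roughly 0.5 TFLOP/s, no matter how high the compute peak is.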

But how can I tell whether the arithmetic intensity my kernel achieves is already at its upper bound?

Thank you!