[CUDA] Enable half2 in CUDA injective schedule

Yes, we should try to use a divisible block/thread size, or at least a block/thread size such that the condition always holds for the inner loop, so that it can be lifted outside the vectorized loop.
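
To make the point concrete, here is a rough pure-Python sketch (not TVM code) of when the guard can be lifted. It assumes the condition in question is a bounds check of the form `idx < n` on a flattened index `idx = outer * vec_width + lane`, where `outer` is the fused block/thread index. The function names and parameters are hypothetical, chosen only for illustration.

```python
def guard_is_uniform(n, vec_width, num_threads):
    """True if `idx < n` has the same value for every lane of each vector.

    When this holds, the check does not depend on the inner (vectorized)
    loop variable, so it can be lifted outside that loop.
    Assumes a flattened index idx = outer * vec_width + lane.
    """
    # Number of blocks needed to cover n elements (ceiling division).
    num_blocks = -(-n // (num_threads * vec_width))
    for outer in range(num_blocks * num_threads):
        base = outer * vec_width
        lanes = [base + lane < n for lane in range(vec_width)]
        # Mixed in-bounds / out-of-bounds lanes force a per-lane check.
        if any(lanes) != all(lanes):
            return False
    return True


def guard_is_redundant(n, vec_width, num_threads):
    """True if the extent divides evenly, so the check always holds."""
    return n % (num_threads * vec_width) == 0


if __name__ == "__main__":
    for n in (1024, 1022, 1023):
        print(n, guard_is_uniform(n, 2, 256), guard_is_redundant(n, 2, 256))
```

With a vector width of 2 (as for half2) and 256 threads, an extent of 1024 needs no check at all, 1022 still needs a check but it is uniform across both lanes, and 1023 would need a per-lane check inside the vectorized loop.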