Yes, we should try to use a divisible block/thread size, or at least a block/thread size such that the condition always holds for the inner loop, so that the check can be lifted outside the vectorized loop.
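To make this concrete, here is a rough sketch in plain Python that models the loop nest. It is not the output of any particular compiler, and the names `guarded_copy`, `hoisted_copy`, `blocks`, `threads`, and `vec` are made up for illustration. It assumes the condition in question is a bounds check of the form `index < n`. When `n` is a multiple of the vector width, the check gives the same answer for every lane of the inner loop, so it can be evaluated once per vector instead of once per lane.

```python
def guarded_copy(src, n, blocks, threads, vec):
    """Baseline: the bounds check sits inside the innermost (vector) loop."""
    dst = [0] * n
    for bx in range(blocks):
        for tx in range(threads):
            base = (bx * threads + tx) * vec
            for v in range(vec):
                # Per-lane guard: this is what blocks vectorization.
                if base + v < n:
                    dst[base + v] = src[base + v]
    return dst


def hoisted_copy(src, n, blocks, threads, vec):
    """Guard lifted out of the vector loop; assumes n % vec == 0."""
    # Assumption: n is a multiple of vec. Since base is also a multiple
    # of vec, base < n implies base + vec <= n, so all lanes are in bounds.
    assert n % vec == 0, "guard is only uniform when vec divides n"
    dst = [0] * n
    for bx in range(blocks):
        for tx in range(threads):
            base = (bx * threads + tx) * vec
            if base < n:  # one check per vector, outside the inner loop
                # The whole vector is in bounds, so copy it in one go.
                dst[base:base + vec] = src[base:base + vec]
    return dst


if __name__ == "__main__":
    data = list(range(24))
    # 2 blocks * 4 threads * 4 lanes = 32 slots for 24 elements, so the
    # grid overshoots, but 24 % 4 == 0 keeps the guard uniform per vector.
    print(hoisted_copy(data, 24, 2, 4, 4) == guarded_copy(data, 24, 2, 4, 4))
```

If the block/thread size divides the extent exactly (for example 32 elements with the same 2 x 4 x 4 launch), the `if base < n` check is always true and can be dropped entirely.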