How to bind the 'L' dim to blockIdx in matmul

In the auto-tuning matmul example, the N and M dims can be bound to blockIdx and threadIdx, but the L dim can only be bound to threadIdx, which means the reduction over L can only be computed within a single block. If I want to split the L-dim reduction across multiple blocks, how can I do that?
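
To make the goal concrete, here is a minimal sketch in plain Python (not TVM or CUDA code) of the computation I have in mind. The function name `matmul_split_l` and the `num_blocks` parameter are made up for illustration. Each iteration of the outer `b` loop stands in for one block that reduces only its own slice of L into a partial result, and a second step sums the partial results.

```python
def matmul_split_l(A, B, num_blocks):
    """Compute C = A @ B with the L-dim reduction split into num_blocks chunks.

    A is N x L and B is L x M, both given as lists of lists.
    The loop over `b` models independent blocks (hypothetical mapping to blockIdx).
    """
    n, l, m = len(A), len(B), len(B[0])
    chunk = (l + num_blocks - 1) // num_blocks  # ceiling division

    # Step 1: each "block" reduces only its own slice [lo, hi) of the L dim.
    partial = [[[0] * m for _ in range(n)] for _ in range(num_blocks)]
    for b in range(num_blocks):
        lo = b * chunk
        hi = min(lo + chunk, l)  # trailing blocks may get an empty slice
        for i in range(n):
            for j in range(m):
                acc = 0
                for k in range(lo, hi):
                    acc += A[i][k] * B[k][j]
                partial[b][i][j] = acc

    # Step 2: combine the per-block partial sums into the final result.
    return [
        [sum(partial[b][i][j] for b in range(num_blocks)) for j in range(m)]
        for i in range(n)
    ]


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(matmul_split_l(A, B, num_blocks=2))
```

On a GPU, step 2 would presumably need either a second kernel or atomic adds into the output, since blocks cannot directly share partial sums. What I am asking is how to express this split in the schedule.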