How can I avoid allocating the wrong number of registers in CUDA scheduling in this example?

Any advice here? @FrozenGene