Sorry, I checked the code and found that kMaxRegistersPerBlock is actually used here, as max_local_memory_per_block. But I think this is a bug: the local_memory_per_block checked in VerifyGPUCode is not the same thing as registers.
For CUDA, it does not matter, because kMaxRegistersPerBlock returns a very large value similar to kMaxSharedMemoryPerBlock. So this check just does nothing.
For your AMD GPU, I suggest setting it to 65536 (the same as kMaxSharedMemoryPerBlock). If you use a value that is too small, such as the 1024 in your case, VerifyGPUCode will filter out many good candidates.
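To make the effect of the limit concrete, here is a toy sketch in plain Python. It is not TVM's actual VerifyGPUCode: the Candidate class, the candidate names, the byte counts, and the `<=` comparison are all made up for illustration. It only shows how a small limit throws away candidates that a larger limit would keep.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical schedule candidate; the numbers below are invented."""
    name: str
    local_memory_per_block: int  # bytes of local memory used per block


def filter_candidates(candidates, max_local_memory_per_block):
    """Keep only candidates within the limit (assumed check, not TVM's code)."""
    return [
        c for c in candidates
        if c.local_memory_per_block <= max_local_memory_per_block
    ]


CANDIDATES = [
    Candidate("small_tile", 512),
    Candidate("medium_tile", 4096),
    Candidate("large_tile", 32768),
]

# With a limit of 1024, only the smallest candidate survives.
kept_strict = filter_candidates(CANDIDATES, 1024)
# With a limit of 65536, every candidate in this toy list survives.
kept_loose = filter_candidates(CANDIDATES, 65536)

print([c.name for c in kept_strict])
print([c.name for c in kept_loose])
```

With the limit at 1024 the search never gets to measure the larger tiles, which is why the tuning result can be much worse than expected.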
To summarize, we can
- use NHWC layout with winograd by copying op strategy from CUDA.
- use n_trials > 20000
- set kMaxRegistersPerBlock to 65536