In CUDA’s default injective schedule, the thread count is set to tvm.target.current_target(allow_none=False).max_num_threads. This yields a huge thread count per block (1024 on the M60 GPU I am testing on), which is not optimal for all ops.
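For reference, the relevant part of the default CUDA injective schedule looks roughly like this (paraphrased and trimmed from topi/python/topi/cuda/injective.py):

```python
import tvm

def schedule_injective(outs):
    # Paraphrased sketch: fuse all axes, split by max_num_threads,
    # and bind the two halves to the CUDA grid and block.
    outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
    s = tvm.create_schedule([x.op for x in outs])
    for out in outs:
        fused = s[out].fuse(*s[out].op.axis)
        num_thread = tvm.target.current_target(allow_none=False).max_num_threads
        bx, tx = s[out].split(fused, factor=num_thread)
        s[out].bind(bx, tvm.thread_axis("blockIdx.x"))
        s[out].bind(tx, tvm.thread_axis("threadIdx.x"))
    return s
```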
For example, I am currently testing a softmax with input shape (10, 12, 512, 512), and found that 64 threads per block saved over 10 ms compared to the default 1024.
On the flip side, I have found other ops that benefit from having this large thread count.
What do you think is the best way to resolve this? Can we auto-tune this value per op that uses the injective schedule? Should we allow ops to pass in their ideal thread count?
I think the right approach is to parameterize and auto-tune the thread count for ops that use the default injective schedule. Is there a good way to do this? Otherwise, we could copy the injective schedule into softmax.py and do the auto-tuning there, but that’s less than ideal.
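To make the idea concrete, here is a minimal sketch of what I mean, written as a standalone AutoTVM template with the thread count as a knob (the knob name and candidate values are just guesses):

```python
import tvm
from tvm import autotvm

@autotvm.template
def tunable_injective(shape):
    # Hypothetical standalone template for illustration only; a real fix
    # would put the knob inside topi's schedule_injective instead of
    # hard-coding max_num_threads.
    A = tvm.placeholder(shape, name="A")
    B = tvm.compute(shape, lambda *i: A(*i) + 1, name="B")
    s = tvm.create_schedule(B.op)

    cfg = autotvm.get_config()
    # Tune threads-per-block rather than always using the device maximum.
    cfg.define_knob("num_thread", [32, 64, 128, 256, 512, 1024])

    fused = s[B].fuse(*s[B].op.axis)
    bx, tx = s[B].split(fused, factor=cfg["num_thread"].val)
    s[B].bind(bx, tvm.thread_axis("blockIdx.x"))
    s[B].bind(tx, tvm.thread_axis("threadIdx.x"))
    return s, [A, B]
```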
There is no deep reason why we use 1024 threads per block on CUDA. I’m +1 for making this number tunable, as long as the default stays 1024 (to avoid a perf regression).
@vinx13 I finally got around to trying to create a tunable schedule for CUDA softmax, but am seeing a strange error when I try to tune it: “cannot find workload in attribute of this schedule”.
My branch is here. I added the necessary lines to topi_integration and relay_integration, and updated nn/softmax.py to have the compute decorated with @tvm.target.generic_func. I then updated cuda/softmax.py to register the schedule with AutoTVM, roughly as follows (the exact code is on the branch):
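```python
# Paraphrased from memory; the exact block is on the branch.
from tvm import autotvm
from topi import generic

@autotvm.register_topi_schedule(generic.schedule_softmax, ["cuda", "gpu"], "direct")
def schedule_softmax(cfg, outs):
    # ... existing CUDA softmax schedule body, now driven by cfg knobs ...
    ...
```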
From some initial investigation, the functions registered in register_topi_compute (like config_dispatcher and template_call) seem never to be called, so the workload attribute never gets set. However, I can’t figure out why they’re not being called.
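For context, the error message itself comes from the schedule-side wrapper in autotvm’s topi_integration, which (paraphrased) does:

```python
# Paraphrased from tvm/autotvm/task/topi_integration.py: the wrapped
# generic schedule looks for the 'workload' attr that template_call
# should have stamped onto the compute's output op, and fails otherwise.
workload = get_workload(outs)
if workload is None:
    raise RuntimeError("Cannot find workload in attribute of this schedule")
```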
This is an issue with compute registration. Although you declared softmax as a generic func, the compute for softmax in Relay is defined on the C++ side, so the Python generic func is never invoked. See my patch below.
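In short, the fix is to register the softmax compute from Python so the autotvm-wrapped generic func actually runs. A simplified sketch of the idea (not the actual patch; the override level is a guess):

```python
# Simplified illustration, not the actual patch: register an FTVMCompute
# for nn.softmax from Python, at a higher level so it takes precedence
# over the C++ registration (the level value here is a guess).
import topi
from tvm.relay.op import op as reg

@reg.register_compute("nn.softmax", level=15)
def compute_softmax(attrs, inputs, out_type, target):
    return [topi.nn.softmax(inputs[0], axis=int(attrs.axis))]
```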