Currently, autotvm does not work out of the box on Windows. With a few small code changes you can make it work by turning off forking in LocalExecutor (local_executor.py), but it is VERY slow because everything executes synchronously. I wanted something close to the speed of Linux.
I’m not submitting this as a formal RFC or PR because:
- I’m new to Python and it shows
- I’m sensitive to the fact that these changes put burden on testing and maintainability.
- I’m exhausted at the moment. The code changes look small, but it was Mt. Everest for me.
The bulk of the problems with running much of TVM’s Python code on Windows is that Windows doesn’t fork, and the multiprocessing library can’t pickle important things, like locally defined functions.
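To make the pickling problem concrete, here’s a small stdlib-only sketch (not TVM code) showing that pickle serializes functions by reference, so lambdas and closures fail — exactly the kind of callables multiprocessing needs to ship to spawned workers on Windows. (dill, which pathos uses, serializes the function body instead and handles these cases.)

```python
import pickle

def make_closure():
    offset = 10
    return lambda x: x + offset  # a closure: pickle can't serialize this

def picklable(obj):
    """Return True if obj survives a pickle round-trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

print(picklable([1, 2, 3]))      # plain data pickles fine
print(picklable(make_closure())) # False: lambdas/closures fail,
                                 # which breaks multiprocessing on spawn
```

This is why turning off forking alone isn’t enough: Windows’ spawn start method has to pickle the work function to send it to the child process.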
For Windows, I used the pathos Python library, which uses dill instead of pickle. This overcomes all the pickling errors. Also, for better or worse, I used a process pool in LocalExecutor instead of spawning a Process per task: Process.start() was taking up to 2 seconds, so most of the parallelism was lost.
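The pool-vs-Process trade-off is easy to illustrate with the stdlib alone (this is a generic sketch, not my actual LocalExecutor change, and `run_tasks` is a made-up helper): a long-lived pool pays the process startup cost once and reuses the workers, instead of paying it on every task.

```python
import math
from multiprocessing import Pool

def run_tasks(values):
    # One long-lived pool amortizes process startup across many tasks,
    # unlike creating and start()-ing a fresh Process per task
    # (which is what cost up to ~2 s each on Windows).
    with Pool(processes=4) as pool:
        return pool.map(math.sqrt, values)

if __name__ == "__main__":
    print(run_tasks([0, 1, 4, 9]))  # [0.0, 1.0, 2.0, 3.0]
```

pathos’ ProcessPool has essentially the same shape, but with dill underneath so non-trivial functions survive the trip to the workers.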
There were also some other small, weird platform differences, but this one took me at least 30 hours to figure out (not joking) — though like I said, I’m new to Python:
I still don’t know what the underlying issue is, but before any code changes, the RPCSession would never get released, which meant the native reference count was never decremented, leaving the socket open, which in turn caused the RPC server to hit its timeout. My changes aren’t a fully correct fix, but they resolved everything.
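The symptom is at least consistent with a stray Python reference keeping the session alive. Here’s a stdlib-only sketch of the mechanism (`FakeSession` is a made-up stand-in, not TVM’s RPCSession): as long as anything still references the object, whatever native handles it owns stay open.

```python
import gc
import weakref

class FakeSession:
    """Stand-in for an RPC session that owns a native handle/socket."""
    pass

sess = FakeSession()
alive = weakref.ref(sess)   # lets us observe when the session is freed

stray = [sess]              # e.g. a reference captured by a worker or pool
del sess
assert alive() is not None  # still referenced: the "socket" stays open

stray.clear()               # drop the stray reference
gc.collect()
assert alive() is None      # session finally released; no server timeout
```

In CPython the object is actually freed the moment the last reference goes away; the `gc.collect()` is only there to also cover reference cycles.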
I’m least proud of what I had to do in xgboost_cost_model.py, but it now runs in parallel and hits 85% utilization on my 16-core Intel CPU.
Anyways, here’s the commit on my fork. I “over-commented” things to explain some of my thought process. I also tried my best to keep the Linux code behavior the same.
I’ll probably write up a “TVM on Windows” tutorial eventually.
Thanks again for this awesome project! I’m really happy I can run autotvm with CUDA on Windows now!