Why is auto-tuning with ResNet failing at task 1?

I’m following the “Compiling and Optimizing a Model with TVMC” tutorial, and when I try to auto-tune the ResNet model I get the error below. I’m running macOS Mojave (version 10.14.6) and have installed the Python dependencies needed for auto-tuning.

[Task 1/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/40) | 0.00 s
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/Cellar/python@3.9/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/Cellar/python@3.9/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)

I’ll check this out on macOS. Sadly, we may be running into a thread safety bug that is present on non-Linux platforms.

I’m running into a similar issue on Big Sur. I can’t reproduce your exact environment because I don’t have access to a Mojave machine, but this gist includes the stack trace from my run of the same steps. Maybe @leandron has some insight into what’s going on. My only advice for now is to run Docker Desktop on your Mac, if that’s a possibility, and do your TVM work inside a Docker container, which gives you a Linux environment with far better support. You will likely need to increase the memory available to Docker to build TVM inside a container.

I’ll investigate and get back to you on this. Thanks for reporting, @Codite @hogepodge.

I think this is because of the way multiprocessing works on Macs.

See multiprocessing — Process-based parallelism — Python 3.9.5 documentation

To fix this, simply add these lines:

import multiprocessing as mp
mp.set_start_method("fork")

Thanks, Andrew. Is this a Mac-specific fix, or will it work across all platforms if it’s added to tvmc? There’s similar behavior if we try to run any auto-scheduling jobs in Jupyter notebooks. We need to wrap them in an `if __name__ == "__main__":` guard to get them to work properly.

One more comment, from that document you listed:

Changed in version 3.8: On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.
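
To check what a given interpreter offers before changing anything, a minimal sketch along these lines may help. It uses only the standard library; `describe_start_methods` is a hypothetical helper name and not part of tvmc.

```python
import multiprocessing as mp
import sys


def describe_start_methods():
    """Report the platform and the multiprocessing start methods on offer."""
    return {
        "platform": sys.platform,
        # allow_none=True avoids fixing the default as a side effect; the
        # value is None if no start method has been selected yet.
        "configured": mp.get_start_method(allow_none=True),
        # The first entry in this list is the platform default.
        "available": mp.get_all_start_methods(),
    }


if __name__ == "__main__":
    print(describe_start_methods())
```

On macOS with Python 3.8 or later, the quoted documentation implies the first entry of `available` should be `spawn`.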

I’m not fully versed in the underlying issue behind this, but it seems like we might be trading one problem for another.

A couple more items to follow up with: I attempted to add this patch to tvmc and ran into the same issue as before. There’s also a nice article about why macOS requires the use of spawn instead of fork.

I was able to get it to work by adding a slight change to Andrew’s solution:

import multiprocessing as mp
mp.set_start_method("fork", force=True)
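
One plausible reason `force=True` makes the difference: `set_start_method` raises `RuntimeError` once a start method has already been fixed in the process, and `force=True` overrides that. The sketch below shows that behavior in isolation; `switch_start_method` is a hypothetical helper, and whether this is exactly what happens inside tvmc is an assumption.

```python
import multiprocessing as mp


def switch_start_method(method):
    """Select a start method, forcing it only if one was already fixed."""
    try:
        mp.set_start_method(method)
        return "set"
    except RuntimeError:
        # Raised when the context has already been set, for example by an
        # earlier set_start_method call elsewhere in the process.
        mp.set_start_method(method, force=True)
        return "forced"


if __name__ == "__main__":
    # "spawn" is available on every platform; "fork" exists only on Unix.
    print(switch_start_method("spawn"))
    print(mp.get_start_method())
```

Forcing the method does not make `fork` any safer on macOS; it only bypasses the "already set" check.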

Yeah, I’m not super well versed in this issue either, but I might read a little more into it this weekend.

I will say that I haven’t been able to find another workaround (though I haven’t tried very hard), and using “spawn” requires some TVM objects to be picklable, which they are not.
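
To illustrate the pickling constraint: with “spawn”, arguments sent to a child process are serialized with `pickle`, so anything that cannot be pickled fails at that step. The sketch below uses a hypothetical `FakeHandle` class as a stand-in for an object wrapping a native resource; it is not a real TVM type.

```python
import pickle
import threading


class FakeHandle:
    """Hypothetical stand-in for an object that wraps a native resource."""

    def __init__(self):
        # Lock objects cannot be pickled, much like raw native handles.
        self._lock = threading.Lock()


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


if __name__ == "__main__":
    print(is_picklable({"tile_x": 8}))
    print(is_picklable(FakeHandle()))
```

A check like this can help narrow down which object breaks under “spawn” before digging into the tuning code itself.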

For most systems we care about (Unix/Linux), fork is the default behavior, so this will not change anything on those systems. For Mac it seems to work OK, but I’m pretty sure it can still cause errors there. For example, at runtime I hit a bug where we try to link to the same dynamic library twice, which causes an error. You can get around this by doing things like export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES, and then it seems to not explode.

I have no idea what happens on Windows, and I have little information on whether Windows follows the fork-exec model assumed by the Python multiprocessing framework.

I’ve sent a patch up to attempt to address this issue. I would definitely appreciate feedback on whether this is an appropriate solution to the problem. It both sets the start method and warns if the environment variable isn’t set. Switch threading model to `fork` on macOS by hogepodge · Pull Request #8347 · apache/tvm · GitHub
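
As a rough illustration of that idea (set the start method on macOS, warn if the environment variable is missing), here is a minimal sketch. It is not the code from the pull request; `configure_macos_fork` and its parameters are hypothetical and exist only to keep the example testable.

```python
import multiprocessing as mp
import os
import sys
import warnings

FORK_SAFETY_VAR = "OBJC_DISABLE_INITIALIZE_FORK_SAFETY"


def configure_macos_fork(platform=sys.platform, environ=os.environ,
                         set_method=mp.set_start_method):
    """On macOS, select "fork" and warn if the fork-safety variable is unset.

    Returns True if the start method was changed, False on other platforms.
    """
    if platform != "darwin":
        return False
    set_method("fork", force=True)
    # This only checks the variable; it does not try to set it.
    if environ.get(FORK_SAFETY_VAR) != "YES":
        warnings.warn(
            f"{FORK_SAFETY_VAR}=YES is not set; forked workers may crash."
        )
    return True


if __name__ == "__main__":
    print(configure_macos_fork())
```

The platform, environment, and setter are passed in as parameters so the logic can be exercised on any machine without touching the real start method.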

One more bit of feedback: apparently there’s active work to refactor away from using fork. It’s not complete, but here’s an example of the RPC worker switching to POpenWorker. I don’t have a deep understanding of the level of effort needed to do the same thing in tvmc, but it’s safe to say the approach I took in that pull request isn’t the correct or sustainable one.

@tkonolige probably also has things to say about this. The TL;DR is that we need to change quite a few things to make this work. The hack up there is actually pretty unsafe and will cause a ton of bugs down the line.

Draft PR here should allow us to avoid this bad hack: