Trouble with RPC session

I have some troubles with RPC session on Windows.

Currently when RPC session is created it will be shut down automatically when reference count goes to 0. However, relying on this seems to lead to some inconsistent results. Sometimes python fails to delete the session and it gets stuck untill it is killed by server timeout. And because of this, tuning process becomes incredibly slow as it is impossible to create a new session for LocalRunnner until previous session times out. If the runner’s timeout is 20s then each measurement will take 20s.

I did some tests on different python versions and it seems like from python 3.7.2 till 3.8.10 the RPC session is not deleted properly.

I think that TVM should not rely that much on reference count, at least not for RPC session. There should be a close method and it should be have enter and leave methods so that it can be used as a context object.

Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import gc
>>> from tvm import rpc
>>> from tvm.rpc.tracker import Tracker
>>> from tvm.rpc.server import Server
>>> device_key="test"
>>> tracker=Tracker(host="127.0.0.1")
2022-11-27 14:45:49.046 INFO bind to 127.0.0.1:9190
>>> server=Server(host="127.0.0.1",key=device_key,tracker_addr=("127.0.0.1",tracker.port))
2022-11-27 14:45:49.519 INFO bind to 127.0.0.1:9091
>>> def check_remote():
...   t = rpc.connect_tracker("127.0.0.1", tracker.port)
...   r = t.request(device_key)
...   return r
...
>>> r = check_remote()
2022-11-27 14:45:49.541 INFO connection from ('127.0.0.1', 8593)
>>> print(sys.getrefcount(r))
2
>>> del r
2022-11-27 14:45:50.021 INFO Finish serving ('127.0.0.1', 8593)
>>> gc.collect(0)
28
>>> gc.collect(1)
0
>>> gc.collect(2)
55
>>> del server
2022-11-27 14:45:50.078 WARNING TCPSocket: ('127.0.0.1', 8591): Error in RPC Tracker: [WinError 10054] An existing connection was forcibly closed by the remote host
>>> del tracker
2022-11-27 14:45:50.197 INFO Terminating Tracker Server...
>>> gc.collect()
0
>>> exit()

Python 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import gc
>>> from tvm import rpc
>>> from tvm.rpc.tracker import Tracker
>>> from tvm.rpc.server import Server
>>> device_key="test"
>>> tracker=Tracker(host="127.0.0.1")
2022-11-27 14:40:25.606 INFO bind to 127.0.0.1:9190
>>> server=Server(host="127.0.0.1",key=device_key,tracker_addr=("127.0.0.1",tracker.port))
2022-11-27 14:40:26.085 INFO bind to 127.0.0.1:9091
>>> def check_remote():
...   t = rpc.connect_tracker("127.0.0.1", tracker.port)
...   r = t.request(device_key)
...   return r
...
>>> r = check_remote()
2022-11-27 14:40:26.108 INFO connection from ('127.0.0.1', 8362)
>>> print(sys.getrefcount(r))
3
>>> del r
>>> gc.collect(0)
54
2022-11-27 14:40:26.595 INFO Finish serving ('127.0.0.1', 8362)
>>> gc.collect(1)
0
>>> gc.collect(2)
55
>>> del server
2022-11-27 14:40:26.653 WARNING TCPSocket: ('127.0.0.1', 8360): Error in RPC Tracker: [WinError 10054] An existing connection was forcibly closed by the remote host
>>> del tracker
2022-11-27 14:40:26.772 INFO Terminating Tracker Server...
>>> gc.collect()
27
>>> exit()

Python 3.7.15 (default, Nov 24 2022, 18:44:54) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import gc
>>> from tvm import rpc
>>> from tvm.rpc.tracker import Tracker
>>> from tvm.rpc.server import Server
>>> device_key="test"
>>> tracker=Tracker(host="127.0.0.1")
2022-11-27 14:25:54.190 INFO bind to 127.0.0.1:9190
>>> server=Server(host="127.0.0.1",key=device_key,tracker_addr=("127.0.0.1",tracker.port))
2022-11-27 14:25:54.662 INFO bind to 127.0.0.1:9091
>>> def check_remote():
...   t = rpc.connect_tracker("127.0.0.1", tracker.port)
...   r = t.request(device_key)
...   return r
...
>>> r = check_remote()
2022-11-27 14:25:54.683 INFO connection from ('127.0.0.1', 8104)
>>> print(sys.getrefcount(r))
3
>>> del r
>>> gc.collect(0)
54
2022-11-27 14:25:55.169 INFO Finish serving ('127.0.0.1', 8104)
>>> gc.collect(1)
0
>>> gc.collect(2)
55
>>> del server
2022-11-27 14:25:55.225 WARNING TCPSocket: ('127.0.0.1', 8102): Error in RPC Tracker: [WinError 10054] An existing connection was forcibly closed by the remote host
>>> del tracker
2022-11-27 14:25:55.343 INFO Terminating Tracker Server...
>>> gc.collect()
27
>>> exit()

Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import gc
>>> from tvm import rpc
>>> from tvm.rpc.tracker import Tracker
>>> from tvm.rpc.server import Server
>>> device_key="test"
>>> tracker=Tracker(host="127.0.0.1")
2022-11-27 15:09:58.026 INFO bind to 127.0.0.1:9190
>>> server=Server(host="127.0.0.1",key=device_key,tracker_addr=("127.0.0.1",tracker.port))
2022-11-27 15:09:58.601 INFO bind to 127.0.0.1:9091
>>> def check_remote():
...   t = rpc.connect_tracker("127.0.0.1", tracker.port)
...   r = t.request(device_key)
...   return r
...
>>> r = check_remote()
2022-11-27 15:09:58.626 INFO connection from ('127.0.0.1', 8914)
>>> print(sys.getrefcount(r))
3
>>> del r
>>> gc.collect(0)
49
2022-11-27 15:09:59.148 INFO Finish serving ('127.0.0.1', 8914)
>>> gc.collect(1)
0
>>> gc.collect(2)
0
>>> del server
2022-11-27 15:09:59.209 WARNING TCPSocket: ('127.0.0.1', 8912): Error in RPC Tracker: [WinError 10054] An existing connection was forcibly closed by the remote host
>>> del tracker
2022-11-27 15:09:59.321 INFO Terminating Tracker Server...
>>> gc.collect()
22
>>> exit()

Python 3.8.10 (default, May 19 2021, 13:12:57) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import gc
>>> from tvm import rpc
>>> from tvm.rpc.tracker import Tracker
>>> from tvm.rpc.server import Server
>>> device_key="test"
>>> tracker=Tracker(host="127.0.0.1")
2022-11-27 15:23:13.536 INFO bind to 127.0.0.1:9190
>>> server=Server(host="127.0.0.1",key=device_key,tracker_addr=("127.0.0.1",tracker.port))
2022-11-27 15:23:14.063 INFO bind to 127.0.0.1:9091
>>> def check_remote():
...   t = rpc.connect_tracker("127.0.0.1", tracker.port)
...   r = t.request(device_key)
...   return r
...
>>> r = check_remote()
2022-11-27 15:23:14.091 INFO connection from ('127.0.0.1', 9099)
>>> print(sys.getrefcount(r))
2
>>> del r
2022-11-27 15:23:14.684 INFO Finish serving ('127.0.0.1', 9099)
>>> gc.collect(0)
28
>>> gc.collect(1)
0
>>> gc.collect(2)
0
>>> del server
2022-11-27 15:23:14.749 WARNING TCPSocket: ('127.0.0.1', 9097): Error in RPC Tracker: [WinError 10054] An existing connection was forcibly closed by the remote host
>>> del tracker
2022-11-27 15:23:14.862 INFO Terminating Tracker Server...
>>> gc.collect()
0
>>> exit()

So, why the session is not automatically deleted without invoking GC when it looks like there should be no circular dependencies.

I looked deeper and found that there is one.

    def check_remote():
       t = rpc.connect_tracker("127.0.0.1", tracker.port)
       r = t.request(device_key)
       return r

The code is modeled after autotvm.measure.measure_methods.request_remote.

The important part is that we first assign r = t.request(device_key) and only then return the session.

During first line rpc.connect_tracker("127.0.0.1", tracker.port) there was a ValueError exception inside socket.getaddrinfo during enum checks (sock_kind is 0). The exception is handled internally so we don’t see it.

However, in 3.7.2 this change was introduced: bpo-34536. It created a circular dependency in traceback object which eventually leads to our frame in check_remote. But the frame holds a reference to our session (local variable r).

So the session cannot be deleted without invoking GC.

The problem was fixed in 3.8.10 with bpo-42248.

This just shows that it is unsafe to rely on reference counter, some unrelated change can easily break somehing.