I am autotuning the TVM testing MobileNet. The main application (autotuning loop and building) and the RPC tracker run on one server, and multiple RPC servers for remote execution/measurement run on two other physical servers with the same GPU model.
Occasionally I get this error, which then makes the current task fail:
File "/usr/tvm/python/tvm/autotvm/measure/measure_methods.py", line 235, in get_build_kwargs
remote = request_remote(self.key, self.host, self.port)
File "/usr/tvm/python/tvm/autotvm/measure/measure_methods.py", line 535, in request_remote
File "/usr/tvm/python/tvm/rpc/client.py", line 329, in request
key, max_retry, str(last_err)))
RuntimeError: Cannot request k80 after 5 retry, last_error:Traceback (most recent call last):
[bt] (4) /usr/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7f721515d935]
[bt] (3) /usr/tvm/build/libtvm.so(+0x9b5eb4) [0x7f72151beeb4]
[bt] (2) /usr/tvm/build/libtvm.so(+0x9b3677) [0x7f72151bc677]
[bt] (1) /usr/tvm/build/libtvm.so(+0x9aeb74) [0x7f72151b7b74]
[bt] (0) /usr/tvm/build/libtvm.so(+0x153863) [0x7f721495c863]
File "/usr/tvm/src/runtime/rpc/rpc_socket_impl.cc", line 80
TVMError: URL server:9104 cannot find server that matches key=client:k80:0.7114652791680667 -timeout=60
I get a lot of these messages in the log of the RPC server:
mismatch key from ('', 47672)
no incoming connections, regenerate key ...
However, this does not happen when I only use RPC servers on one physical server. Maybe all of these messages are related.
Does anyone know what I might do to fix this?
What is the reason that match keys expire? Is it to prevent clients from hogging servers even if they’re not using them? If so, could the unmatch_timeout of the server be made configurable?
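One possible stopgap, until the cause is clear, would be to retry the failing request with a growing delay instead of letting the task fail. The following is only a minimal sketch: retry_with_backoff and its parameters are hypothetical names and not part of TVM, request_fn stands in for whatever call raises the RuntimeError shown above (for example a lambda around request_remote(key, host, port)), and whether retrying actually avoids the key mismatch is untested.

```python
import time


def retry_with_backoff(request_fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request_fn() until it succeeds or max_attempts is reached.

    request_fn is a hypothetical zero-argument callable, e.g. a lambda
    wrapping the request_remote(key, host, port) call from the traceback.
    """
    last_err = None
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RuntimeError as err:  # the error type shown in the traceback
            last_err = err
            # Wait 1x, 2x, 4x ... base_delay, but not after the final attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        "request failed after %d attempts: %s" % (max_attempts, last_err)
    )
```

The sleep function is passed in as a parameter only so the helper can be exercised without real waiting. This would merely paper over the failure; it does not explain why keys stop matching when the RPC servers are spread over two machines.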