RPC does not work with SSH tunnel

lmxyy · May 19, 2021, 1:57pm

Hi, I just found that my RPC setup does not work through an SSH tunnel. I ran the RPC tracker on my own Mac:

python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190

and created the SSH tunnel on my Jetson Nano, which is on the same local area network:

ssh -C -N -L 9190:0.0.0.0:9190 limuyang@192.168.99.136  #192.168.99.136 is the intranet IP of my Mac

and ran the RPC server on it with the mapped port:

python3 -m tvm.exec.rpc_server --tracker=localhost:9190 --key=nano --no-fork

Then I ran

python -m tvm.exec.query_rpc_tracker --host 0.0.0.0 --port 9190

on my Mac and found my Nano was successfully registered to the tracker as follows:

Tracker address 0.0.0.0:9190

Server List
----------------------------
server-address  key
----------------------------
127.0.0.1:39544 server:nano
----------------------------

Queue Status
----------------------------
key    total  free  pending
----------------------------
nano   1      1     0      
----------------------------

However, when I checked the remote device with the following Python script on my Mac

from tvm import auto_scheduler

if __name__ == '__main__':
    auto_scheduler.measure.check_remote('nano', '0.0.0.0', 9190, 1)

I found my Mac could not connect to my Nano. On my Mac, it said

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/conda/envs/inference/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/opt/conda/envs/inference/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/tvm-0.8.dev1010+g711a603db-py3.9-linux-x86_64.egg/tvm/auto_scheduler/utils.py", line 389, in _check
    request_remote(device_key, host, port, priority)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/tvm-0.8.dev1010+g711a603db-py3.9-linux-x86_64.egg/tvm/auto_scheduler/utils.py", line 359, in request_remote
    remote = tracker.request(device_key, priority=priority, session_timeout=timeout)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/tvm-0.8.dev1010+g711a603db-py3.9-linux-x86_64.egg/tvm/rpc/client.py", line 400, in request
    raise RuntimeError(
RuntimeError: Cannot request nano after 5 retry, last_error:Traceback (most recent call last):
  3: TVMFuncCall
  2: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  1: tvm::runtime::RPCClientConnect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMArgs)
  0: tvm::runtime::RPCConnect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMArgs)
  File "/home/lmxyy1999/tvm/src/runtime/rpc/rpc_socket_impl.cc", line 73
TVMError: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (sock.Connect(addr)) is false: Connect to 127.0.0.1:9090 failed

and on my Nano, the server log was

INFO:RPCServer:bind to 0.0.0.0:9090
INFO:RPCServer:no incoming connections, regenerate key ...
INFO:RPCServer:no incoming connections, regenerate key ...
INFO:RPCServer:no incoming connections, regenerate key ...
INFO:RPCServer:no incoming connections, regenerate key ...
INFO:RPCServer:no incoming connections, regenerate key ...

lmxyy · May 19, 2021, 2:12pm

But if I ran the RPC server without the SSH tunnel, using the following command:

python3 -m tvm.exec.rpc_server --tracker=192.168.99.136:9190 --key=nano --no-fork

and checked the connection on my Mac, I found it worked very well.

INFO:RPCServer:bind to 0.0.0.0:9090
INFO:RPCServer:connection from ('192.168.99.136', 52943)

Since I need to set up the RPC tracker on GCP, where exposing ports always requires a lot of permissions, I want to run RPC through an SSH tunnel. Did I set up the SSH tunnel correctly? Has anyone run into this issue before? How can I solve it?

areusch · May 19, 2021, 4:06pm

The problem here is that the tracker hands the client a server-address, and the client needs to be able to connect to it. When you use SSH as a bridge, 127.0.0.1 likely means something different to the client than to the RPC tracker. The RPC server does provide --custom-addr to work around this, but unfortunately it only covers the IP and not the port.
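To see whether an address handed out by the tracker is actually reachable from the client machine, a plain TCP probe is enough. The following is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper name and is not part of TVM.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds from this machine."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("127.0.0.1", 9090)` on the client machine, using the address from the error message above, shows whether anything is listening there from the client's point of view.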

Is it possible to accomplish your task without using the RPC tracker? It should be possible to port-forward the RPC server alone without issue.

lmxyy · May 19, 2021, 5:12pm

Hi, I think the SSH tunnel method

ssh -C -N -L 9190:0.0.0.0:9190 limuyang@192.168.99.136  #192.168.99.136 is the intranet IP of my Mac

I used has already forwarded the port of my Nano to my Mac. Visiting localhost:9190 should be equivalent to visiting 192.168.99.136:9190. So I do not quite understand what you mean by “when you use SSH as a bridge, 127.0.0.1 likely means something different to the client than to the RPC tracker.”

Besides, I do not understand how --custom-addr works. In what case should I provide this argument?

I also do not know how to accomplish my task without using the RPC tracker. Could you elaborate on “It should be possible to port-forward the RPC server alone without issue.”?

Thank you very much!

areusch · May 19, 2021, 5:42pm

When you connect to an RPC server via the RPC tracker, you ask the RPC tracker to tell you the IP address and port of an RPC server that meets your criteria (e.g. key). Because the RPC tracker in this instance lives behind an SSH proxy, the IP address given to the client by the RPC tracker may not be meaningful. For example, if the tracker lives on a GCP network with IP address 192.168.1.2, but the client is not in GCP, the tracker can’t tell the client to connect to address 192.168.1.3, because the client is on a different network.

I think the best solution for the GCP issue would be to modify TrackerSession to allow you to override the IP/port used to connect. You could pass in a callback function e.g.

def override_rpc_server_ip(ip_from_tracker : str, port_from_tracker : int) -> (str, int):
  # Do any SSH port-forwarding needed
  return ('127.0.0.1', <local_port>)

I don’t have bandwidth for this now, but would be happy to review any PR that did that.
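A runnable version of such a callback might look like the sketch below. It only covers the callback side: TrackerSession accepting a callback is the proposal above, not something the thread confirms exists. `make_override`, the `forwards` table, and local port 19090 are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Dict, Tuple

Address = Tuple[str, int]


def make_override(forwards: Dict[Address, int]) -> Callable[[str, int], Address]:
    """Build a callback that rewrites tracker-reported addresses.

    `forwards` maps an (ip, port) pair as reported by the tracker to a local
    port where an SSH port-forward to that server is assumed to be listening.
    """

    def override_rpc_server_ip(ip_from_tracker: str, port_from_tracker: int) -> Address:
        local_port = forwards.get((ip_from_tracker, port_from_tracker))
        if local_port is None:
            # No tunnel known for this server: use the address unchanged.
            return (ip_from_tracker, port_from_tracker)
        return ("127.0.0.1", local_port)

    return override_rpc_server_ip
```

For example, `make_override({("192.168.1.3", 9090): 19090})` would redirect a request for 192.168.1.3:9090 to 127.0.0.1:19090 and leave every other address untouched.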

lmxyy · May 20, 2021, 5:45am

Could we schedule a call to discuss this issue in more detail? I’ve sent you an email.

lmxyy · May 21, 2021, 6:21am

Hi, I found an effortless way to address this issue. When my server (Nano) connects to the tracker (Mac) via an SSH tunnel, the tracker sees the connection as coming from a local address (127.0.0.1 in the error above) rather than from the Nano. Supposing your RPC server binds to port 9090, the client is then told to connect to port 9090 on that local address, which is not the server address. So what you need to do is explicitly create another SSH tunnel that forwards connections to port 9090 on the tracker machine to the RPC server. For example, on the Nano side, you could run:

ssh -C -N -R 9090:0.0.0.0:9090 limuyang@192.168.99.136

Then it would work.
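The two tunnels used in this thread (the -L forward for the tracker port and the -R forward for the RPC server port) can be generated together. This is a small sketch under the assumption that both commands run on the device and log in to the tracker machine; `tunnel_commands` is a hypothetical helper, and the default ports are the ones quoted in the posts above.

```python
from typing import List


def tunnel_commands(login: str, tracker_port: int = 9190, server_port: int = 9090) -> List[List[str]]:
    """Build the two ssh commands to run on the device (the Nano in this thread)."""
    # -L: make the tracker port on the login host reachable as localhost:<tracker_port>.
    to_tracker = ["ssh", "-C", "-N", "-L", f"{tracker_port}:0.0.0.0:{tracker_port}", login]
    # -R: make connections to <server_port> on the login host reach the local RPC server.
    to_server = ["ssh", "-C", "-N", "-R", f"{server_port}:0.0.0.0:{server_port}", login]
    return [to_tracker, to_server]
```

Each returned list can be passed to `subprocess.Popen` as is. The RPC server still has to be started separately with --tracker=localhost:9190.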

lmxyy · May 21, 2021, 6:27am

Another thing I need to mention is that when you run

python -m tvm.exec.query_rpc_tracker --host 0.0.0.0 --port 9190

on your tracker machine, you would get

Tracker address 0.0.0.0:9190

Server List
----------------------------
server-address  key
----------------------------
127.0.0.1:39544 server:nano
----------------------------

Queue Status
----------------------------
key    total  free  pending
----------------------------
nano   1      1     0      
----------------------------

I think the server-address is somewhat confusing, especially its port: my server binds to port 9090, not 39544. It would be much clearer if the server address shown here were 127.0.0.1:9090. Maybe I could make a PR to fix this.
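A likely explanation for the unexpected port (an assumption, not confirmed against the tracker source) is that the listing shows the peer address of the registration connection. For any TCP listener, the peer address of an incoming connection carries the client's ephemeral source port, not the port that the client's own service listens on. The sketch below demonstrates this with standard-library sockets only; the two listeners are stand-ins for the tracker and the RPC server, not TVM code.

```python
import socket


def demo_peer_address():
    """Return (peer seen by the listener, connector's local address, service port)."""
    # Stand-in for the tracker: listens on an OS-chosen loopback port.
    tracker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tracker.bind(("127.0.0.1", 0))
    tracker.listen(1)

    # Stand-in for the RPC server's own listening socket (9090 in this thread).
    rpc_listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    rpc_listener.bind(("127.0.0.1", 0))
    rpc_listener.listen(1)
    rpc_port = rpc_listener.getsockname()[1]

    # The registration connection is a separate outgoing socket with its own port.
    outgoing = socket.create_connection(tracker.getsockname())
    conn, peer = tracker.accept()
    result = (peer, outgoing.getsockname(), rpc_port)

    for s in (conn, outgoing, rpc_listener, tracker):
        s.close()
    return result
```

The first two values returned are identical, and their port differs from the third, which matches the pattern of seeing 39544 in the listing while the server is bound to 9090.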