RPC AutoScheduling keeps timing out

I am trying to tune a model with the auto-scheduler over RPC for a Jetson NX board, driving the tuning from an x86 PC with an RTX 3080.

This is what the board keeps reporting:

2023-12-06 17:58:57.199 INFO connected from ('[x86 PC IP]', 35042)
2023-12-06 17:58:57.203 INFO start serving at /tmp/tmpb2upll9w
2023-12-06 17:59:07.223 INFO timeout in RPC session, kill..
2023-12-06 17:59:07.280 INFO finish serving ('[x86 PC IP]', 35042)
2023-12-06 17:59:07.375 INFO connected from ('[x86 PC IP]', 40966)
2023-12-06 17:59:07.380 INFO start serving at /tmp/tmplh201tek
2023-12-06 17:59:08.006 INFO finish serving ('[x86 PC IP]', 40966)
2023-12-06 17:59:08.102 INFO connected from ('[x86 PC IP]', 40978)
2023-12-06 17:59:08.108 INFO start serving at /tmp/tmp0h0uxuzy
2023-12-06 17:59:18.139 INFO timeout in RPC session, kill..
2023-12-06 17:59:18.212 INFO finish serving ('[x86 PC IP]', 40978)

With timeout set to 60 seconds, the RPC session still times out.
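To make the pattern in the board log easier to see, here is a minimal sketch that measures how long each session lasted. The log text is copied from the post; the helper name `session_durations` is made up for illustration and is not part of TVM.

```python
from datetime import datetime

# Log lines copied verbatim from the board output above.
LOG = """\
2023-12-06 17:58:57.199 INFO connected from ('[x86 PC IP]', 35042)
2023-12-06 17:58:57.203 INFO start serving at /tmp/tmpb2upll9w
2023-12-06 17:59:07.223 INFO timeout in RPC session, kill..
2023-12-06 17:59:07.280 INFO finish serving ('[x86 PC IP]', 35042)
2023-12-06 17:59:07.375 INFO connected from ('[x86 PC IP]', 40966)
2023-12-06 17:59:07.380 INFO start serving at /tmp/tmplh201tek
2023-12-06 17:59:08.006 INFO finish serving ('[x86 PC IP]', 40966)
2023-12-06 17:59:08.102 INFO connected from ('[x86 PC IP]', 40978)
2023-12-06 17:59:08.108 INFO start serving at /tmp/tmp0h0uxuzy
2023-12-06 17:59:18.139 INFO timeout in RPC session, kill..
2023-12-06 17:59:18.212 INFO finish serving ('[x86 PC IP]', 40978)
"""


def session_durations(log_text):
    """Return a list of (outcome, seconds) tuples, one per session.

    A session runs from "start serving" to either the timeout line
    or, if no timeout was logged, the "finish serving" line.
    """
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    results = []
    started = None
    for line in log_text.splitlines():
        parts = line.split(" INFO ", 1)
        if len(parts) != 2:
            continue  # not a log line in the expected format
        stamp = datetime.strptime(parts[0], fmt)
        message = parts[1]
        if message.startswith("start serving"):
            started = stamp
        elif started is not None and message.startswith("timeout in RPC session"):
            results.append(("timeout", (stamp - started).total_seconds()))
            started = None  # the following "finish serving" belongs to this session
        elif started is not None and message.startswith("finish serving"):
            results.append(("finished", (stamp - started).total_seconds()))
            started = None
    return results


for outcome, seconds in session_durations(LOG):
    print(f"{outcome}: {seconds:.3f} s")
```

In this excerpt, both killed sessions end almost exactly 10 seconds after they start, and the one session that finishes normally takes well under a second. The 10 second figure matches the `timeout=10` in the script shown further down, so it may be worth confirming that a larger timeout actually reaches the board.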

The way I set up the RPC system:

  1. Run a tracker on x86 PC by executing python3 -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
  2. Run a server on the board by executing python3 -m tvm.exec.rpc_server --tracker=[x86 PC IP]:9190 --key=jetson
  3. I can see the jetson board by executing python3 -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190 on x86 PC
  4. So I run my autoschedule script with code like this:
import tvm
from tvm import auto_scheduler, relay

mod, params = relay.frontend.from_onnx(...)
target = tvm.target.cuda(arch="sm_72")  # sm_72 for the Jetson NX
tasks, task_weights = auto_scheduler.extract_tasks(
    mod["main"], target=target, params=params)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
# logPath is the path of the tuning log file, defined elsewhere in the script.
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=1000,
    # The tracker runs on this PC, so the runner connects to it locally.
    runner=auto_scheduler.RPCRunner(
        key="jetson", host="127.0.0.1", port=9190, number=10, timeout=10),
    measure_callbacks=[auto_scheduler.RecordToFile(logPath)],
)
tuner.tune(tune_option)
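
Before digging into TVM itself, basic connectivity can be ruled out with a plain TCP check against the tracker address used in the script. This is only a sketch using the Python standard library; `can_connect` is a hypothetical helper, not a TVM function, and a successful connection only shows that the port is open, not that tuning will work.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # int(port) accepts both 9190 and "9190".
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Example: check the tracker address from the script above.
# print(can_connect("127.0.0.1", 9190))
```

The same check can be run from the PC against the board's IP and the port that rpc_server reports, since the two machines also have to reach each other directly.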

The TVM version: 0.15.dev0

I can tune the model with CUDA locally on the board, but it is too slow. With RPC, it keeps timing out. Any suggestions?

     runner=auto_scheduler.RPCRunner(
          key='jetson', host='127.0.0.1', port=9190, number=10, timeout=10),

Can you try to set the host to “0.0.0.0” instead of “127.0.0.1”?

Hello @BitCircuit, I’m not an expert, but you can check a few things:

  1. On the Jetson device, check whether you have permission to write files in the directory where you run rpc_server. rpc_server should have an argument to specify the workdir.
  2. In TVM/python/tvm/rpc/server.py, your error is followed by this message: 'RPCSessionTimeoutError: Your {opts["timeout"]}s session has expired, try to increase the "session_timeout" value.'
  3. In TVM there are many ways to get logs or increase verbosity. You can try one of them.

Thank you for your reply. After trying it, the RPC session still times out (I also tried increasing the timeout limit to 120 seconds, with the same result).

Thank you for your reply and advice. I have checked:

  1. I went through python/tvm/exec/rpc_server.py on the main branch of apache/tvm and did not find any argument for specifying a work directory. I run rpc_server in my home directory, where I should have write permission.
  2. I increased the timeout value to 120 seconds; same problem.
  3. I changed the logging level at https://github.com/apache/tvm/blob/b3eec91ee6254b40920c40e922cb3c37ac1c06a4/python/tvm/exec/rpc_server.py#L96C44-L96C44 from INFO to DEBUG, but nothing extra was printed.

  1. I was thinking of the C++ version of the RPC server. That does not matter now, since you run rpc_server in your home directory.
  2. How did you set up TVM on the Jetson board? It is the whole TVM, not only tvm_runtime.so, right? I think the Jetson is too weak a platform to compile modules quickly, so you may need to increase the timeout even more. Or use cross-compilation (compile on your x86 server to run on the target ARM Jetson).

Yes, it is the whole TVM. Since I could not get RPC to the board working (I think RPC is the only way to cross-compile, right?), I run the auto-scheduler on the board. The time cost is pretty high. For a fairly simple model with a 1x3x1024x600 input, auto-scheduling takes roughly 4 hrs for 1000 trials and 56 hrs for 20000 trials.
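
For scale, the wall-clock figures above can be converted into an average cost per trial. This is plain arithmetic on the numbers quoted in the post; `seconds_per_trial` is just an illustrative helper.

```python
def seconds_per_trial(hours, trials):
    """Average wall-clock seconds spent per measurement trial."""
    return hours * 3600 / trials


print(seconds_per_trial(4, 1000))    # 1000-trial run
print(seconds_per_trial(56, 20000))  # 20000-trial run
```

That works out to roughly 14 seconds per trial for the short run and roughly 10 seconds per trial for the long one, averaged over search, compilation, and measurement together.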

As for increasing the timeout, I tried 30 minutes and the RPC sessions still time out. However, when I run the auto-scheduler on the board, the measurement stage takes ~120 seconds. If there is nothing wrong with the way I set up the RPC system, I suspect there may be a bug in the RPC module.