I have deployed an RPC tracker to a k8s cluster. At the beginning it looked like it was working:
- devices can connect to the RPC tracker
- querying the RPC tracker returns the free devices
- the RPC tracker is behind a k8s Service and is configured in a Deployment with one Pod
But when I do a simple benchmark run, I'm getting the following error:
Traceback (most recent call last):
  File "/workspace/tcl_scripts/benchmark.py", line 106, in <module>
    main(args)
  File "/workspace/tcl_scripts/benchmark.py", line 64, in main
    compile_upload_benchmark_model(args, mod, params, target)
  File "/workspace/tcl_scripts/benchmark.py", line 35, in compile_upload_benchmark_model
    args.rpc_key, args.rpc_tracker, args.rpc_port, timeout=500)
  File "/workspace/python/tvm/autotvm/measure/measure_methods.py", line 735, in request_remote
    remote = tracker.request(device_key, priority=priority, session_timeout=timeout)
  File "/workspace/python/tvm/rpc/client.py", line 418, in request
    "Cannot request %s after %d retry, last_error:%s" % (key, max_retry, str(last_err))
RuntimeError: Cannot request android after 5 retry, last_error:Traceback (most recent call last):
  3: TVMFuncCall
  2: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  1: tvm::runtime::RPCClientConnect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMArgs)
  0: tvm::runtime::RPCConnect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMArgs)
  File "/workspace/src/runtime/rpc/rpc_socket_impl.cc", line 72
TVMError:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (sock.Connect(addr)) is false: Connect to 10.70.227.3:5001 failed
Service:
apiVersion: v1
kind: Service
metadata:
  name: tvm-rpc-tracker-service
spec:
  type: LoadBalancer
  selector:
    app: tvm-rpc-tracker
  ports:
    - name: rpc1
      protocol: TCP
      port: 9190
      targetPort: 9190
    - name: rpc2
      protocol: TCP
      port: 5000
      targetPort: 5000
    - name: rpc3
      protocol: TCP
      port: 5001
      targetPort: 5001
    - name: rpc4
      protocol: TCP
      port: 5002
      targetPort: 5002
    - name: rpc5
      protocol: TCP
      port: 5003
      targetPort: 5003
Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tvm.rpc-tracker-deployment
  labels:
    app: tvm-rpc-tracker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tvm-rpc-tracker
  template:
    metadata:
      labels:
        app: tvm-rpc-tracker
    spec:
      nodeSelector:
        location: dc
      containers:
        - name: tvm
          image: tvm:0.0.3
          command: ["/bin/bash", "-ec", "/usr/bin/python3 -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190"]
          ports:
            - containerPort: 9190
            - containerPort: 5000
            - containerPort: 5001
            - containerPort: 5002
            - containerPort: 5003
Questions: how many 500* ports should I open/forward? Should all of them be TCP? Any ideas on how to debug this? I spotted some strange behaviour:
- without the k8s setup, connected RPC servers are visible in the RPC tracker with their correct IP addresses
- with the k8s setup, connected RPC servers are visible in the RPC tracker with the IP address of the k8s node on which the Pod with the RPC tracker is running
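One way to narrow this down would be to reproduce the failing connection outside TVM, from the same machine that runs the benchmark. Below is a minimal sketch using only the Python standard library. The helper names `can_connect` and `probe_ports` and the 3-second timeout are my own choices for illustration, not part of TVM.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_ports(host, ports, timeout=3.0):
    """Map each port to whether a TCP connection to it succeeded."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

For example, `probe_ports("10.70.227.3", [9190, 5000, 5001, 5002, 5003])` checks the address from the error message against the ports listed in the Service above. This only shows whether each port accepts a TCP connection from where the script runs; it does not tell which process is listening behind it.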