When I try to connect to an existing Ray cluster from a worker node, it hangs without any output.
I removed the other worker nodes and kept only one. From the head node, I can see that the worker node is alive.
I captured some packets with tcpdump. It shows incorrect checksums (cksum) on some of them.
A firewall is running on the system (CentOS 7). I opened some ports according to
https://docs.ray.io/en/latest/configure.html?highlight=port#all-nodes .
So, if the random ports are causing the problem, how can I set them to fixed ports?
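Before changing any port settings, it may help to confirm from the worker node which of the head node's ports are actually reachable through the firewall. The following is a minimal sketch using only the Python standard library; the function name `check_ports` is made up for illustration, and the head address in the usage comment is a placeholder.

```python
import socket


def check_ports(host, ports, timeout=2.0):
    """Try a TCP connection to each port and report whether it succeeded."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


# Example (placeholder address; the port numbers are the ones from this thread):
# print(check_ports("192.0.2.10", [6379, 6380, 6381, 10023, 12345]))
```

A port reported as `False` here is either blocked, not listening, or bound to a different interface, so this only narrows the problem down rather than identifying it.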
I stopped all firewalls and created a Ray cluster with a head node and one worker node.
Head Node (22 workers / 24 cores)
Worker Node (23 workers / 24 cores).
There are 524 ESTAB sockets between the two nodes. The breakdown is as follows:
24 ---------- 10023 (gcs-server-port)
212 ---------- 6379 (redis_port)
141 ---------- 6380 (redis-server)
141 ---------- 6381 (redis-server)
4 ---------- 12345 (object_manager_port)
1 ---------- 2049 (nfs server port)
1 ---------- 13119 (connected to worker node's node_manager_port)
524 in total
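A per-port tally like the one above can be produced from the output of `ss -tn` with a small script. This is a sketch, not the method used for the numbers in this post: the function name is hypothetical, and the addresses in `SAMPLE` are invented to show the expected column layout (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port).

```python
from collections import Counter


def count_estab_by_local_port(ss_output, peer_ip=None):
    """Count ESTAB sockets per local port, optionally for one peer IP only."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue  # header line, other states, or malformed line
        local, peer = fields[3], fields[4]
        peer_host = peer.rpartition(":")[0]
        if peer_ip is not None and peer_host != peer_ip:
            continue
        try:
            port = int(local.rpartition(":")[2])
        except ValueError:
            continue  # non-numeric port (ss was run without -n)
        counts[port] += 1
    return counts


# Invented sample in the layout printed by `ss -tn`.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB  0      0      10.0.0.1:6379       10.0.0.2:40112
ESTAB  0      0      10.0.0.1:6379       10.0.0.2:40114
ESTAB  0      0      10.0.0.1:10023      10.0.0.2:51200
ESTAB  0      0      10.0.0.1:22         10.0.0.9:60000
LISTEN 0      128    *:12345             *:*
"""

for port, n in sorted(count_estab_by_local_port(SAMPLE, peer_ip="10.0.0.2").items()):
    print(n, "----------", port)
```

Running the same function an hour apart and comparing the two counters would show which ports lost their connections.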
After an hour, I can no longer connect to the Ray cluster from the worker node.
$ ss -a | grep | wc -l
# the changed ports
1 ----------- 10023 (gcs-server-port)
0 ----------- 12345 (object_manager_port)
Why does it need so many ports?
Could you share more about your setup? It’s not clear whether you’re running in the cloud, on-prem, or something else. Could you also link to how you’re setting up the Ray cluster?
The Ray cluster is running in two Docker containers (on two CentOS 7 systems on VMware).
I wrote a script so that the nodes reconnect automatically when I restart the containers.
Below are the main commands in entrypoint.sh. The variables in braces take their values from ENV. Each of them has a default value, except for address
(which the worker node should connect to).
# head node
ray start --head --dashboard-host= --node-ip-address=${ip} --port=${redis_port} --redis-shard-ports=${redis_shard1},${redis_shard2} --num-cpus=${cpus} --object-manager-port=${object_manager_port} --node-manager-port=${node_manager_port} --min-worker-port ${min_worker_port} --max-worker-port ${max_worker_port} --gcs-server-port ${gcs_server_port} --block
# worker node
ray start --node-ip-address=${ip} --num-cpus=${cpus} --address=${address} --object-manager-port=${object_manager_port} --node-manager-port=${node_manager_port} --min-worker-port ${min_worker_port} --max-worker-port ${max_worker_port} --block
Then I start the containers.
# start head node container
sudo docker run -itd --net host --name ray-head --shm-size 30G --restart always -v /app:/code -e RAY_NODE_TYPE=head -e NUM_CPUS=22 ray:1.0.0-rllib-tune-tf1.15.2gpu-py3.6.9
# start worker node container
sudo docker run -itd --net host --name ray-worker --shm-size 30G --restart always -v /app:/code -e HEAD_HOST= ray:1.0.0-rllib-tune-tf1.15.2gpu-py3.6.9
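The entrypoint logic described above (environment variables with defaults, fixed ports passed to `ray start`) could be sketched in Python roughly as follows. This is not the actual entrypoint.sh: only `RAY_NODE_TYPE`, `NUM_CPUS` and `HEAD_HOST` appear in the `docker run` commands, the other variable names are hypothetical, the values marked as placeholders are invented, and building `--address` as `HEAD_HOST:REDIS_PORT` is an assumption. The `--node-ip-address` and `--dashboard-host` flags are left out because their values are not shown.

```python
# Defaults: ports mentioned in this thread, plus invented placeholders.
DEFAULTS = {
    "NUM_CPUS": "22",
    "REDIS_PORT": "6379",
    "REDIS_SHARD_PORTS": "6380,6381",
    "GCS_SERVER_PORT": "10023",
    "OBJECT_MANAGER_PORT": "12345",
    "NODE_MANAGER_PORT": "12346",  # placeholder
    "MIN_WORKER_PORT": "10100",    # placeholder
    "MAX_WORKER_PORT": "10200",    # placeholder
}


def build_ray_start_args(env):
    """Build the `ray start` argument list from an environment mapping."""
    cfg = dict(DEFAULTS)
    cfg.update({k: v for k, v in env.items() if k in DEFAULTS and v})

    args = ["ray", "start"]
    if env.get("RAY_NODE_TYPE") == "head":
        args += [
            "--head",
            "--port=" + cfg["REDIS_PORT"],
            "--redis-shard-ports=" + cfg["REDIS_SHARD_PORTS"],
            "--gcs-server-port", cfg["GCS_SERVER_PORT"],
        ]
    else:
        head_host = env.get("HEAD_HOST")
        if not head_host:
            raise ValueError("HEAD_HOST must be set for a worker node")
        # Assumption: the worker connects to the head's Redis port.
        args.append("--address={}:{}".format(head_host, cfg["REDIS_PORT"]))

    # Flags shared by both node types, all pinned to fixed ports.
    args += [
        "--num-cpus=" + cfg["NUM_CPUS"],
        "--object-manager-port=" + cfg["OBJECT_MANAGER_PORT"],
        "--node-manager-port=" + cfg["NODE_MANAGER_PORT"],
        "--min-worker-port", cfg["MIN_WORKER_PORT"],
        "--max-worker-port", cfg["MAX_WORKER_PORT"],
        "--block",
    ]
    return args
```

Passing `os.environ` as `env` and handing the result to `subprocess.run` would start the node; printing the list first makes it easy to confirm that every port is pinned.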
December 15, 2020, 8:38am
Can you also print logs from raylet.err (in a broken worker node) and gcs_server.err?
When I try to connect from the worker node, it hangs and nothing is written to raylet.err or gcs_server.err.
To be more precise, there are no new logs or output except a python-core-driver-xxx.log file.
Logs on the worker node:
# cat python-core-driver-01000000ffffffffffffffffffffffffffffffff.20201215-193721.546.log
I1215 19:37:21.118063 546 546 core_worker.cc:117] Constructing CoreWorkerProcess. pid: 546
When I try to connect from the head node:
Logs on the head node:
# cat python-core-driver-02000000ffffffffffffffffffffffffffffffff.20201215-194841.610.log
I1215 19:48:41.396399 610 610 core_worker.cc:117] Constructing CoreWorkerProcess. pid: 610
I1215 19:48:41.397195 610 610 grpc_server.cc:74] driver server started, listening on port 10020.
I1215 19:48:41.399930 610 610 core_worker.cc:338] Initializing worker at address:, worker ID 02000000ffffffffffffffffffffffffffffffff, raylet 08941a1c2a9fb74ecd6213f6e3bbf6b62c4774f8
I1215 19:48:41.412799 610 610 core_worker.cc:212] Worker 02000000ffffffffffffffffffffffffffffffff is created.
I1215 19:48:41.413022 610 610 io_service_pool.cc:36] IOServicePool is running with 1 io_service.
I1215 19:48:41.413980 610 626 service_based_accessor.cc:791] Received notification for node id = 08941a1c2a9fb74ecd6213f6e3bbf6b62c4774f8, IsAlive = 1
I1215 19:48:41.414042 610 626 service_based_accessor.cc:791] Received notification for node id = 3dd2b650e990f305144e21c1266f5a9ac78ccd40, IsAlive = 1
W1215 19:48:51.421373 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:48:51.421373 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:01.429632 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:01.429632 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:11.438164 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:11.438164 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:21.446527 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:21.446527 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:31.455232 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:31.455232 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:41.463658 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:41.463658 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:51.472220 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:51.472220 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
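Long runs of repeated warnings like the ones above are easier to read once they are grouped. The sketch below parses glog-style lines (severity letter, MMDD date, time, pid, thread id, `file:line]`, message), drops lines that are exact copies of an earlier line including the timestamp, and counts the rest per source location. The function name and the regular expression are my own; the sample lines are copied from the log above.

```python
import re
from collections import Counter

# glog-style prefix: <sev><MMDD> <time> <pid> <tid> <file>:<line>] <message>
GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+)\s+(?P<tid>\d+)\s+(?P<src>[^\s:]+:\d+)\] (?P<msg>.*)$"
)
SEVERITY_ORDER = "IWEF"


def summarize_glog(text, min_severity="W"):
    """Count distinct log events at or above min_severity per (sev, src, msg)."""
    threshold = SEVERITY_ORDER.index(min_severity)
    seen = set()
    counts = Counter()
    for raw in text.splitlines():
        line = raw.strip()
        m = GLOG_RE.match(line)
        if m is None or SEVERITY_ORDER.index(m.group("sev")) < threshold:
            continue
        if line in seen:
            continue  # identical line, timestamp included: count it once
        seen.add(line)
        counts[(m.group("sev"), m.group("src"), m.group("msg"))] += 1
    return counts


SAMPLE_LOG = """\
I1215 19:48:41.396399 610 610 core_worker.cc:117] Constructing CoreWorkerProcess. pid: 610
W1215 19:48:51.421373 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:48:51.421373 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:01.429632 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:01.429632 610 627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
"""

for (sev, src, msg), n in summarize_glog(SAMPLE_LOG).items():
    print(n, sev, src, msg)
```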
# cat gcs_server.out
I1215 19:02:14.853004 34 34 io_service_pool.cc:36] IOServicePool is running with 1 io_service.
I1215 19:02:14.866667 34 34 gcs_redis_failure_detector.cc:30] Starting redis failure detector.
I1215 19:02:14.866946 34 34 gcs_object_manager.cc:271] Loading initial data.
I1215 19:02:14.867050 34 34 gcs_node_manager.cc:424] Loading initial data.
I1215 19:02:14.867504 34 34 gcs_object_manager.cc:286] Finished loading initial data.
I1215 19:02:14.867750 34 34 gcs_node_manager.cc:445] Finished loading initial data.
I1215 19:02:14.867771 34 34 gcs_actor_manager.cc:913] Loading initial data.
I1215 19:02:14.867983 34 34 gcs_actor_manager.cc:976] Finished loading initial data.
I1215 19:02:14.869662 34 34 grpc_server.cc:74] GcsServer server started, listening on port 10023.
I1215 19:02:14.977577 34 34 gcs_server.cc:260] Gcs server address =
I1215 19:02:14.977643 34 34 gcs_server.cc:264] Finished setting gcs server address:
I1215 19:02:15.133950 34 34 gcs_node_manager.cc:175] Registering node info, node id = 08941a1c2a9fb74ecd6213f6e3bbf6b62c4774f8, address =
I1215 19:02:15.134243 34 34 gcs_node_manager.cc:181] Finished registering node info, node id = 08941a1c2a9fb74ecd6213f6e3bbf6b62c4774f8, address =
I1215 19:02:15.135046 34 34 gcs_job_manager.cc:93] Getting all job info.
I1215 19:02:15.135159 34 34 gcs_job_manager.cc:99] Finished getting all job info.
I1215 19:02:22.879297 34 34 gcs_node_manager.cc:175] Registering node info, node id = 3dd2b650e990f305144e21c1266f5a9ac78ccd40, address =
I1215 19:02:22.879776 34 34 gcs_node_manager.cc:181] Finished registering node info, node id = 3dd2b650e990f305144e21c1266f5a9ac78ccd40, address =
I1215 19:02:22.883252 34 34 gcs_job_manager.cc:93] Getting all job info.
I1215 19:02:22.883749 34 34 gcs_job_manager.cc:99] Finished getting all job info.
I1215 19:48:41.397243 34 34 gcs_job_manager.cc:26] Adding job, job id = 02000000, driver pid = 610
I1215 19:48:41.397565 34 34 gcs_job_manager.cc:36] Finished adding job, job id = 02000000, driver pid = 610
No new output on the worker node.
The only difference I know of is that the sockets connected to the head node’s object_manager_port were closed.
>>> import ray
>>> ray.init(address="auto")
2020-12-15 21:44:42,867 INFO worker.py:634 -- Connecting to existing Ray cluster at address:
F1215 21:48:12.604311 336 336 raylet_client.cc:108] Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: No such file or directory
*** Check failure stack trace: ***
@ 0x7f2ea484a8ad google::LogMessage::Fail()
@ 0x7f2ea484ba0c google::LogMessage::SendToLog()
@ 0x7f2ea484a589 google::LogMessage::Flush()
@ 0x7f2ea484a7a1 google::LogMessage::~LogMessage()
@ 0x7f2ea4801949 ray::RayLog::~RayLog()
@ 0x7f2ea452a958 ray::raylet::RayletClient::RayletClient()
@ 0x7f2ea44c8c27 ray::CoreWorker::CoreWorker()
@ 0x7f2ea44ccdd4 ray::CoreWorkerProcess::CreateWorker()
@ 0x7f2ea44ce042 ray::CoreWorkerProcess::CoreWorkerProcess()
@ 0x7f2ea44cea0b ray::CoreWorkerProcess::Initialize()
@ 0x7f2ea4408bfe __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
@ 0x7f2ea440a3e5 __pyx_tp_new_3ray_7_raylet_CoreWorker()
@ 0x551b15 (unknown)
@ 0x5aa6ec _PyObject_FastCallKeywords
@ 0x50abb3 (unknown)
@ 0x50c5b9 _PyEval_EvalFrameDefault
@ 0x508245 (unknown)
@ 0x50a080 (unknown)
@ 0x50aa7d (unknown)
@ 0x50d390 _PyEval_EvalFrameDefault
@ 0x50888b (unknown)
@ 0x50a080 (unknown)
@ 0x50aa7d (unknown)
@ 0x50d390 _PyEval_EvalFrameDefault
@ 0x508245 (unknown)
@ 0x50b403 PyEval_EvalCode
@ 0x635222 (unknown)
@ 0x4ad8e5 (unknown)
@ 0x4afd04 PyRun_InteractiveLoopFlags
@ 0x638c73 PyRun_AnyFileExFlags
@ 0x639631 Py_Main
@ 0x4b0f40 main
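One possible reading of "IOError: No such file or directory" in the check failure above is that a local path the driver expects, such as the raylet's Unix domain socket, does not exist on this node. That is a guess, not something the trace confirms. A quick way to test it is to look at the session's sockets directory. In the sketch below, the function name is hypothetical, and both the directory `/tmp/ray/session_latest/sockets` and the socket names `raylet` and `plasma_store` are assumptions about the default layout that may not match this setup.

```python
import os
import stat


def check_ray_sockets(sockets_dir, names=("raylet", "plasma_store")):
    """Report for each expected name whether it is a socket, something else, or missing."""
    report = {}
    for name in names:
        path = os.path.join(sockets_dir, name)
        try:
            mode = os.stat(path).st_mode
        except FileNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "socket" if stat.S_ISSOCK(mode) else "not a socket"
    return report


# Example (assumed default location, run inside the worker container):
# print(check_ray_sockets("/tmp/ray/session_latest/sockets"))
```

If the raylet socket is reported as missing while the raylet process is still running, the session directory may have been cleaned up or replaced since the node started.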