[ray 1.0.0] Stuck when connecting to an existing Ray cluster

When I try to connect to an existing Ray cluster from a worker node, it gets stuck without any output.

I removed the other worker nodes and kept only one. From the head node, I can see that the worker node is alive.

I captured some packets with tcpdump; it shows some incorrect checksums.


The firewall is running on the system (CentOS 7). I opened the ports listed in https://docs.ray.io/en/latest/configure.html?highlight=port#all-nodes.
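The rules I added look roughly like this (port numbers come from that page; this is a sketch, not the exact commands I ran):

# open Ray's ports on CentOS 7 / firewalld (example; adjust ports and ranges to your setup)
firewall-cmd --permanent --add-port=6379/tcp        # redis
firewall-cmd --permanent --add-port=6380-6381/tcp   # redis shards
firewall-cmd --permanent --add-port=8265/tcp        # dashboard
firewall-cmd --reload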

So, if the random ports are the cause of the problem, how can I set them to fixed ports?

I stopped all firewalls and created a Ray cluster with a head node and one worker node.

Head Node (22 workers / 24 cores)
Worker Node (23 workers / 24 cores).

There are 524 ESTAB sockets between the two nodes. The details are as follows:

 24 ----- 10023 (gcs-server-port)
212 ----- 6379  (redis_port)
141 ----- 6380  (redis-server)
141 ----- 6381  (redis-server)
  4 ----- 12345 (object_manager_port)
  1 ----- 2049  (NFS server port)
  1 ----- 13119 (connected to the worker node's node_manager_port)

524 in total
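The breakdown above was produced with something along these lines (the exact filter may differ):

# count established sockets that involve the head node, grouped by its port
ss -tan | grep 192.168.250.10 | grep -o '192\.168\.250\.10:[0-9]*' | cut -d: -f2 | sort | uniq -c | sort -rn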

After an hour, I can't connect to the Ray cluster from the worker node.

$ ss -a | grep 192.168.250.10 | wc -l
497

# ports whose socket counts changed
1 ----- 10023 (gcs-server-port)
0 ----- 12345 (object_manager_port)

Why does it need so many ports?

Could you share more about your setup? It's not clear whether you're running in the cloud, on prem, or something else. Could you also link to the way you're setting up the Ray cluster?

The Ray cluster is running in two Docker containers (two CentOS 7 systems on VMware).

I wrote a script so that the nodes reconnect automatically when I restart the containers.
Below are the main commands in entrypoint.sh. The variables in braces get their values from environment variables; each has a default, except for address (the head address the worker node should connect to). A rough sketch of how the defaults are read follows the commands.

# head node
ray start --head --dashboard-host=0.0.0.0 --node-ip-address=${ip} --port=${redis_port} --redis-shard-ports=${redis_shard1},${redis_shard2} --num-cpus=${cpus} --object-manager-port=${object_manager_port} --node-manager-port=${node_manager_port} --min-worker-port ${min_worker_port} --max-worker-port ${max_worker_port} --gcs-server-port ${gcs_server_port} --block

# worker node
ray start --node-ip-address=${ip} --num-cpus=${cpus} --address=${address} --object-manager-port=${object_manager_port} --node-manager-port=${node_manager_port} --min-worker-port ${min_worker_port} --max-worker-port ${max_worker_port} --block
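The defaults are set at the top of entrypoint.sh, roughly like this (variable and ENV names here are illustrative, not the exact script):

# read configuration from ENV, falling back to defaults (illustrative names and values)
ip=${NODE_IP:-$(hostname -i)}
redis_port=${REDIS_PORT:-6379}
gcs_server_port=${GCS_SERVER_PORT:-10023}
object_manager_port=${OBJECT_MANAGER_PORT:-12345}
cpus=${NUM_CPUS:-24}
# no default: the worker must be told which head to join
address=${HEAD_HOST:+${HEAD_HOST}:${redis_port}}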

Then I start the containers.

# start head node container
sudo docker run -itd --net host --name ray-head --shm-size 30G --restart always -v /app:/code -e RAY_NODE_TYPE=head -e NUM_CPUS=22 ray:1.0.0-rllib-tune-tf1.15.2gpu-py3.6.9

# start worker node container
sudo docker run -itd --net host --name ray-worker --shm-size 30G --restart always -v /app:/code -e HEAD_HOST=192.168.250.10 ray:1.0.0-rllib-tune-tf1.15.2gpu-py3.6.9
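entrypoint.sh then chooses between the two ray start invocations based on the ENV passed to docker run, roughly like this (a sketch, not the exact script):

# sketch: choose head or worker mode from RAY_NODE_TYPE (full flag lists are shown above)
if [ "${RAY_NODE_TYPE:-worker}" = "head" ]; then
    start_cmd="ray start --head --port=${redis_port}"   # plus the other head-node flags above
else
    start_cmd="ray start --address=${address}"          # plus the other worker-node flags above
fi
exec ${start_cmd} --block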

Can you also print logs from raylet.err (in a broken worker node) and gcs_server.err?
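By default they should be under the session log directory on each node, something like:

# default Ray log location (the session directory name may differ)
cat /tmp/ray/session_latest/logs/raylet.err
cat /tmp/ray/session_latest/logs/gcs_server.err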

When I try to connect from the worker node, it gets stuck and there is no output in raylet.err or gcs_server.err.
To be more precise, there are no new logs or outputs except a python-core-driver-xxx.log file.
Logs in the worker node:

# cat python-core-driver-01000000ffffffffffffffffffffffffffffffff.20201215-193721.546.log
I1215 19:37:21.118063   546   546 core_worker.cc:117] Constructing CoreWorkerProcess. pid: 546

When I try to connect from the head node:
Logs in the head node:

# cat python-core-driver-02000000ffffffffffffffffffffffffffffffff.20201215-194841.610.log
I1215 19:48:41.396399   610   610 core_worker.cc:117] Constructing CoreWorkerProcess. pid: 610
I1215 19:48:41.397195   610   610 grpc_server.cc:74] driver server started, listening on port 10020.
I1215 19:48:41.399930   610   610 core_worker.cc:338] Initializing worker at address: 192.168.250.10:10020, worker ID 02000000ffffffffffffffffffffffffffffffff, raylet 08941a1c2a9fb74ecd6213f6e3bbf6b62c4774f8
I1215 19:48:41.412799   610   610 core_worker.cc:212] Worker 02000000ffffffffffffffffffffffffffffffff is created.
I1215 19:48:41.413022   610   610 io_service_pool.cc:36] IOServicePool is running with 1 io_service.
I1215 19:48:41.413980   610   626 service_based_accessor.cc:791] Received notification for node id = 08941a1c2a9fb74ecd6213f6e3bbf6b62c4774f8, IsAlive = 1
I1215 19:48:41.414042   610   626 service_based_accessor.cc:791] Received notification for node id = 3dd2b650e990f305144e21c1266f5a9ac78ccd40, IsAlive = 1
W1215 19:48:51.421373   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:48:51.421373   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:01.429632   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:01.429632   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:11.438164   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:11.438164   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:21.446527   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:21.446527   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:31.455232   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:31.455232   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:41.463658   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:41.463658   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:51.472220   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses
W1215 19:49:51.472220   610   627 metric_exporter.cc:211] Export metrics to agent failed: IOError: 14: failed to connect to all addresses

# cat gcs_server.out 
I1215 19:02:14.853004    34    34 io_service_pool.cc:36] IOServicePool is running with 1 io_service.
I1215 19:02:14.866667    34    34 gcs_redis_failure_detector.cc:30] Starting redis failure detector.
I1215 19:02:14.866946    34    34 gcs_object_manager.cc:271] Loading initial data.
I1215 19:02:14.867050    34    34 gcs_node_manager.cc:424] Loading initial data.
I1215 19:02:14.867504    34    34 gcs_object_manager.cc:286] Finished loading initial data.
I1215 19:02:14.867750    34    34 gcs_node_manager.cc:445] Finished loading initial data.
I1215 19:02:14.867771    34    34 gcs_actor_manager.cc:913] Loading initial data.
I1215 19:02:14.867983    34    34 gcs_actor_manager.cc:976] Finished loading initial data.
I1215 19:02:14.869662    34    34 grpc_server.cc:74] GcsServer server started, listening on port 10023.
I1215 19:02:14.977577    34    34 gcs_server.cc:260] Gcs server address = 192.168.250.10:10023
I1215 19:02:14.977643    34    34 gcs_server.cc:264] Finished setting gcs server address: 192.168.250.10:10023
I1215 19:02:15.133950    34    34 gcs_node_manager.cc:175] Registering node info, node id = 08941a1c2a9fb74ecd6213f6e3bbf6b62c4774f8, address = 192.168.250.10
I1215 19:02:15.134243    34    34 gcs_node_manager.cc:181] Finished registering node info, node id = 08941a1c2a9fb74ecd6213f6e3bbf6b62c4774f8, address = 192.168.250.10
I1215 19:02:15.135046    34    34 gcs_job_manager.cc:93] Getting all job info.
I1215 19:02:15.135159    34    34 gcs_job_manager.cc:99] Finished getting all job info.
I1215 19:02:22.879297    34    34 gcs_node_manager.cc:175] Registering node info, node id = 3dd2b650e990f305144e21c1266f5a9ac78ccd40, address = 192.168.250.17
I1215 19:02:22.879776    34    34 gcs_node_manager.cc:181] Finished registering node info, node id = 3dd2b650e990f305144e21c1266f5a9ac78ccd40, address = 192.168.250.17
I1215 19:02:22.883252    34    34 gcs_job_manager.cc:93] Getting all job info.
I1215 19:02:22.883749    34    34 gcs_job_manager.cc:99] Finished getting all job info.
I1215 19:48:41.397243    34    34 gcs_job_manager.cc:26] Adding job, job id = 02000000, driver pid = 610
I1215 19:48:41.397565    34    34 gcs_job_manager.cc:36] Finished adding job, job id = 02000000, driver pid = 610

No new output in the worker node.

The only difference I've noticed is that the sockets connected to the head's object_manager_port were closed.

>>> import ray
>>> ray.init(address="auto")
2020-12-15 21:44:42,867 INFO worker.py:634 -- Connecting to existing Ray cluster at address: 192.168.250.10:6379
F1215 21:48:12.604311   336   336 raylet_client.cc:108]  Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: No such file or directory
*** Check failure stack trace: ***
    @     0x7f2ea484a8ad  google::LogMessage::Fail()
    @     0x7f2ea484ba0c  google::LogMessage::SendToLog()
    @     0x7f2ea484a589  google::LogMessage::Flush()
    @     0x7f2ea484a7a1  google::LogMessage::~LogMessage()
    @     0x7f2ea4801949  ray::RayLog::~RayLog()
    @     0x7f2ea452a958  ray::raylet::RayletClient::RayletClient()
    @     0x7f2ea44c8c27  ray::CoreWorker::CoreWorker()
    @     0x7f2ea44ccdd4  ray::CoreWorkerProcess::CreateWorker()
    @     0x7f2ea44ce042  ray::CoreWorkerProcess::CoreWorkerProcess()
    @     0x7f2ea44cea0b  ray::CoreWorkerProcess::Initialize()
    @     0x7f2ea4408bfe  __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
    @     0x7f2ea440a3e5  __pyx_tp_new_3ray_7_raylet_CoreWorker()
    @           0x551b15  (unknown)
    @           0x5aa6ec  _PyObject_FastCallKeywords
    @           0x50abb3  (unknown)
    @           0x50c5b9  _PyEval_EvalFrameDefault
    @           0x508245  (unknown)
    @           0x50a080  (unknown)
    @           0x50aa7d  (unknown)
    @           0x50d390  _PyEval_EvalFrameDefault
    @           0x50888b  (unknown)
    @           0x50a080  (unknown)
    @           0x50aa7d  (unknown)
    @           0x50d390  _PyEval_EvalFrameDefault
    @           0x508245  (unknown)
    @           0x50b403  PyEval_EvalCode
    @           0x635222  (unknown)
    @           0x4ad8e5  (unknown)
    @           0x4afd04  PyRun_InteractiveLoopFlags
    @           0x638c73  PyRun_AnyFileExFlags
    @           0x639631  Py_Main
    @           0x4b0f40  main
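The "No such file or directory" from RayletClient suggests the raylet's Unix socket file is missing when the driver tries to register. A quick check on the worker node (default session path, which may differ on your setup):

# the raylet socket should exist here while the node is up
ls -l /tmp/ray/session_latest/sockets/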