Cannot connect to Ray head after some workload

zzb3886 · October 18, 2022, 4:39am

High: It blocks me from completing my task.

Greetings! I set up a Ray cluster on a few computers following the "Launching an On-Premise Cluster" guide (Ray 2.0.0). It works fine under a light workload. However, when I test the cluster with a heavy workload (one that takes more than 30 minutes to complete), the Ray head goes into a broken state and can no longer be connected to until it is restarted (ray.init() hangs forever). The dashboard keeps running, though. How can I debug this? Thanks!
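One quick first check when ray.init() hangs is whether the head's GCS port still accepts TCP connections at all. The snippet below is a minimal sketch using only the Python standard library; the helper name can_connect is hypothetical, and the host and port in the commented usage line are placeholders to replace with your head node's address. A successful connect only shows that something is listening on the port, not that the GCS is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (placeholder address; substitute your head node and GCS port):
# print(can_connect("192.168.1.9", 6379))
```

Running this from a worker machine as well as from the head itself can help separate a network problem from a GCS process that has stopped accepting connections.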

zzb3886 · October 24, 2022, 8:25pm

It happened again today. The GCS server is down but the dashboard is still up. The only error I could find in the GCS server log (gcs_server.err) is:

E1024 13:06:16.825174807    1637 tcp_server_posix.cc:213]    Failed accept4: Too many open files
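"Too many open files" usually means a process has reached its file descriptor limit. The following is a rough sketch, assuming a Linux host, for comparing the limit with the number of descriptors a process currently holds. The helper names nofile_limits and count_open_fds are made up for illustration, and the limits printed are those of the Python process running the script, which may differ from the limits the gcs_server process was started with.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds(pid):
    """Count open file descriptors of `pid` via /proc (Linux only).

    Returns None if /proc/<pid>/fd cannot be read (no such process,
    insufficient permissions, or a non-Linux system).
    """
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


soft, hard = nofile_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
print(f"open descriptors in this process: {count_open_fds(os.getpid())}")
```

To inspect the GCS server itself, pass the PID of the gcs_server process (found with ps, for example) to count_open_fds, run as the same user or as root. On Linux, /proc/<pid>/limits shows the limits that actually apply to that process.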

zzb3886 · October 25, 2022, 5:40pm

Reproduced it again. The head server went down seconds after I submitted the job:

2022-10-25 10:37:55,912 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 192.168.1.9:6379...
2022-10-25 10:37:55,922 INFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at 192.168.1.9:8265 
(pid=gcs_server) E1025 10:38:12.624350177    1647 tcp_server_posix.cc:213]    Failed accept4: Too many open files
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,636 E 2165 2165] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.2.100:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,639 E 2266 2266] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.2.100:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,643 E 2021 2021] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.2.100:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,686 E 2178 2178] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.2.100:6379 within 5 seconds.
