Cannot connect to Ray head after some workload

  • High: It blocks me from completing my task.

Greetings! I set up a Ray cluster on a few computers following Launching an On-Premise Cluster — Ray 2.0.0. With a light workload it works fine. However, when I test the cluster with heavy workloads (taking more than 30 minutes to complete), the Ray head goes into a broken state and nothing can connect to it until it is restarted (ray.init() hangs forever). The dashboard keeps running, though. How can I debug this issue? Thanks!

It happened again today. The GCS server is down but the dashboard is still up. The only error I could find in the GCS server log (gcs_server.err) is:

E1024 13:06:16.825174807    1637 tcp_server_posix.cc:213]    Failed accept4: Too many open files

Reproduced it again. The head server goes down seconds after the job is submitted:

2022-10-25 10:37:55,912 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 192.168.1.9:6379...
2022-10-25 10:37:55,922 INFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at 192.168.1.9:8265 
(pid=gcs_server) E1025 10:38:12.624350177    1647 tcp_server_posix.cc:213]    Failed accept4: Too many open files
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,636 E 2165 2165] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.2.100:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,639 E 2266 2266] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.2.100:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,643 E 2021 2021] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.2.100:6379 within 5 seconds.
(raylet, ip=192.168.1.14) [2022-10-25 10:38:17,686 E 2178 2178] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.2.100:6379 within 5 seconds.
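
Since gcs_server reports "Too many open files", one thing worth checking on the head node is the open-file limit and how many file descriptors a process is actually holding. Below is a minimal sketch of that check, assuming a Linux host with /proc mounted. The helper names (fd_limits, count_open_fds) are made up for illustration and are not part of Ray. Run without arguments, it only inspects the Python process itself.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds(pid="self"):
    """Count open file descriptors of a process by listing /proc/<pid>/fd.

    Linux only. Inspecting another user's process requires root.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


soft, hard = fd_limits()
print(f"open-file limit: soft={soft}, hard={hard}")

# /proc/<pid>/fd exists only on Linux, so skip the count elsewhere.
if os.path.isdir("/proc/self/fd"):
    print(f"open fds in this process: {count_open_fds()}")
```

To look at the GCS server itself, pass its PID (which you would have to look up on the head node) to count_open_fds() and compare the result against the limit in effect for that process. If the count is close to the soft limit, raising the limit (for example with ulimit -n in the shell that launches ray start) before starting the head is a plausible next step. This is a guess based on the error message, not a confirmed diagnosis.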