How severely does this issue affect your experience of using Ray?
- Low: It annoys or frustrates me for a moment.
My Ray cluster (version 2.2.0) often stops working with an error like this:
(pid=gcs_server, ip=127.0.0.1) E0206 03:31:52.119873000 123145335304192 tcp_server_posix.cc:213] Failed accept4: Too many open files
I found a similar topic, and I know there is a workaround of raising the file descriptor limit with ulimit.
I wrote a script that reproduces the issue. Is this a bug, or am I doing something wrong?
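As a quick sanity check before starting Ray, the current soft limit can be inspected from Python's standard library. This is just a sketch; the 1024 threshold below is my own arbitrary guess, not a documented Ray minimum:

```python
import resource

# Query the soft and hard limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit = {soft}, hard limit = {hard}")

# 1024 is an assumed threshold for illustration, not a Ray requirement
if soft < 1024:
    print("Warning: the soft fd limit may be too low for a Ray cluster")
```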
My Environment
Head node: macOS 13.2 (4 cores)
Worker node: macOS 13.2 (8 cores)
Python: 3.10
Ray: 2.2.0
Step 1: Start a cluster
This issue doesn’t happen when I start Ray on a single machine, so I created a two-node cluster.
Start a head node:
# Lower the file descriptor limit so the issue reproduces quickly
node1 % ulimit -n 128
node1 % ray start --head --node-ip-address=192.168.0.2
Start a worker node:
node2 % ray start --address=192.168.0.2:6379 --node-ip-address=192.168.0.3
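To see how close a process gets to the 128-descriptor ceiling, I count its open descriptors. The sketch below counts the current process's descriptors via /dev/fd (works on both macOS and Linux); for the gcs_server process itself, pointing lsof -p at its PID would show the same information:

```python
import os

def open_fd_count():
    # /dev/fd lists the open file descriptors of the calling process
    return len(os.listdir("/dev/fd"))

print(open_fd_count())
```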
Step 2: Run a script
I run the following script on the head node:
import time

import ray


@ray.remote
def task(i):
    time.sleep(0.1)
    return i


with ray.init(_node_ip_address="192.168.0.2"):
    print(ray.get([task.remote(i) for i in range(100)]))
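If my suspicion is right, each driver connection leaves sockets open on the GCS side until the limit is exhausted. As a plain-socket analogy (not Ray code), a server that never closes accepted connections holds one extra descriptor per client run:

```python
import socket

# A toy server that accepts connections but never closes them,
# analogous to a process that leaks one fd per driver connection.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # pick any free port
server.listen(16)
port = server.getsockname()[1]

clients = []
accepted = []
for _ in range(5):  # five "runs" of the client script
    clients.append(socket.create_connection(("127.0.0.1", port)))
    conn, _ = server.accept()
    accepted.append(conn)  # kept open: one leaked descriptor per run

print(len(accepted))  # five descriptors now held by the "server"
```

With a low enough ulimit, a loop like this eventually fails with the same "Too many open files" error.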
The script runs successfully for the first few runs:
node1 % python sample-script.py
2023-02-06 03:31:39,516 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 192.168.0.2:6379...
2023-02-06 03:31:39,525 INFO worker.py:1529 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
However, after a few more runs, the script suddenly fails with the following error:
node1 % python sample-script.py
2023-02-06 03:31:50,959 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 192.168.0.2:6379...
2023-02-06 03:31:50,969 INFO worker.py:1529 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
(pid=gcs_server, ip=127.0.0.1) E0206 03:31:52.119873000 123145335304192 tcp_server_posix.cc:213] Failed accept4: Too many open files
(raylet, ip=127.0.0.1) [2023-02-06 03:31:57,132 E 28338 1041974] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.0.2:6379 within 5 seconds.
(raylet, ip=127.0.0.1) [2023-02-06 03:31:57,255 E 28343 1041984] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.0.2:6379 within 5 seconds.
(scheduler +7s, ip=127.0.0.1) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(scheduler +7s, ip=127.0.0.1) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
After this, I have to restart the cluster, which is really annoying.