Gcs_server: Too many open files

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

My Ray cluster (version 2.2.0) often stops working with an error like this:

(pid=gcs_server, ip=127.0.0.1) E0206 03:31:52.119873000 123145335304192 tcp_server_posix.cc:213]      Failed accept4: Too many open files

I found a similar topic, so I know there is a workaround of raising the limit with ulimit.

I wrote a script that reproduces the issue. Is this a bug, or am I doing something wrong?


My Environment

Head node: macOS 13.2 (4 cores)
Worker node: macOS 13.2 (8 cores)
Python: 3.10
Ray: 2.2.0

Step 1: Start a cluster

This issue doesn’t happen when I start Ray on a single machine, so I created a two-node cluster.

Start a head node:

# Lower the file descriptor limit to reproduce the issue quickly
node1 % ulimit -n 128

node1 % ray start --head --node-ip-address=192.168.0.2

Start a worker node:

node2 % ray start --address=192.168.0.2:6379 --node-ip-address=192.168.0.3
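
After both nodes are up, I run ray status on the head node as a quick sanity check that the worker actually joined (the exact output varies by version, but it should list the resources of both nodes):

node1 % ray status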

Step 2: Run a script

I run the following script on the head node:

import time
import ray

@ray.remote
def task(i):
    time.sleep(0.1)
    return i

with ray.init(_node_ip_address="192.168.0.2"):
    print(ray.get([task.remote(i) for i in range(100)]))

The script runs successfully for the first few runs:

node1 % python sample-script.py
2023-02-06 03:31:39,516	INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 192.168.0.2:6379...
2023-02-06 03:31:39,525	INFO worker.py:1529 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

However, after a few more runs, the script suddenly fails with the following error:

node1 % python sample-script.py
2023-02-06 03:31:50,959	INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 192.168.0.2:6379...
2023-02-06 03:31:50,969	INFO worker.py:1529 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
(pid=gcs_server, ip=127.0.0.1) E0206 03:31:52.119873000 123145335304192 tcp_server_posix.cc:213]      Failed accept4: Too many open files
(raylet, ip=127.0.0.1) [2023-02-06 03:31:57,132 E 28338 1041974] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.0.2:6379 within 5 seconds.
(raylet, ip=127.0.0.1) [2023-02-06 03:31:57,255 E 28343 1041984] gcs_rpc_client.h:202: Failed to connect to GCS at address 192.168.0.2:6379 within 5 seconds.
(scheduler +7s, ip=127.0.0.1) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(scheduler +7s, ip=127.0.0.1) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.

After this, I have to restart the cluster, which is really annoying.

Hi @k24d, the ulimit you set is probably too small. It’s recommended to set it to a much larger number.

Hi @Chen_Shen. I set a small value to reproduce the issue quickly. I tried larger values and realized this issue doesn’t happen when I set ulimit -n 8192.
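
For reference, this is roughly how I now restart the head node with the larger limit (ulimit -n only affects processes started from the same shell, so it has to be set before ray start):

# Stop any running Ray processes, raise the per-shell fd limit,
# then restart the head node from the same shell
node1 % ray stop
node1 % ulimit -n 8192
node1 % ray start --head --node-ip-address=192.168.0.2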

To see what’s going on, I monitored gcs_server’s TCP connection states with lsof while calling ray.init() and running the 100 tasks repeatedly (200 iterations).
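
The monitoring was something like the following loop (a rough sketch, assuming pgrep -f finds the single gcs_server process on the head node):

# Count gcs_server's TCP connections by state every 5 seconds
node1 % GCS_PID=$(pgrep -f gcs_server)
node1 % while true; do lsof -a -p "$GCS_PID" -iTCP | awk 'NR>1 {print $NF}' | sort | uniq -c; sleep 5; done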

As a result, the number of TCP connections changed as follows:

[Plot: number of gcs_server TCP connections per state over the 200 iterations]

It seems that CLOSED sockets are cleaned up periodically, but this doesn’t happen for the first few thousand connections.

I don’t know why this happens, but I’ll use ulimit -n 8192 or larger. Thanks.