Ray Autoscaler Too Many Files Open

jonahrosenblum · April 13, 2021, 5:29am

I am having an issue with using the Ray autoscaler on AWS. The head node seems to work fine, but when the autoscaler tries to spin up new nodes I get the following error:

(pid=gcs_server) E0412 21:52:02.397838248 168 tcp_server_posix.cc:213] Failed accept4: Too many open files
2021-04-12 21:52:08,435 WARNING worker.py:1086 – A worker died or was killed while executing task ffffffffffffffff41431ab4001c6d2e49c9459101000000.

Followed shortly by

service_based_gcs_client.cc:229: Couldn’t reconnect to GCS server. The last attempted GCS server address was 172.31.17.191:40789

I’m assuming the GCS server is crashing somehow, maybe related to the number of files that the system allows to be open at any given time. I don’t think it’s a resource constraint issue because I’ve used a variety of node types. I’m really not sure what to make of this. Thanks for the help!

asm582 · April 13, 2021, 3:12pm

A suggestion that may help: check the ulimit on the nodes and, if possible, increase it. I had a similar issue in my local setup.
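As a rough sketch of that check, the snippet below reads the open-file limit (what `ulimit -n` reports) from inside a Python process and tries to raise the soft limit. It uses only the standard-library `resource` module; the helper names `nofile_limits` and `raise_soft_limit` are made up for illustration and are not part of Ray. Note the assumption that matters: a limit changed this way applies only to the current process and the children it starts afterwards, so it would not affect a gcs_server that is already running.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Try to raise the soft open-file limit to `target`.

    The soft limit can never exceed the hard limit without extra
    privileges, so the request is capped at the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only touch the limit if it is finite and actually lower than target.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", nofile_limits())
    print("after: ", raise_soft_limit(65535))
```

If the value printed after the call is still low, the hard limit is probably the bottleneck, and that has to be raised at the system or container level rather than from inside the process.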

jonahrosenblum · April 13, 2021, 4:10pm

I did try this. The highest I can set ulimit to is 4096, and I’m not sure if the issue is the ulimit of my head node or of the worker nodes.

sangcho · April 13, 2021, 6:03pm

gcs_server runs only on the head node, so that is where you need to increase the ulimit. If you hit a permission issue when raising it, see the topic “Setting ulimits on EC2 instances”.
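One way to confirm that the head node's gcs_server is the process hitting the limit is to compare its open descriptor count with its effective limit. The sketch below is Linux-only and reads the `/proc` filesystem directly; the helper names are hypothetical, and matching the process by the substring "gcs_server" in its command line is an assumption about how the process appears on your node.

```python
import os


def open_fd_count(pid):
    """Number of file descriptors currently open in process `pid`."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def max_open_files(pid):
    """Return the (soft, hard) 'Max open files' limits of `pid` as strings."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Columns: Max open files <soft> <hard> files
                parts = line.split()
                return parts[3], parts[4]
    raise RuntimeError("'Max open files' not found in /proc limits")


def find_pids(name):
    """PIDs (other than this script) whose command line contains `name`."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit() or int(entry) == os.getpid():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        if name in cmdline:
            pids.append(int(entry))
    return pids


if __name__ == "__main__":
    for pid in find_pids("gcs_server"):
        soft, hard = max_open_files(pid)
        print(f"pid {pid}: {open_fd_count(pid)} open fds, soft={soft}, hard={hard}")
```

Run it on the head node (inside the container if Ray runs in Docker) as the same user that owns the Ray processes, since reading another user's `/proc/<pid>/fd` requires elevated permissions.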

jonahrosenblum · April 14, 2021, 1:40am

How can I restart the head node without shutting down the cluster? If I reboot it through the AWS console, I lose access to the container. If I use ray down/ray up like this answer recommends, it will shut down the node and spin up a new one.

Alex · April 14, 2021, 3:02am

You’ll most likely have to restart the head node at least once to change your configuration. If you’re using Docker, you will want to start your container with the correct ulimit by adding the docker run option --ulimit nofile=65535:65535.
