Ray Autoscaler Too Many Files Open

I am having an issue with using the Ray autoscaler on AWS. The head node seems to work fine, but when the autoscaler tries to spin up new nodes I get the following error:

(pid=gcs_server) E0412 21:52:02.397838248 168 tcp_server_posix.cc:213] Failed accept4: Too many open files
2021-04-12 21:52:08,435 WARNING worker.py:1086 -- A worker died or was killed while executing task ffffffffffffffff41431ab4001c6d2e49c9459101000000.

Followed shortly by

service_based_gcs_client.cc:229: Couldn't reconnect to GCS server. The last attempted GCS server address was 172.31.17.191:40789

I’m assuming the GCS server is crashing somehow, maybe related to the number of files that the system allows to be open at any given time. I don’t think it’s a resource constraint issue because I’ve used a variety of node types. I’m really not sure what to make of this. Thanks for the help!

A suggestion that may help: please check the ulimit on the nodes and, if possible, increase it. I had a similar kind of issue in my local setup.
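For reference, here is a quick way to check what a node currently allows (a rough sketch; the gcs_server process name comes from the log above, and the values you see will depend on your AMI):

# Soft limit on open file descriptors for the current shell session
ulimit -Sn
# Hard limit (the ceiling a non-root user can raise the soft limit to)
ulimit -Hn
# Limits of the running GCS server process itself (head node only)
grep "open files" /proc/$(pgrep -o gcs_server)/limits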

I did try this. The highest I can set ulimit to is 4096, and I’m not sure if the issue is the ulimit of my head node or of the worker nodes.

gcs_server only runs on the head node, so that is where you need to increase the ulimit. If you run into permission issues raising it, check Setting ulimits on EC2 instances
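If it is the hard limit that is capping you at 4096, one common way to raise it on the instance is through the PAM limits configuration. This is a sketch with example values; the file name and the 65535 limit are illustrative, and a fresh login session is needed for it to take effect:

# Raise the soft and hard open-file limits for all users (example value: 65535)
echo "* soft nofile 65535" | sudo tee -a /etc/security/limits.d/99-nofile.conf
echo "* hard nofile 65535" | sudo tee -a /etc/security/limits.d/99-nofile.conf
# Reconnect (new SSH session) and verify the new hard limit
ulimit -Hn

I believe the example cluster YAMLs in the Ray docs take a related shortcut and simply prefix the start command, e.g. ulimit -n 65536; ray start --head ... in head_start_ray_commands.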


How can I reset the head node without shutting down the cluster? If I reboot it through the AWS console, I lose access to the container. If I use ray down/ray up as this answer recommends, it will shut down the node and spin up a new one.

You’ll most likely have to restart the head node at least once to change your configuration. If you’re using docker, you will want to start your docker container with the correct ulimit (by adding the docker run option --ulimit nofile=65535:65535).
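For example, you can sanity-check that the option is picked up by a container before wiring it into your cluster config (the image tag and limit values here are just illustrative):

# Start a throwaway Ray container with a raised open-file limit and print what it sees
docker run --rm --ulimit nofile=65535:65535 rayproject/ray:latest bash -c 'ulimit -Sn; ulimit -Hn'

If you are launching containers through the cluster YAML rather than by hand, I believe the same flag can go into the run_options list under the docker: section of your config, so the autoscaler starts every node's container with it.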
