I am having an issue with using the Ray autoscaler on AWS. The head node seems to work fine, but when the autoscaler tries to spin up new nodes I get the following error:
(pid=gcs_server) E0412 21:52:02.397838248 168 tcp_server_posix.cc:213] Failed accept4: Too many open files
2021-04-12 21:52:08,435 WARNING worker.py:1086 -- A worker died or was killed while executing task ffffffffffffffff41431ab4001c6d2e49c9459101000000.
Followed shortly by
service_based_gcs_client.cc:229: Couldn't reconnect to GCS server. The last attempted GCS server address was 172.31.17.191:40789
I’m assuming the GCS server is crashing somehow, maybe related to the number of files that the system allows to be open at any given time. I don’t think it’s a resource constraint issue because I’ve used a variety of node types. I’m really not sure what to make of this. Thanks for the help!
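One way to check the open-files theory is to look at the limit the head node's processes actually run under. Here is a rough sketch using only the Python standard library; the function name `describe_fd_usage` is made up for illustration, and the `/proc/self/fd` count is Linux-specific and only covers the process running the script, not the GCS server itself.

```python
import os
import resource


def describe_fd_usage():
    """Report the open-file limits and descriptor count for this process."""
    # Soft limit is what the kernel enforces; hard limit is the ceiling
    # an unprivileged process may raise the soft limit to.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # On Linux, each entry in /proc/self/fd is one open descriptor.
    # The directory is missing on other platforms, so report None there.
    fd_dir = "/proc/self/fd"
    open_fds = len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else None

    return {"soft": soft, "hard": hard, "open_fds": open_fds}
```

Running `print(describe_fd_usage())` inside the head node's container should show whether the soft limit is a low default such as 1024, which would be consistent with the `Failed accept4: Too many open files` message.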
How can I reset the head node without shutting down the cluster? If I reboot it through the AWS console, I lose access to the container. If I use ray down/ray up, as this answer recommends, it shuts down the node and spins up a new one.
You’ll most likely have to restart the head node at least once to change your configuration. If you’re using Docker, you will want to start your container with the correct ulimit by adding the docker run option --ulimit nofile=65535:65535.
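After the restart, it may be worth confirming from inside the container that the new limit took effect. The sketch below is only an illustration: `ensure_nofile_limit` is a hypothetical helper, and the 65535 default simply mirrors the value in the docker option above. It also tries to raise the soft limit toward the hard limit, but that only affects the calling process and its children, so it is not a substitute for the docker setting.

```python
import resource


def ensure_nofile_limit(required=65535):
    """Return True if the soft open-file limit is at least `required`.

    If the soft limit is too low, try to raise it as far as the hard
    limit allows. This changes only the current process and its children.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY

    if soft != unlimited and soft < required:
        # Never ask for more than the hard limit permits.
        target = required if hard == unlimited else min(required, hard)
        if target > soft:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
            soft = target

    return soft == unlimited or soft >= required
```

If this returns False inside the container, the hard limit itself is below the requested value, which would suggest the --ulimit option was not applied when the container was started.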