How severe does this issue affect your experience of using Ray?
High
Sometimes the Ray GCS complains that its TCP server has too many open files:
(pid=gcs_server) E0927 23:54:58.340750467 1092 tcp_server_posix.cc:213] Failed accept4: Too many open files
This is similar to previous issues, e.g. Ray Autoscaler Too Many Files Open - #4 by sangcho.
I tried the approach from "Setting ulimits on EC2 instances" on my local machine. My head node is local and the cluster is a group of EC2 machines. I’m assuming the GCS is hosted on the local machine, so this should suffice.
This didn’t work.
I am also wondering how GCS garbage collects connections. When you manually kill an actor with ray.kill() does that free up GCS resources? My workload launches a lot of actors, kills all of them, and does it all over again many times.
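One way to check whether killed actors actually release their connections is to watch how many file descriptors the gcs_server process holds over time. The sketch below is only a diagnostic idea, not part of Ray: it assumes a Linux head node (it reads /proc/<pid>/fd), and the helper names `count_open_fds` and `sample_fd_counts` are made up for illustration.

```python
import os
import time


def count_open_fds(pid="self"):
    """Count entries in /proc/<pid>/fd (Linux only; needs permission to read that process)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample_fd_counts(pid, samples=5, interval_s=1.0):
    """Return a list of descriptor counts taken `samples` times, `interval_s` seconds apart."""
    counts = []
    for i in range(samples):
        counts.append(count_open_fds(pid))
        if i < samples - 1:
            time.sleep(interval_s)
    return counts


if __name__ == "__main__":
    # With no argument this inspects the current process; pass the gcs_server PID instead.
    print(sample_fd_counts("self", samples=3, interval_s=0.5))
```

If the count for gcs_server keeps climbing across launch/kill cycles and never drops, that would suggest connections are not being released; if it plateaus, the limit itself is probably just too low.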
This problem is extremely annoying because the Ray cluster basically becomes unusable once it happens: new actors fail to register, so I have to take down the entire cluster and relaunch it.
As a stopgap, I’d greatly appreciate it if somebody could tell me how to set the ulimit for the gcs_server process. Running ulimit in the terminal right before launching the Ray app doesn’t work.
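Before changing anything, it may help to confirm which limit the running gcs_server actually has, since that can differ from what `ulimit -n` prints in a new terminal. A minimal sketch, assuming a Linux head node where /proc/<pid>/limits is readable; `parse_max_open_files` and `read_max_open_files` are hypothetical helper names, not Ray APIs.

```python
def _to_limit(value):
    """Convert a limits field to int, or None when the kernel reports 'unlimited'."""
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits text."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return _to_limit(soft), _to_limit(hard)
    raise ValueError("no 'Max open files' row found")


def read_max_open_files(pid):
    """Read the effective open-file limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_max_open_files(f.read())


if __name__ == "__main__":
    # "self" inspects this Python process; substitute the gcs_server PID.
    print(read_max_open_files("self"))
```

The PID can be looked up with something like `pgrep -f gcs_server` on the head node. If the soft limit shown there is still the default, the ulimit change never reached the process.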
As for the ulimit setting
I might be wrong, but I thought it would be a system-level config for all processes. Does this post help? The GCS always lives on the head node, so if you can configure that on the head node, I believe it should take effect.
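One caveat worth checking: on Linux the open-file limit is a per-process setting that child processes inherit from whatever started them, so a change made in one shell only reaches processes launched from that shell afterwards. The sketch below illustrates the inheritance with the standard `resource` module; it assumes a POSIX system, and `child_nofile_limit` is a made-up helper, not anything from Ray.

```python
import resource
import subprocess
import sys


def child_nofile_limit(soft_limit):
    """Start a child Python process with the given soft RLIMIT_NOFILE and return the limit it reports."""
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    def apply_limit():
        # Runs in the child just before exec, so only the child's limit changes.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))

    result = subprocess.run(
        [sys.executable, "-c",
         "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"],
        preexec_fn=apply_limit,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("parent soft limit:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
    print("child soft limit:", child_nofile_limit(256))
```

By the same logic, gcs_server should pick up the limit of the process that starts Ray on the head node, so the limit would need to be raised in that same shell or service before Ray starts, and an already running cluster would need a restart to see it.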
I launch them at around 100 actors per minute and kill them at the same rate.
OK, this doesn’t sound like too much. I believe we GC actor resources once the actors go out of scope (which is why I asked whether deleting the handles works). There might be some system configs on the GCS for this. @yic
Can you give us a repro script that we can try? Setting a high ulimit on the head node is best practice in Ray (as you will have lots of connections), and actor connections should be GC’ed, IIUC. But we can try reproducing it.