GCS too many open files

How severely does this issue affect your experience of using Ray?

  • High

Sometimes the Ray GCS will complain that the TCP server has too many open files:

(pid=gcs_server) E0927 23:54:58.340750467 1092 tcp_server_posix.cc:213] Failed accept4: Too many open files
This is similar to previous issues, e.g. Ray Autoscaler Too Many Files Open - #4 by sangcho.

I tried the approach from Setting ulimits on EC2 instances, but on my local machine. My head node is local and the cluster is a set of EC2 machines. I'm assuming the GCS is hosted on the local machine, so this should suffice.

This didn’t work.

I am also wondering how GCS garbage collects connections. When you manually kill an actor with ray.kill() does that free up GCS resources? My workload launches a lot of actors, kills all of them, and does it all over again many times.

This problem is extremely annoying, as the Ray cluster basically becomes unusable once this happens: new actors fail to register, so I have to take down the entire cluster and relaunch it.
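One way to see whether the GCS is actually leaking connections is to watch the open file descriptor count of the gcs_server process over time. A minimal sketch, assuming a Linux head node (the helper name `open_fd_count` is mine, not part of Ray):

```python
import os

def open_fd_count(pid: int) -> int:
    """Count open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# Demo on the current process; on the head node you would substitute the
# gcs_server PID, e.g. obtained from `pgrep gcs_server`.
print(open_fd_count(os.getpid()))
```

If this number keeps climbing as actors are created and killed, connections are not being released.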

Hey @marsupialtail - thanks for the question and the previous issue linked.

A couple of asks from me:

  1. Would you mind sharing more about the rate at which actors are launched/killed/relaunched here?
  2. Do you also delete (or make them go out of scope) the actor handles of those killed actors?

I launch them at around 100 actors per minute and kill them at the same rate.

I don’t delete actor handles after I kill them. Should I?

I have had bad experiences with trying to use del to influence Python’s garbage collector, but I will try it.
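For what it's worth, `del` on its own only removes one reference; in CPython the object is freed as soon as the last reference is gone. A minimal illustration with a plain Python object (no Ray involved; `Handle` is just a stand-in class):

```python
import weakref

class Handle:
    """Stand-in for an actor handle; any plain object works here."""
    pass

h = Handle()
alias = h              # a second reference, e.g. the handle stored in a list
ref = weakref.ref(h)   # lets us observe when the object is actually freed

del h
print(ref() is None)   # False: `alias` still keeps the object alive

del alias
print(ref() is None)   # True: last reference gone, object freed (CPython)
```

So if a handle is also stored in a list, dict, or closure somewhere, `del` on the local name won't release it.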

Deleting the handle didn't help.

As a stopgap solution, I'd greatly appreciate it if somebody could tell me how to set the ulimit for the gcs_server process. Running ulimit in the terminal right before launching the Ray app doesn't work.

Yeah, setting the ulimit on the head node solves this problem. But I guess this isn't a sustainable solution if actors are not GC'ed in the GCS.
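For reference, the limit is inherited per process: gcs_server gets the soft limit of the shell that runs `ray start`, so `ulimit` has to be raised in that same shell before launching. A sketch (65536 is an arbitrary example value and must not exceed the hard limit):

```shell
# Inspect the current soft and hard open-file limits in this shell.
echo "soft: $(ulimit -Sn)"
echo "hard: $(ulimit -Hn)"

# Raise the soft limit for this shell and its children, then start the
# head node from the same shell so gcs_server inherits it, e.g.:
#   ulimit -n 65536
#   ray start --head
```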

Hey @marsupialtail sorry for the late reply here.

As for the ulimit setting:
I might be wrong, but I thought it would be a system-level config for all processes? Does this post help? The GCS always lives on the head node, so if you configure it on the head node, I believe it should take effect.
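If you want a persistent, system-wide setting rather than a per-shell one, on most Linux distributions it lives in /etc/security/limits.conf. A sketch (the username `ubuntu` is just an example; it takes effect on the next login session, and whether it applies to non-login or systemd-launched processes depends on PAM configuration):

```
# /etc/security/limits.conf
ubuntu  soft  nofile  65536
ubuntu  hard  nofile  65536
```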

I launch them at around 100 actors per minute and kill them at the same rate.

OK, this doesn't sound like too much. I believe we do GC actor resources once the handles go out of scope (which is why I asked whether deleting the handles works). There might be some system configs on the GCS for this. @yic


Can you give us a repro script that we can try? Setting a high ulimit on the head node is best practice in Ray (as you will have lots of connections), and actor connections should be GC'ed IIUC. But we can try reproducing it.

Sorry, it's hard to give a reproduction, but I was able to solve this by increasing the ulimit.

I’m not sure if this is the same as my case, but I posted a simple script that reproduces the issue: