GCS too many open files

marsupialtail · September 27, 2022, 11:57pm

How severe does this issue affect your experience of using Ray?

High

Sometimes the Ray GCS complains that its TCP server has too many open files:
(pid=gcs_server) E0927 23:54:58.340750467 1092 tcp_server_posix.cc:213] Failed accept4: Too many open files
This is similar to previous issues, e.g. here: Ray Autoscaler Too Many Files Open - #4 by sangcho.
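
One way to confirm that descriptors are really accumulating is to count the entries under /proc/<pid>/fd for the process in question. A minimal sketch, assuming Linux; the helper name count_open_fds is hypothetical, and the example inspects its own process rather than a real gcs_server PID:

```python
import os

def count_open_fds(pid="self"):
    """Count open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

baseline = count_open_fds()

# Open a few extra descriptors to show the count moving.
handles = [open(os.devnull) for _ in range(5)]
grown = count_open_fds()

for h in handles:
    h.close()

print(f"open fds: {baseline} -> {grown}")
```

Calling the same helper periodically with the gcs_server PID would show whether the count keeps growing across launch/kill cycles or levels off.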

I tried the approach from “Setting ulimits on EC2 instances” on my local machine. My head node is local and the cluster is a set of EC2 machines. I’m assuming the GCS is hosted on the local machine, so this should suffice.

This didn’t work.

I am also wondering how the GCS garbage collects connections. When you manually kill an actor with ray.kill(), does that free up GCS resources? My workload launches a lot of actors, kills all of them, and repeats this many times.

This problem is extremely annoying: once it happens, new actors fail to register and the Ray cluster becomes basically unusable, so I have to take down the entire cluster and relaunch.

rickyyx · September 29, 2022, 4:21pm

Hey @marsupialtail - thanks for the question and the previous issue linked.

A couple of asks from me:

Would you mind sharing more about the rate at which actors are launched, killed, and relaunched?
Do you also delete the actor handles of the killed actors (or let them go out of scope)?

marsupialtail · September 30, 2022, 4:33am

I launch them at around 100 actors per minute and kill them at the same rate.

I don’t delete actor handles after I kill them. Should I?

I have had bad experiences with trying to use del to influence Python’s garbage collector, but I will try it.

marsupialtail · September 30, 2022, 6:41pm

Deleting the handle didn’t help.

marsupialtail · September 30, 2022, 7:10pm

As a stopgap solution, I’d greatly appreciate it if somebody could tell me how to set the ulimit for the gcs_server process. Running ulimit in the terminal right before launching the Ray app doesn’t work.
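
To see which limit a running process actually received, Linux exposes it in /proc/<pid>/limits. A minimal sketch, assuming Linux; the helper name read_nofile_limit is hypothetical, and the example reads the limit of its own process:

```python
def read_nofile_limit(pid="self"):
    """Return (soft, hard) 'Max open files' for a process (Linux only).

    Values are returned as strings because a limit may read 'unlimited'.
    """
    label = "Max open files"
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith(label):
                # Remaining columns: soft limit, hard limit, units
                soft, hard = line[len(label):].split()[:2]
                return soft, hard
    raise RuntimeError(f"'{label}' not found for pid {pid}")

soft, hard = read_nofile_limit()
print(f"Max open files: soft={soft} hard={hard}")
```

Passing the PID of the gcs_server process (for example, one found with pgrep -f gcs_server on the head node) instead of "self" would show whether a ulimit change reached that process. Limits are inherited when a process starts, so a change made in a shell after the cluster is already running would not be expected to affect it.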

marsupialtail · October 2, 2022, 9:10pm

Yeah setting the ulimit on the head node solves this problem. But I guess this is not a sustainable solution if actors are not GC’ed in the GCS.
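
For reference, the soft limit can also be raised from Python with the standard resource module, up to the hard limit. This is only a sketch under assumptions: the target of 65536 is an arbitrary illustrative value, the helper name raise_nofile_limit is hypothetical, and it only helps if it runs in the process that later starts Ray on the head node, since child processes inherit the limit at launch.

```python
import resource

def raise_nofile_limit(target=65536):
    """Raise the soft RLIMIT_NOFILE toward `target`, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # The OS refused the new value; keep the current limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)

before = resource.getrlimit(resource.RLIMIT_NOFILE)
after = raise_nofile_limit()
print("RLIMIT_NOFILE (soft, hard):", before, "->", after)
```

Raising the hard limit itself usually requires elevated privileges or a system-level setting, which is outside what this sketch does.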

rickyyx · October 3, 2022, 6:06pm

Hey @marsupialtail sorry for the late reply here.

As for the ulimit setting: I might be wrong, but I thought it was a system-level config that applies to all processes. Does this post help? The GCS always lives on the head node, so if you can configure it on the head node, I believe it should take effect.

“I launch them at around 100 actors per minute and kill them at the same rate.”

OK, this doesn’t sound like too much. I believe we do GC actor resources once the handles go out of scope (which is why I asked whether deleting the handles works). There might be some system configs on the GCS for this. @yic

sangcho · October 4, 2022, 6:51am

Can you give us a repro script that we can try? Setting a high ulimit on the head node is the best practice in Ray (as you will have lots of connections), and actor connections should be GC’ed IIUC. But we can try reproducing it.

marsupialtail · October 4, 2022, 4:24pm

Sorry it’s hard to give a reproduction, but I was able to solve this by increasing the ulimit.

k24d · February 5, 2023, 7:29pm

I’m not sure if this is the same as my case, but I posted a simple script that reproduces the issue:
