How to get the GCS server memory distribution to debug continuously increasing memory?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I launched a long-running Ray cluster. Our teammates use the cluster via Ray Client and Ray Jobs.

I know the GCS server stores more and more metadata as more jobs are submitted, but I need to understand why the memory increases so significantly.

My situation: a 100-worker cluster running for 3 days, 60+ jobs submitted, 1000+ actors (most of them finished), but GCS memory has now grown to nearly 15 GB.

Which version are you using? Before 2.3, some observability data was stored inside GCS, which led to a bad memory footprint.

If you are using 2.3, do you have a simple way to reproduce this? We’ll take a look at it.

@yic Sorry for the delayed reply. I am using 2.3.0, and unfortunately I have no simple way to reproduce this on our large cluster.

Is there any simple way to know how much actor metadata and runtime resource metadata is stored in GCS memory (this is just my inference)?

  1. Launch a pod in my k8s cluster with 8c32g.
  2. pip install "ray[default]"==2.3.1
  3. Exec into the pod and launch Ray with:
ray start --head --block --port=6380 --dashboard-host="0.0.0.0"
  4. Repeat submitting the job 100 times from a local laptop with:
seq 100 | xargs -Iz ray job submit --runtime-env-json='{"working_dir": "./"}' -- python3 test.py

My job code (test.py):

import ray

ray.init(address='auto')

# Each actor requests a tiny CPU fraction so 100 of them fit on one node.
@ray.remote(num_cpus=0.01)
class MyActor:
    def ping(self):
        return 100

# Create 100 actors and ping each one; they die when the driver exits,
# but their metadata stays behind in GCS.
ACTOR_NUM = 100
l = []
for i in range(ACTOR_NUM):
    l.append(MyActor.remote())

for actor in l:
    ray.get(actor.ping.remote())

print("Job Done")
  5. Observe the memory (RES) growth of gcs_server: each job adds roughly 15 MB, and it never goes down until the process OOMs.
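To make the per-job growth easier to see, something like the following can be run on the head pod. It is only a sketch: it assumes psutil is installed there, and the 60-second sampling interval is an arbitrary choice.

import time

import psutil  # assumed to be available on the head pod


def gcs_server_rss_mb():
    """Return the RES (RSS) of the gcs_server process in MiB, or None if not found."""
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "gcs_server":
            return proc.info["memory_info"].rss / (1024 * 1024)
    return None


while True:
    rss = gcs_server_rss_mb()
    if rss is not None:
        print(f"gcs_server RES: {rss:.1f} MiB")
    time.sleep(60)  # sample once a minute while jobs are being submitted

Printing one line per minute while the 100 jobs run makes the roughly-15-MB-per-job step pattern visible.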

@Wanxing_Wang I believe GCS caches metadata from jobs and might not be cleaning it up promptly in this case, for example, metadata for dead actors.
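If it helps to confirm how many dead actors GCS is still tracking, the state observability CLI in 2.3 should give a rough count. This is only a sketch; the exact --filter syntax is assumed from the state API docs and may differ slightly across versions:

ray list actors --filter "state=DEAD"

If I remember correctly, the listing is truncated by default, so you may need to raise --limit to count everything.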

What if you set RAY_maximum_gcs_destroyed_actor_cached_count to a smaller value? (Its default is 100k.)
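For example, on the head node from step 3 above, this could look roughly like the following, assuming RAY_-prefixed environment variables are picked up as config overrides by the process that starts the GCS; 10000 is just an arbitrary example value:

RAY_maximum_gcs_destroyed_actor_cached_count=10000 ray start --head --block --port=6380 --dashboard-host="0.0.0.0"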


Thank you very much! We will try adjusting this environment variable in our production environment.