How to get the GCS server memory distribution to debug continuously increasing memory?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I launched a long-running Ray cluster. Our teammates use the cluster via Ray Client and Ray Jobs.

I know the GCS server stores more and more metadata as more jobs are submitted, but I need to understand why the memory increases so significantly.

My situation: a 100-worker cluster running for 3 days, 60+ jobs submitted, 1000+ actors (most of them finished), but GCS memory has now grown to nearly 15 GB.

Which version are you using? Before 2.3, some observability data was stored inside GCS, which led to a bad memory footprint.

If you are using 2.3, do you have a simple way to reproduce this? We’ll take a look at it.

@yic Sorry for the delayed reply. I am using 2.3.0, and unfortunately I have no simple way to reproduce this on our large cluster.

Is there any simple way to know how much actor metadata and runtime resource metadata is stored in GCS memory (this is just my inference)?

  1. Launch a pod in my k8s cluster with 8c32g.
  2. pip install "ray[default]"==2.3.1
  3. Exec into the pod and launch Ray with:
ray start --head --block --port=6380 --dashboard-host="0.0.0.0"
  4. Repeat submitting the job 100 times from a local laptop with:
seq 100 | xargs -Iz ray job submit --runtime-env-json='{"working_dir": "./"}' -- python3 test.py

My job code (test.py):

import ray

ray.init(address='auto')

# Each actor requests a tiny CPU fraction so 100 of them fit on one node.
@ray.remote(num_cpus=0.01)
class MyActor:
    def ping(self):
        return 100

# Create 100 actors and ping each one; they die when the driver exits,
# but their metadata stays behind in GCS.
ACTOR_NUM = 100
l = []
for i in range(ACTOR_NUM):
    l.append(MyActor.remote())

for actor in l:
    ray.get(actor.ping.remote())

print("Job Done")
  5. Observe the memory (RES) growth of gcs_server: each job adds roughly 15 MB, and it never goes down until the process OOMs.
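To make the per-job growth easier to see, something like the following can be run on the head pod. It is only a sketch: it assumes psutil is installed there, and the 60-second sampling interval is an arbitrary choice.

import time

import psutil  # assumed to be available on the head pod


def gcs_server_rss_mb():
    """Return the RES (RSS) of the gcs_server process in MiB, or None if not found."""
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "gcs_server":
            return proc.info["memory_info"].rss / (1024 * 1024)
    return None


while True:
    rss = gcs_server_rss_mb()
    if rss is not None:
        print(f"gcs_server RES: {rss:.1f} MiB")
    time.sleep(60)  # sample once a minute while jobs are being submitted

Printing one line per minute while the 100 jobs run makes the roughly-15-MB-per-job step pattern visible.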

@Wanxing_Wang I believe GCS caches metadata from jobs and might not be cleaning it up promptly in this case, for example, metadata for dead actors.
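If it helps to confirm how many dead actors GCS is still tracking, the state observability CLI in 2.3 should give a rough count. This is only a sketch; the exact --filter syntax is assumed from the state API docs and may differ slightly across versions:

ray list actors --filter "state=DEAD"

If I remember correctly, the listing is truncated by default, so you may need to raise --limit to count everything.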

What if you set RAY_maximum_gcs_destroyed_actor_cached_count to a smaller value? (Its default is 100k.)
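For example, on the head node from step 3 above, this could look roughly like the following, assuming RAY_-prefixed environment variables are picked up as config overrides by the process that starts the GCS; 10000 is just an arbitrary example value:

RAY_maximum_gcs_destroyed_actor_cached_count=10000 ray start --head --block --port=6380 --dashboard-host="0.0.0.0"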


Thank you very much! We will try adjusting this environment variable in our production environment.