[issue] Abnormal memory increase in head node gcs

AndreKuu · May 17, 2023, 7:15am

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

I am using Ray jobs to do some batch data processing. I found an abnormal memory increase on the head node, in the memory used by the GCS server. It grows continuously by about 8 GB per hour until OOM, and the job actually hangs once the abnormal memory growth begins.

This did not happen at the beginning of the job, nor in smaller jobs. It only happened after the job had run for more than 6 hours and more than 50000 tasks.

Other related information:

Kubernetes manages the head node and worker nodes.
The number of worker nodes is 400.
Only one job was running when this happened.
Both CPU and GPU tasks exist.
Worker nodes went offline and came back online a few times because of OOM and disk pressure.
log_to_driver == False
When the GCS memory usage on the head node began to increase abnormally, it looks like no new task agents were created. It looks like the job also hangs.

Can anybody give me some suggestions about this issue? How can I debug it? Are there any logs I can collect?
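One way to put a number on the "8 GB per hour" growth is to sample the resident memory of the GCS process on the head node at intervals. The following is a minimal sketch, not a Ray API: it assumes a Linux head node, assumes the process name is `gcs_server`, and all helper names (`parse_vmrss`, `growth_gib_per_hour`, `find_pids`, `current_rss`) are hypothetical.

```python
from pathlib import Path

GIB = 1024 ** 3


def parse_vmrss(status_text):
    """Return resident set size in bytes from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB".
            return int(line.split()[1]) * 1024
    return None


def growth_gib_per_hour(samples):
    """Average growth between the first and last (unix_seconds, rss_bytes) samples."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need two samples with increasing timestamps")
    return (r1 - r0) / GIB / ((t1 - t0) / 3600.0)


def find_pids(name="gcs_server"):
    """Return PIDs whose /proc/<pid>/comm equals name (assumed process name)."""
    proc = Path("/proc")
    if not proc.is_dir():
        return []  # not a Linux host
    pids = []
    for entry in proc.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            if (entry / "comm").read_text().strip() == name:
                pids.append(int(entry.name))
        except OSError:
            continue  # process exited or is not readable
    return pids


def current_rss(pid):
    """Return the current RSS of pid in bytes, or None if it cannot be read."""
    try:
        return parse_vmrss(Path(f"/proc/{pid}/status").read_text())
    except OSError:
        return None


if __name__ == "__main__":
    import time

    # Print one timestamped reading per matching process.
    for pid in find_pids():
        print(int(time.time()), pid, current_rss(pid))
```

Running this periodically (for example from cron) and passing the collected `(timestamp, rss)` pairs to `growth_gib_per_hour` shows when the growth starts and how fast it is.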

AndreKuu · May 17, 2023, 8:10am

cade · May 17, 2023, 5:56pm

Thanks for reporting this! Looks bad… Can you share the GCS logs? Also, what Ray version are you on?

GCS logs can be found in /tmp/ray/session_latest/logs/gcs_server.out

cc @yic
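The GCS log can get large on a long-running cluster, so sharing only its tail is often enough to start with. A minimal sketch, assuming the log path mentioned above; the `tail` helper and the 200-line default are illustrative choices, not part of Ray.

```python
from collections import deque
from pathlib import Path

# Path taken from the post above.
GCS_LOG = Path("/tmp/ray/session_latest/logs/gcs_server.out")


def tail(path, n=200):
    """Return the last n lines of a text file without loading it all into memory."""
    with open(path, "r", errors="replace") as f:
        return list(deque(f, maxlen=n))


if __name__ == "__main__":
    if GCS_LOG.exists():
        print("".join(tail(GCS_LOG)), end="")
    else:
        print(f"{GCS_LOG} not found; run this on the head node")
```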

AndreKuu · May 31, 2023, 9:12am

Sure, thank you cade. The version of Ray I'm using is 2.0.1.
I reported this issue on GitHub. It looks like it has been fixed now.

AndreKuu · May 31, 2023, 9:12am

github.com/ray-project/ray

[Core] abnormal memory increase in head node gcs

opened 09:04AM - 17 May 23 UTC

closed 05:22PM - 26 May 23 UTC

AndreKuu

bug P1 @external-author-action-required core

I am using Ray jobs to do some batch data processing. I found an abnormal memory increase on the head node, in the memory used by the GCS server. It grows continuously by about 8 GB per hour until OOM, and the job actually hangs. This did not happen at the beginning of the job, nor in smaller jobs. It only happened after the job had run for more than 6 hours and more than 50000 tasks.

Other related information: Kubernetes manages the head node and worker nodes.
1. The number of worker nodes is 400.
2. Only one job was running when this happened.
3. Both CPU and GPU tasks exist.
4. Worker nodes went offline and came back online a few times because of OOM and disk pressure.
5. log_to_driver == False

When the GCS memory usage on the head node began to increase abnormally, it looks like no new task agents were created. It looks like the job also hangs. Any ideas?

### Versions / Dependencies

version: ray 2.0.1

### Reproduction script

-

### Issue Severity

None

AndreKuu · May 31, 2023, 9:16am

I reproduced this problem with Ray 2.4.0.

I can provide all the logs if you need them, if that helps get the problem looked at and fixed. gcs_server.out is here: ray_logs/gcs_server.out at master · AndreKuu/ray_logs · GitHub

sangcho · June 1, 2023, 2:23pm

There was a relevant fix included in Ray 2.5 (will be released in a few days). Let us know how this works. If this is not fixed, please follow up on the GitHub issue!

AndreKuu · June 4, 2023, 8:25am

Yep. I will keep an eye on whether this issue is fixed in the new version. Thanks, sangcho.
