[issue] Abnormal memory increase in head node GCS

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am using a Ray job to do some batch data processing. I found an abnormal memory increase on the head node, in the memory used by the GCS server. It grows continuously by 8 GB per hour until OOM, and the job actually hangs once the abnormal memory increase begins.

This didn't happen at the beginning of the job or in small jobs. It only happened after the job had run for more than 6 hours and more than 50000 tasks.

Other related information:

  1. Kubernetes is used to manage the head node and worker nodes.
  2. The number of worker nodes is 400.
  3. Only one job was running when this happened.
  4. Both CPU and GPU tasks exist.
  5. Worker nodes went offline and came back online a few times because of OOM and disk pressure.
  6. log_to_driver == False
  7. When the GCS memory usage on the head node began to increase abnormally, it looks like no more task agents were created. It looks like the job also hangs.

Can anybody give me some suggestions about this issue? How can I debug it? Are there any logs I can collect?

Thanks for reporting this! Looks bad… Can you share the GCS logs? Also, what Ray version are you on?

GCS logs can be found in /tmp/ray/session_latest/logs/gcs_server.out
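Before uploading the whole file, it can help to skim the tail of that log for suspicious lines. The following is only a minimal sketch: the keyword list is hypothetical and chosen for illustration (it is not an official set of Ray log markers), and the path assumes the default location mentioned above.

```python
from collections import Counter, deque
from pathlib import Path

# Default location mentioned in the reply above; adjust if your cluster
# writes Ray logs somewhere else.
GCS_LOG = Path("/tmp/ray/session_latest/logs/gcs_server.out")

# Hypothetical keywords for illustration only, not an official list.
KEYWORDS = ("error", "warning", "dead", "timeout")


def tail_lines(path, n=2000):
    """Return the last n lines of a text file without keeping the whole file in memory."""
    with open(path, "r", errors="replace") as f:
        return list(deque(f, maxlen=n))


def count_keywords(lines, keywords=KEYWORDS):
    """Count how many lines contain each keyword (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for kw in keywords:
            if kw in lowered:
                counts[kw] += 1
    return counts


if __name__ == "__main__":
    if GCS_LOG.exists():
        for kw, c in count_keywords(tail_lines(GCS_LOG)).most_common():
            print(f"{kw}: {c}")
    else:
        print(f"{GCS_LOG} not found; run this on the head node")
```

Run it on the head node (or inside the head pod on Kubernetes). A sudden jump in one of the counts around the time the memory starts growing is a hint about where to look, not a diagnosis.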

cc @yic

Sure, thank you cade. The version of Ray I'm using is 2.0.1.
I reported this issue on GitHub. It looks like it is fixed now.

I reproduced this problem with Ray 2.4.0.
I can provide all the logs if you need them, if someone is able to look into this and make sure the problem gets fixed. gcs_server.out is here: ray_logs/gcs_server.out at master · AndreKuu/ray_logs · GitHub

There was a relevant fix included in Ray 2.5 (will be released in a few days). Let us know how this works. If this is not fixed, please follow up on the GitHub issue!
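One way to check whether an upgrade helps is to measure how fast the GCS process grows before and after. This is a rough sketch, assuming a Linux head node where the process is named gcs_server and its PID is known (for example from `pgrep -f gcs_server`); the helper names are made up for illustration and are not part of Ray.

```python
import time


def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def growth_gb_per_hour(samples):
    """Average growth rate between the first and last (seconds, rss_kb) sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    gb = (r1 - r0) / (1024 ** 2)   # kB -> GB (binary units)
    hours = (t1 - t0) / 3600
    return gb / hours


def sample_rss(pid, count=6, interval_s=600):
    """Read the RSS of a process `count` times, `interval_s` seconds apart (Linux only)."""
    samples = []
    for i in range(count):
        with open(f"/proc/{pid}/status") as f:
            samples.append((time.time(), parse_vmrss_kb(f.read())))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

For example, `growth_gb_per_hour(sample_rss(pid))` gives an average over roughly 50 minutes with the defaults. A value near the 8 GB per hour reported in this thread would suggest the problem is still there; a value near zero on a long-running job would suggest the fix works.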

Yep. I will follow up on whether this issue is fixed in the new version. Thanks, sangcho.