Memory leak in ray head

kubav · December 7, 2021, 7:57am

CPU and memory usage on the ray-head pod keeps increasing, and the pod has to be restarted every 3 days.

I have checked that it is not caused by storing objects in the cluster; it is probably caused by the Redis database used by GCS. Records are created in the database but never deleted.

I have tried cleaning the database manually, and some of the records can be safely deleted. For example, the “DASHBOARD*” keys are not needed, and deleting them delays the point at which the head node needs to be restarted.
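The manual cleanup described above could be scripted roughly as follows. This is a minimal sketch, not a Ray-supported procedure: it assumes a redis-py style client (anything exposing `scan_iter` and `delete`), and it assumes that grouping keys by the text before the first `:` gives a useful breakdown, which may not match the actual key layout in your Ray version. The function names are hypothetical, and the connection details in the usage comment are placeholders. Deletion defaults to a dry run so nothing is removed until the matched keys have been reviewed.

```python
from collections import Counter


def count_keys_by_prefix(client, sep=":", scan_count=1000):
    """Count keys grouped by the text before the first `sep`.

    The separator is an assumption about the key layout; keys that do not
    contain it are counted under their full name.
    """
    counts = Counter()
    # SCAN iterates incrementally instead of blocking the server like KEYS.
    for key in client.scan_iter(match="*", count=scan_count):
        if isinstance(key, bytes):
            key = key.decode("utf-8", errors="replace")
        counts[key.split(sep, 1)[0]] += 1
    return counts


def delete_keys_matching(client, pattern="DASHBOARD*", batch_size=500, dry_run=True):
    """Delete keys matching `pattern` in batches; return how many keys matched.

    With dry_run=True (the default) keys are only counted, never deleted.
    """
    matched = 0
    batch = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        matched += 1
        batch.append(key)
        if len(batch) >= batch_size:
            if not dry_run:
                client.delete(*batch)
            batch = []
    # Flush the final partial batch.
    if batch and not dry_run:
        client.delete(*batch)
    return matched


# Usage sketch (host, port and password are placeholders for your head node):
#   import redis
#   client = redis.Redis(host="localhost", port=6379, password=None)
#   print(count_keys_by_prefix(client).most_common(10))
#   print(delete_keys_matching(client, "DASHBOARD*", dry_run=True))
```

Running the prefix count at intervals shows which key families keep growing, which helps decide whether a pattern is worth deleting at all.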

Do you know if this is a Ray bug or a configuration issue on our side?

zhz · December 10, 2021, 1:47pm

Thanks @kubav! FYI, @sangcho is investigating a memory issue that could be related.

Chen_Shen · December 10, 2021, 7:29pm

@kubav were you able to identify which process(es) are the offenders? Also, if possible, let’s move the conversation to [P0][Bug] Memory leak in ray head · Issue #21016 · ray-project/ray · GitHub
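One way to answer the "which process" question is to rank processes by resident memory from inside the head pod. The sketch below is a stdlib-only illustration that assumes a Linux container with `/proc` mounted; the function name is hypothetical, and it reads the `Name` and `VmRSS` fields from `/proc/<pid>/status`. Processes without a `VmRSS` entry (such as kernel threads) are skipped.

```python
import os


def top_processes_by_rss(limit=10, proc_root="/proc"):
    """Return up to `limit` (pid, name, rss_kib) tuples, largest RSS first."""
    rows = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return rows  # no /proc available (e.g. not running on Linux)
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        name, rss_kib = None, None
        try:
            status_path = os.path.join(proc_root, entry, "status")
            with open(status_path, errors="replace") as fh:
                for line in fh:
                    if line.startswith("Name:"):
                        name = line.split(":", 1)[1].strip()
                    elif line.startswith("VmRSS:"):
                        rss_kib = int(line.split()[1])  # value is in kB
        except (OSError, ValueError):
            continue  # process exited mid-read or the file is unreadable
        if name is not None and rss_kib is not None:
            rows.append((int(entry), name, rss_kib))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


# Usage sketch: print the largest processes in the current container.
#   for pid, name, rss_kib in top_processes_by_rss():
#       print(pid, name, rss_kib // 1024, "MiB")
```

Sampling this periodically and comparing the snapshots shows which process is actually growing, rather than which one is merely large.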

ericl · December 15, 2021, 11:20pm

To clarify the triggering condition of this leak: does it happen when running multiple jobs over time? If so, it is likely the issue "Remote function and actor definitions are not garbage collected when drivers exit, so memory increases in cluster setting" (Issue #8822 · ray-project/ray · GitHub).

Or is it something else (e.g., memory increases without any new jobs being run at all)?

kubav · December 16, 2021, 7:56am

Yes, it is the issue you linked.
