Memory leak in ray head

CPU and memory usage on the ray-head pod keeps increasing, and the pod has to be restarted every 3 days.

I have checked that it is not caused by storing objects in the cluster; it is probably caused by the Redis database used by GCS. Records are being created in the database but are never deleted.
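For anyone who wants to check the same thing, here is a minimal sketch that counts Redis keys by prefix, so you can run it a few hours apart and see which groups keep growing. It assumes the third-party `redis` package (redis-py); the host, port, password, and the `:` separator used to split key names are placeholders I picked for illustration, not values confirmed for Ray.

```python
from collections import Counter


def count_keys_by_prefix(client, sep=b":", scan_count=1000):
    """Return a Counter mapping key prefix -> number of keys.

    `client` is anything with a redis-py style `scan_iter` method.
    The separator is an assumption; adjust it to the real key layout.
    """
    counts = Counter()
    # SCAN walks the keyspace incrementally instead of blocking like KEYS.
    for key in client.scan_iter(count=scan_count):
        if isinstance(key, str):
            key = key.encode()
        prefix = key.split(sep, 1)[0]
        counts[prefix.decode(errors="replace")] += 1
    return counts


def main():
    import redis  # third-party: pip install redis

    # Placeholder connection details; use your head node's Redis settings.
    client = redis.Redis(host="127.0.0.1", port=6379, password=None)
    for prefix, n in count_keys_by_prefix(client).most_common(10):
        print(prefix, n)
```

Calling `main()` prints the ten largest prefix groups; comparing two runs shows which records accumulate.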

I have tried to clean the database manually, and some of the records can be safely deleted. For example, the “DASHBOARD*” keys are not needed, and deleting them delays the point at which the head node needs to be restarted.
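The manual cleanup can be sketched roughly like this. It is only an illustration, again assuming redis-py and placeholder connection details; the function name, batch size, and the dry-run default are my own choices. Only the `DASHBOARD*` pattern comes from what I observed, so try it with `dry_run=True` first.

```python
def delete_matching_keys(client, pattern="DASHBOARD*", batch_size=500, dry_run=True):
    """Delete keys matching `pattern` in batches; return how many matched.

    With dry_run=True nothing is deleted and only the count is returned.
    `client` is anything with redis-py style `scan_iter` and `delete`.
    """
    matched = 0
    batch = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            if not dry_run:
                client.delete(*batch)
            matched += len(batch)
            batch = []
    if batch:  # flush the last partial batch
        if not dry_run:
            client.delete(*batch)
        matched += len(batch)
    return matched


def main():
    import redis  # third-party: pip install redis

    # Placeholder connection details; use your head node's Redis settings.
    client = redis.Redis(host="127.0.0.1", port=6379, password=None)
    print("matching keys:", delete_matching_keys(client, dry_run=True))
```

Deleting in batches keeps each `DEL` call small, so the head node's Redis is not blocked for long on a large keyspace.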

Do you know if this is a Ray bug or a configuration issue on our side?

Thanks @kubav! FYI, @sangcho is investigating a memory issue that could be related.

@kubav were you able to identify which process(es) is the offender? Also, if possible, let’s move the conversation to [P0][Bug] Memory leak in ray head · Issue #21016 · ray-project/ray · GitHub


To clarify the triggering condition of this leak: does it happen when running multiple jobs over time? If so, that’s likely this issue: Remote function and actor definitions are not garbage collected when drivers exit, so memory increases in cluster setting · Issue #8822 · ray-project/ray · GitHub

Or is it something else (e.g., memory increases without new jobs being run at all?)

Yes, it is the issue you linked.