Idle workers not releasing resources causing memory error

Hi there! I am using BOHB from Tune, autoscaling on AWS, to launch experiments with a few thousand trials. My last experiment ran for about 4 days and seemed to be doing fine when it suddenly hit a memory error on the head node. I plotted the memory use because I suspected a slow leak, but usage was stable (head_memory plot attached; a sketch of how I sample it is below the process list). The tracebacks around the memory error mention a series of idle Ray processes that are not releasing resources. Any suggestions on how I could go about identifying and fixing the cause?

File "python/ray/_raylet.pyx", line 440, in ray._raylet.execute_task
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/memory_monitor.py", line 132, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-95-132 is used (119.84 / 124.38 GB). The top 10 memory consumers are:

    PID	MEM	COMMAND
    856	2.46GiB	python driver.py
    706	1.27GiB	/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
    22700	0.42GiB	ray::IDLE
    27300	0.42GiB	ray::IDLE
    27301	0.42GiB	ray::IDLE
    24284	0.42GiB	ray::IDLE
    24283	0.42GiB	ray::IDLE
    24294	0.42GiB	ray::IDLE
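For reference, this is roughly how I sample the head node's memory for the plot (a minimal psutil sketch; the one-minute interval and log path are arbitrary choices on my end):

    import time
    import psutil

    # Append total used memory (GiB) once a minute to a simple TSV log.
    with open("/tmp/head_memory.log", "a") as log:
        while True:
            used_gib = psutil.virtual_memory().used / 1024 ** 3
            log.write(f"{time.time():.0f}\t{used_gib:.2f}\n")
            log.flush()
            time.sleep(60)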

Hmm, it's unlikely that those idle processes are the problem here – they only add up to a few GiB at most. Is it possible that one of your trials was trying to load a huge amount of data into memory or something like that?

These are only the top consumers. I don't have the full list of processes, but I suspect there are many more unlisted ray::IDLE workers that somehow accumulated and caused the crash.
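Next time it happens I'll grab the full list with something like this (a minimal psutil sketch, not a Ray API; it assumes a Linux node, where the "ray::IDLE" title the workers set for themselves shows up in the process cmdline):

    import psutil

    # Sum the resident memory of every process whose title contains
    # "ray::IDLE" (Ray workers retitle themselves via setproctitle,
    # which rewrites argv on Linux, so it appears in cmdline).
    idle = []
    for p in psutil.process_iter(attrs=["pid", "cmdline", "memory_info"]):
        cmd = " ".join(p.info["cmdline"] or [])
        mem = p.info["memory_info"]
        if "ray::IDLE" in cmd and mem is not None:
            idle.append((p.info["pid"], mem.rss))

    total_gib = sum(rss for _, rss in idle) / 1024 ** 3
    print(f"{len(idle)} ray::IDLE processes, {total_gib:.2f} GiB RSS total")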

@eoakes I am also facing a similar issue: the ray::ImplicitFunc.train_buffered() and ray::IDLE processes take more than 80% of memory, which nearly fills the node and causes my DQN training to crash.

Hi @Eric_Pfleiderer,
Can you open a GitHub issue and describe your setup with a repro script? We can follow up there. Thanks!

Is there any new progress on this issue?