Idle workers not releasing resources causing memory error

Hi there! I am using BOHB from Tune, autoscaling on AWS, to launch experiments with a few thousand trials. My last experiment ran for about 4 days and seemed to be doing fine when it suddenly hit a memory error on the head node. I plotted the memory use because I suspected a slow leak, but usage was stable (head_memory plot attached; a sketch of how I sample it is below the process list). The tracebacks around the memory error mention a series of idle Ray processes that are not releasing resources. Any suggestions on how I could go about identifying and fixing the cause?

File "python/ray/_raylet.pyx", line 440, in ray._raylet.execute_task
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/memory_monitor.py", line 132, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-95-132 is used (119.84 / 124.38 GB). The top 10 memory consumers are:

    PID	MEM	COMMAND
    856	2.46GiB	python driver.py
    706	1.27GiB	/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
    22700	0.42GiB	ray::IDLE
    27300	0.42GiB	ray::IDLE
    27301	0.42GiB	ray::IDLE
    24284	0.42GiB	ray::IDLE
    24283	0.42GiB	ray::IDLE
    24294	0.42GiB	ray::IDLE
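For reference, this is roughly how I sample the head node's memory for the plot (a minimal psutil sketch; the one-minute interval and log path are arbitrary choices on my end):

    import time
    import psutil

    # Append total used memory (GiB) once a minute to a simple TSV log.
    with open("/tmp/head_memory.log", "a") as log:
        while True:
            used_gib = psutil.virtual_memory().used / 1024 ** 3
            log.write(f"{time.time():.0f}\t{used_gib:.2f}\n")
            log.flush()
            time.sleep(60)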

Hmm, it's unlikely that those idle processes are the problem here – they only add up to a few GiB at most. Is it possible that one of your trials was trying to load a huge amount of data into memory or something like that?

These are only the top consumers. I don't have the full list of processes, but I suspect there are many more unlisted ray::IDLE workers that somehow accumulated and caused the crash.
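Next time it happens I'll grab the full list with something like this (a minimal psutil sketch, not a Ray API; it assumes a Linux node, where the "ray::IDLE" title the workers set for themselves shows up in the process cmdline):

    import psutil

    # Sum the resident memory of every process whose title contains
    # "ray::IDLE" (Ray workers retitle themselves via setproctitle,
    # which rewrites argv on Linux, so it appears in cmdline).
    idle = []
    for p in psutil.process_iter(attrs=["pid", "cmdline", "memory_info"]):
        cmd = " ".join(p.info["cmdline"] or [])
        mem = p.info["memory_info"]
        if "ray::IDLE" in cmd and mem is not None:
            idle.append((p.info["pid"], mem.rss))

    total_gib = sum(rss for _, rss in idle) / 1024 ** 3
    print(f"{len(idle)} ray::IDLE processes, {total_gib:.2f} GiB RSS total")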

@eoakes I am also facing a similar issue: the ray::ImplicitFunc.train_buffered() and ray::IDLE processes take more than 80% of memory, which nearly fills the node and causes my DQN training to crash.

Hi @Eric_Pfleiderer,
Can you open a GitHub issue and describe your setup with a repro script? We can follow up there. Thanks!

Is there any new progress on this issue?