Hi there! I am using BOHB from Ray Tune with autoscaling on AWS to launch experiments with a few thousand trials. My last experiment ran for about 4 days and seemed to be doing fine when it suddenly hit a memory error on the head node. I had suspected a slow leak, so I plotted memory use over the run, but it was stable. The traceback for the memory error lists a series of idle Ray processes (ray::IDLE) that do not seem to be releasing resources. Any suggestions on how I could go about identifying and fixing the cause?
  File "python/ray/_raylet.pyx", line 440, in ray._raylet.execute_task
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/memory_monitor.py", line 132, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-95-132 is used (119.84 / 124.38 GB). The top 10 memory consumers are:
PID     MEM       COMMAND
856     2.46GiB   python driver.py
706     1.27GiB   /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
22700   0.42GiB   ray::IDLE
27300   0.42GiB   ray::IDLE
27301   0.42GiB   ray::IDLE
24284   0.42GiB   ray::IDLE
24283   0.42GiB   ray::IDLE
24294   0.42GiB   ray::IDLE
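
As a sanity check on the numbers in that message, here is a minimal sketch in plain Python (no Ray APIs) that totals the consumers listed above and compares them with the reported node usage. The `REPORT` string is copied from the error, and `parse_rows` is just a hypothetical helper name. It treats the GB and GiB figures as roughly comparable, which is an assumption on my part.

```python
import re

# Consumer list copied from the error message (only 8 entries were printed).
REPORT = """\
856 2.46GiB python driver.py
706 1.27GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *
22700 0.42GiB ray::IDLE
27300 0.42GiB ray::IDLE
27301 0.42GiB ray::IDLE
24284 0.42GiB ray::IDLE
24283 0.42GiB ray::IDLE
24294 0.42GiB ray::IDLE
"""

# One row: PID, memory in GiB, then the command (may contain spaces).
ROW = re.compile(r"^(\d+)\s+([\d.]+)GiB\s+(.+)$")


def parse_rows(report):
    """Return a list of (pid, mem_gib, command) tuples from the pasted report."""
    rows = []
    for line in report.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((int(match.group(1)), float(match.group(2)), match.group(3)))
    return rows


rows = parse_rows(REPORT)
listed_gib = sum(mem for _, mem, _ in rows)
idle_gib = sum(mem for _, mem, cmd in rows if cmd == "ray::IDLE")

# Node-level figures from the same error message.
used_gb, total_gb = 119.84, 124.38

print(f"listed processes: {listed_gib:.2f} GiB (ray::IDLE: {idle_gib:.2f} GiB)")
print(f"node usage: {used_gb} / {total_gb} GB = {used_gb / total_gb:.1%}")
print(f"not explained by listed processes: ~{used_gb - listed_gib:.1f}")
```

The listed processes add up to only about 6 GiB, so most of the roughly 120 GB in use is not accounted for by what the message prints. I can't tell from this output alone where the rest is going.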