Ray Writes to Disk Eventually Running Out of Space and Causing Eviction of the Node

Hello Everybody!

I have a problem with writing to disk. My workload does the following: it calls ray.put once on a pandas DataFrame and then launches 300 to 600 tasks, each of which performs a different calculation on that DataFrame and returns one row of results, so the combined return matrix has 300 to 600 rows. While the job is running, disk usage keeps growing as more tasks are launched, and this eventually causes eviction on the cluster when the node runs out of disk space. The problem is that there is already 100 GB of disk space available, which should be plenty for these operations, so something must be failing to delete unnecessary files. I've read that this can be caused by calling ray.put multiple times, but I only call it once per run.

Has anybody had anything similar?

It seems it's spilling too many objects, as the spill folder is the one that keeps piling up.

You can change the root temporary directory by passing `--temp-dir={your temp path}` to `ray start`, or `_temp_dir` in `ray.init()`.

Maybe try using `ray memory` to see where the memory is going. It sounds like objects may be referenced for longer than expected, causing excessive memory usage, or some return objects may be larger than expected.
