Root disk usage keeps increasing

vgill · March 31, 2023, 10:43pm

Hi folks! I have a long-running task across 20+ nodes. Root disk usage (the root volume is EBS-backed) keeps creeping up on all nodes: within 4 hours it reaches 60% of 1 TB. I have object spilling configured to a different partition (instance store), so the root disk is not being used by the object store. df reports the increased disk usage, but du fails to find anything significant. I notice a lot of deleted files that are still held open, and these may be the reason for the disk usage:
lsof | grep -i deleted
raylet 19250 ec2-user 11u REG 0,21 77116735496 3 /dev/shm/plasmayLGEPa (deleted)
raylet 19250 ec2-user 440u REG 202,1 10182762504 939524237 /tmp/ray/plasmatpsSwl (deleted)
raylet 19250 ec2-user 446u REG 202,1 10182762504 939524234 /tmp/ray/plasmawwKo0i (deleted)
raylet 19250 ec2-user 447u REG 202,1 10182758408 939524235 /tmp/ray/plasmaPD8qWO (deleted)
raylet 19250 ec2-user 448u REG 202,1 10182766600 939524236 /tmp/ray/plasma5x0BZk (deleted)
raylet 19250 ec2-user 462u REG 202,1 10181189640 939524238 /tmp/ray/plasmaoooTq6 (deleted)
If I run the task for 3 more hours, the machines will run out of disk space. This is pretty much a blocker for executing long-running tasks. Is this a bug in Ray? Is there any workaround?
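
To put a number on how much space those deleted-but-open files account for, the lsof output can be summed per directory. The following is a minimal sketch, not part of Ray: the helper name `deleted_bytes_by_dir` is hypothetical, and it assumes the nine-column lsof layout shown above (COMMAND, PID, USER, FD, TYPE, DEVICE, SIZE/OFF, NODE, NAME) followed by the `(deleted)` marker. Note that SIZE/OFF is the apparent file size, so for sparse files the space actually allocated on disk may be smaller.

```python
import posixpath
from collections import defaultdict

# Sample taken from the lsof output quoted in this post.
SAMPLE = """\
raylet 19250 ec2-user 11u REG 0,21 77116735496 3 /dev/shm/plasmayLGEPa (deleted)
raylet 19250 ec2-user 440u REG 202,1 10182762504 939524237 /tmp/ray/plasmatpsSwl (deleted)
raylet 19250 ec2-user 446u REG 202,1 10182762504 939524234 /tmp/ray/plasmawwKo0i (deleted)
raylet 19250 ec2-user 447u REG 202,1 10182758408 939524235 /tmp/ray/plasmaPD8qWO (deleted)
raylet 19250 ec2-user 448u REG 202,1 10182766600 939524236 /tmp/ray/plasma5x0BZk (deleted)
raylet 19250 ec2-user 462u REG 202,1 10181189640 939524238 /tmp/ray/plasmaoooTq6 (deleted)
"""


def deleted_bytes_by_dir(lsof_text):
    """Sum the SIZE/OFF column of deleted regular files, grouped by directory.

    Hypothetical helper; assumes the column layout of the sample above.
    """
    totals = defaultdict(int)
    for line in lsof_text.splitlines():
        fields = line.split()
        # Expect exactly the nine lsof columns plus the "(deleted)" marker.
        if len(fields) != 10 or fields[-1] != "(deleted)":
            continue
        # Only regular files with a numeric size are counted.
        if fields[4] != "REG" or not fields[6].isdigit():
            continue
        totals[posixpath.dirname(fields[8])] += int(fields[6])
    return dict(totals)


if __name__ == "__main__":
    for directory, size in sorted(deleted_bytes_by_dir(SAMPLE).items()):
        print(f"{directory}: {size / 1e9:.1f} GB")
```

On the sample above, the five files under /tmp/ray add up to roughly 51 GB of apparent size on the root volume, while the /dev/shm entry lives in shared memory rather than on the root disk.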

matthewdeng · April 2, 2023, 9:50pm

Hey @vgill, moving this to the Ray Core category as it seems related to object spilling.

How much object spilling is actually occurring? Is the number you are seeing proportional to the amount of data actually being spilled to disk?

Chen_Shen · April 4, 2023, 5:48pm

Adding to @matthewdeng's question: the Object Spilling page in the Ray 2.3.1 docs has more context on querying how much data has been spilled, and on the ways to configure the object store.
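
For reference, here is a minimal sketch of how the spilling directory is typically set, following the pattern shown in the Ray object spilling docs. The helper name `spill_system_config` and the path `/mnt/instance-store/ray_spill` are illustrative assumptions, not values from this thread, and the `ray.init` call is left as a comment so the snippet runs without a cluster.

```python
import json


def spill_system_config(directory_path):
    """Build the _system_config dict that points object spilling at a directory.

    Hypothetical helper; the dict shape follows the Ray object spilling docs.
    """
    return {
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": directory_path}}
        )
    }


# Assumed example path on an instance-store partition.
config = spill_system_config("/mnt/instance-store/ray_spill")

# On the head node this would be passed to Ray like so:
#   import ray
#   ray.init(_system_config=config)
```

Once the cluster is running, the `ray memory` command reports aggregate object store statistics, which is one way to check how much data has actually been spilled.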

vgill · April 4, 2023, 8:39pm

Hello Matthew! The disk usage seems to be related to the plasma object store fallback at /tmp/ray (see https://github.com/ray-project/ray/pull/16097). I have object spilling configured to a different partition, and all seems to be good there. Let me try to get more stats on the spilled objects using the link Chen mentioned. I have a basic question: once the job has succeeded and the cluster is sitting idle, is all of the disk space occupied by the plasma fallback eventually released, or not? And while jobs are running, is this fallback disk space recycled among different objects?
