Hello, I’m hoping someone can help me figure out a Ray/Docker disk space usage mystery.
I am running big jobs on a cluster, and the head node regularly consumes all available disk space on its host, causing the cluster to die.
I am trying to figure out what is consuming the space, but the numbers I am seeing don’t add up. When I first start the cluster, the disk utilization looks like this (see the last two lines with 28G used):
# df -h
Filesystem Size Used Avail Use% Mounted on
udev 252G 0 252G 0% /dev
tmpfs 51G 3.7M 51G 1% /run
tmpfs 252G 0 252G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 252G 0 252G 0% /sys/fs/cgroup
/dev/sda2 15G 234M 14G 2% /boot
/dev/sda1 188M 7.8M 180M 5% /boot/efi
tmpfs 51G 0 51G 0% /run/user/1000
/dev/sda4 147G 28G 114G 19% /
overlay 147G 28G 114G 19% /var/lib/docker/overlay2/cdced627a168d142a809968595fef92f39d036c6c0bb110e2f9bf418c2948012/merged
This roughly matches the size of the docker folder:
# du -sh /var/lib/docker
28G /var/lib/docker
After I run and complete a big job (no errors, all goes idle), I see this:
# df -h
Filesystem Size Used Avail Use% Mounted on
udev 252G 0 252G 0% /dev
tmpfs 51G 4.2M 51G 1% /run
tmpfs 252G 0 252G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 252G 0 252G 0% /sys/fs/cgroup
/dev/sda2 15G 234M 14G 2% /boot
/dev/sda1 188M 7.8M 180M 5% /boot/efi
tmpfs 51G 0 51G 0% /run/user/1000
tmpfs 51G 0 51G 0% /run/user/501
/dev/sda4 147G 112G 28G 81% /
overlay 147G 112G 28G 81% /var/lib/docker/overlay2/cdced627a168d142a809968595fef92f39d036c6c0bb110e2f9bf418c2948012/merged
root@node33:/var/lib/docker/overlay2# du -sh /var/lib/docker
28G /var/lib/docker
Eventually, after a few jobs, the host runs out of disk space completely and the cluster dies.
df reports 112G used, but /var/lib/docker still holds only 28G. I am trying to figure out where the other 84G went, how to contain it, and why it doesn’t come back after jobs complete.
I have tried running a full du -shc on the root filesystem, but that doesn’t show the 84G either.
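For anyone who wants to reproduce the comparison, here is a minimal sketch of how the gap between df and du can be measured. The function name unaccounted_kib is made up for illustration. It assumes a POSIX shell with df, du, and awk available, and it should be run as root so du can read every directory.

```shell
# Print how many KiB of a filesystem's used space are NOT visible to du.
# Usage: unaccounted_kib MOUNTPOINT DIR
# (hypothetical helper name; both values are taken in KiB)
unaccounted_kib() {
    # Used space as the kernel reports it for the filesystem holding $1.
    used=$(df -Pk "$1" | awk 'NR==2 {print $3}')
    # Space reachable by walking the tree at $2, without crossing
    # into other mounts (-x), so overlay and tmpfs mounts are skipped.
    tree=$(du -skx "$2" 2>/dev/null | cut -f1)
    echo $((used - tree))
}
```

Running unaccounted_kib / / on the head node after a job would give the unexplained amount directly. With the numbers above it should come out around 84G (112G used minus the 28G that du can see).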
If I ray down the cluster, all the space comes back immediately.
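One possible explanation that fits both symptoms (du cannot see the space, and it is released when the processes exit) is files that were deleted but are still held open by running processes. This is only a guess; nothing above confirms it. A rough way to check, assuming Linux /proc and GNU stat, and run as root so every process is visible:

```shell
# List deleted-but-still-open files, largest first: size in bytes, PID, path.
# list_deleted_open is a made-up helper name, not a Ray or Docker command.
list_deleted_open() {
    for fd in /proc/[0-9]*/fd/*; do
        # Each fd entry is a symlink; the kernel appends " (deleted)"
        # to the target when the file has been unlinked.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *" (deleted)")
                # -L follows the fd link, so this is the size of the open file.
                size=$(stat -Lc %s "$fd" 2>/dev/null) || continue
                pid=${fd#/proc/}; pid=${pid%%/*}
                printf '%s\t%s\t%s\n' "$size" "$pid" "$target"
                ;;
        esac
    done | sort -rn
}

list_deleted_open | head -n 20
```

A file opened by several processes appears once per process, so the sizes should not simply be summed. Entries under /dev/shm or memfd live in memory rather than on /dev/sda4 and can be ignored for this purpose.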
Any suggestions on what might be causing this or where I can look? Thank you!