Ray head node regularly using up all host disk space

Hello, I’m hoping someone can help me figure out a Ray/Docker disk-space mystery.

I am running big jobs on a cluster, and the head node regularly consumes all available disk space on its host, killing the cluster.

I am trying to figure out what is consuming the space, but the numbers I am seeing don’t add up. When I first start the cluster, disk utilization looks like this (note the last two lines, each showing 28G used):

# df -h
Filesystem                                              Size  Used Avail Use% Mounted on
udev                                                    252G     0  252G   0% /dev
tmpfs                                                    51G  3.7M   51G   1% /run
tmpfs                                                   252G     0  252G   0% /dev/shm
tmpfs                                                   5.0M     0  5.0M   0% /run/lock
tmpfs                                                   252G     0  252G   0% /sys/fs/cgroup
/dev/sda2                                                15G  234M   14G   2% /boot
/dev/sda1                                               188M  7.8M  180M   5% /boot/efi
tmpfs                                                    51G     0   51G   0% /run/user/1000
/dev/sda4                                               147G   28G  114G  19% /
overlay                                                 147G   28G  114G  19% /var/lib/docker/overlay2/cdced627a168d142a809968595fef92f39d036c6c0bb110e2f9bf418c2948012/merged

This roughly matches the size of the docker folder:

# du -sh /var/lib/docker
28G	/var/lib/docker

After a big job runs to completion (no errors, everything goes idle), I see this:

# df -h
Filesystem                                              Size  Used Avail Use% Mounted on
udev                                                    252G     0  252G   0% /dev
tmpfs                                                    51G  4.2M   51G   1% /run
tmpfs                                                   252G     0  252G   0% /dev/shm
tmpfs                                                   5.0M     0  5.0M   0% /run/lock
tmpfs                                                   252G     0  252G   0% /sys/fs/cgroup
/dev/sda2                                                15G  234M   14G   2% /boot
/dev/sda1                                               188M  7.8M  180M   5% /boot/efi
tmpfs                                                    51G     0   51G   0% /run/user/1000
tmpfs                                                    51G     0   51G   0% /run/user/501
/dev/sda4                                               147G  112G   28G  81% /
overlay                                                 147G  112G   28G  81% /var/lib/docker/overlay2/cdced627a168d142a809968595fef92f39d036c6c0bb110e2f9bf418c2948012/merged
# du -sh /var/lib/docker
28G	/var/lib/docker

Eventually, after a few more jobs, the host runs out of disk space entirely and the cluster dies.

df now shows 112G used, but the docker folder still holds only 28G. Where did the other 84G go, why doesn’t it come back after jobs complete, and how can I contain it?

I have tried running a full du -shc over the root filesystem, but that doesn’t account for the 84G either.
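One cause I haven’t been able to rule out is space held by files that were deleted while some process still had them open: df counts those blocks, but du can’t see them, which would produce exactly this kind of gap. A small self-contained demonstration of the effect (all paths here are just a scratch temp dir, nothing Ray-specific):

```shell
# Allocate 10 MB, keep a file descriptor open, then delete the file.
tmpdir=$(mktemp -d)
dd if=/dev/zero of="$tmpdir/big" bs=1M count=10 status=none
exec 3<"$tmpdir/big"   # hold the file open on descriptor 3
rm "$tmpdir/big"       # unlink it: du stops seeing it...
du -sh "$tmpdir"       # ...but the 10 MB stays allocated until fd 3 closes
exec 3<&-              # close the descriptor; the kernel frees the blocks
rmdir "$tmpdir"
```

If that is what’s happening here, something like sudo lsof -nP +L1 (files whose link count has dropped to zero) should list the processes still pinning the space, e.g. if Ray workers keep rotated log files or cleaned-up spill files open.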

If I run ray down on the cluster, all the space comes back immediately.
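For what it’s worth, the places I plan to check next, based on my understanding of Ray’s defaults (session files, logs, and spilled objects go under /tmp/ray; the plasma object store is backed by shared memory):

```shell
# Size of Ray's per-session scratch (logs, spilled objects); by default it
# lives under /tmp/ray. Guard against the directory not existing yet.
du -sh /tmp/ray/session_* 2>/dev/null || echo "no ray sessions found"

# The plasma object store is backed by shared memory, which shows up as the
# /dev/shm tmpfs line in df rather than under /var/lib/docker.
df -h /dev/shm
```

I’m running these on the host; if /tmp/ray is not bind-mounted out of the head-node container (I’d need docker inspect to confirm), the same check would have to run inside the container via docker exec instead.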

Any suggestions on what might be causing this or where I can look? Thank you!