Object spilling errors in cluster mode with NFS

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

Hi, we are running Ray on an on-premise cluster and are experiencing some access-denied errors related to object spilling. We are using an NFS share as the target storage, and we were wondering whether there is an issue with our configuration, or whether using NFS for object spilling is OK at all.

This is how the head node is set up:

ray start --head --node-ip-address 192.168.129.112 --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/usr/local/custom/nfs/ray/tmp/spill\"}}"}'

and worker nodes connect like this:

ray start --address='192.168.129.112:6379'

Our script is run from the head node with this init:

ray.init(address='auto', _node_ip_address='192.168.129.112')
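
For completeness, here is a minimal sketch (not part of our actual setup) of how the same spilling configuration can be passed through ray.init when Ray is started directly from Python instead of joining an existing cluster; building the nested JSON string with json.dumps avoids the quote escaping needed on the command line:

import json
import ray

# Hypothetical single-node equivalent of the head-node flag above.
spill_config = {
    "type": "filesystem",
    "params": {"directory_path": "/usr/local/custom/nfs/ray/tmp/spill"},
}

ray.init(
    _system_config={
        # Ray expects the spilling config as a JSON-encoded string.
        "object_spilling_config": json.dumps(spill_config),
    }
)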

The worker machines are diskless, so we set up a shared NFS mount at /usr/local/custom/nfs on all the machines. We also store the code and its dependencies there so that all the nodes can see them.

All the machines have the same user ID and group ID, and the NFS folder permissions appear to be correct.

What happens is that spilling seems to work correctly during execution, but after a while there are suddenly permission-denied errors from a worker trying to read/write the files in the spill folder. This eventually leads to a crash of the entire job.

Our current theory is that everything works fine until a worker node other than the one that generated the files tries to access them and fails for some unknown reason.
Another theory is that the spill directory was meant to be node-private storage rather than shared with all the other nodes, so sharing it creates file-name conflicts or multiple nodes writing to the same file at the same time.

So is using NFS this way correct?

Are there any ideas on what could be going wrong here?

Hi @Al12rs,

Another theory is that the spill directory was meant to be node-private storage rather than shared with all the other nodes, so sharing it creates file-name conflicts or multiple nodes writing to the same file at the same time.

This is probably the cause. Ray doesn’t expect the spilled files to be shared across multiple nodes, so most likely you hit the error when two worker nodes try to access the same file.

Hi, in the end we found out that some of the machines did not in fact share the same user ID and group ID as they were supposed to. After correcting that, everything worked correctly.

So for anyone stumbling on this in the future: NFS appears to work as a storage target for object spilling as long as all the nodes have read/write permission, for example by all being part of a group with the same GID.
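
For anyone who wants to double-check this on their own cluster, something like the following throwaway sketch (hypothetical, not what we actually ran) reports the UID/GID that Ray workers use on each node; you may need more tasks, or scheduling hints, to be sure every node is hit:

import os
import socket
import ray

ray.init(address="auto")

@ray.remote
def node_identity():
    # Report which host/user/group this Ray worker runs as.
    return socket.gethostname(), os.getuid(), os.getgid()

# Fire a batch of tasks so they spread across the cluster, then
# print the distinct (host, uid, gid) combinations that came back.
results = ray.get([node_identity.remote() for _ in range(50)])
for hostname, uid, gid in sorted(set(results)):
    print(hostname, uid, gid)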

As a downside, we found that even with spilling working correctly, Ray would still crash in memory-bound situations after killing too many workers. Often it wouldn’t even use spilling in those cases, even though the shared object store held around 6 GB that could have been spilled to free up RAM for the rest.

Just a note: spilling only applies to Ray objects (i.e. the values behind ObjectRefs), not to worker heap memory. If you are seeing workers being killed, it’s most likely because of the latter. We’re actively working on a separate solution for heap memory, and if this applies to you, you should check it out here!
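
To make the distinction concrete, here is a small illustrative sketch (an assumed example, not from this thread): only data placed in the object store, e.g. via ray.put or task return values, is eligible for spilling, while ordinary Python data built inside a task lives in that worker’s heap and cannot be spilled:

import numpy as np
import ray

ray.init()

@ray.remote
def heap_heavy():
    # This list lives in the worker process heap: it is not eligible
    # for spilling and can get the worker killed under memory pressure.
    data = [b"x" * 1024 for _ in range(100_000)]
    return len(data)

# This array goes into the shared object store via ray.put, so it can
# be spilled to disk when the object store fills up.
ref = ray.put(np.zeros((1000, 1000)))

print(ray.get(heap_heavy.remote()), ray.get(ref).shape)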