How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hi, we are running Ray on an on-premise cluster and we are experiencing some access-denied errors related to object spilling. We are using an NFS share as the target storage, and we were wondering if there is an issue with our configuration, or whether using an NFS for object spilling is OK at all.
This is how the head node is set up:
ray start --head --node-ip-address 192.168.129.112 --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/usr/local/custom/nfs/ray/tmp/spill\"}}"}'
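Since the `--system-config` value is JSON containing an embedded, escaped JSON string, it is easy to get the quoting wrong. As a sanity check (this is just how we verify the string, not part of the cluster setup), the value can be built and round-tripped with Python's `json` module:

```python
import json

# The spilling config Ray expects: a dict serialized to a string, nested
# inside the outer --system-config JSON document.
spill_params = {
    "type": "filesystem",
    "params": {"directory_path": "/usr/local/custom/nfs/ray/tmp/spill"},
}
system_config = json.dumps({"object_spilling_config": json.dumps(spill_params)})
print(system_config)

# Round-trip check: the inner string must parse back to the same dict.
inner = json.loads(json.loads(system_config)["object_spilling_config"])
assert inner == spill_params
```

The printed string matches what we pass to `ray start --head --system-config=...`, so at least the escaping is not the problem.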
and the worker nodes connect like this:
ray start --address='192.168.129.112:6379'
Our driver script runs on the head node with this init:
ray.init(address='auto', _node_ip_address='192.168.129.112')
The worker machines are diskless, so we set up a shared NFS mount at /usr/local/custom/nfs
on all the machines. The code and its dependencies also live there, so that all the nodes can see them.
All the machines use the same user ID and group ID, and the permissions on the NFS folder appear to be correct.
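To rule out a basic permission problem, we could probe the mount from each node with something like the following sketch (`check_spill_dir` and the probe-file naming are just illustrative, not part of our actual setup):

```python
import os
import socket


def check_spill_dir(path):
    """Create, read back, and delete a probe file in `path`.

    Returns True if this node can both write and read in the directory.
    """
    # Name the probe after the host and PID so concurrent probes from
    # different nodes cannot collide.
    probe = os.path.join(path, f".probe-{socket.gethostname()}-{os.getpid()}")
    with open(probe, "w") as f:
        f.write("ok")
    with open(probe) as f:
        ok = f.read() == "ok"
    os.remove(probe)
    return ok
```

Running this on every node against the spill directory would at least confirm that plain file I/O works under the shared UID/GID before Ray gets involved.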
What happens is that during execution spilling initially seems to work, but after a while we suddenly get permission-denied errors from workers trying to read or write files in the spill folder. This eventually crashes the entire job.
Our current theory is that everything works fine until a worker node different from the one that generated the files tries to access them, and fails for some unknown reason.
Another theory is that the spill directory is meant to be node-private storage rather than shared across nodes, so sharing it creates file-name conflicts, or multiple nodes end up writing to the same file at the same time.
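As a toy illustration of that second theory (purely hypothetical, the file name below is made up, not an actual Ray spill file name): if two nodes ever derived the same spill file name in a shared directory, the second writer would silently clobber the first, and the first node would later read back the wrong data.

```python
import os
import tempfile

# Simulate a spill directory shared by two nodes.
shared = tempfile.mkdtemp()
name = os.path.join(shared, "object-abc123")  # hypothetical spill file name

with open(name, "w") as f:   # "node A" spills its object
    f.write("data-from-node-A")
with open(name, "w") as f:   # "node B" spills under the same name
    f.write("data-from-node-B")

with open(name) as f:
    print(f.read())          # node A's data is gone, replaced by node B's
```

Whether Ray's spill file names can actually collide across nodes is exactly what we are unsure about, which is why we are asking.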
So, is using an NFS this way correct?
Does anyone have ideas on what could be going wrong here?