Running Ray on Local Cluster / File Sync Question

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am trying to set up a Ray cluster on bare metal in my lab. I got a Docker image working without much headache, but I am confused about how to properly set up a "cluster" across my machines.

Head node provisioning:

docker run --shm-size=12147483648 -t -i --gpus all --name dagray --network=host -v $(pwd)/code:/code dagutman/dagray

ray start --dashboard-ip=0.0.0.0 --head

Result: works great. I can also spin up a Jupyter notebook with jupyter notebook --ip=0.0.0.0

Worker node: run the same docker command on another node, then:
    ray start --address=headnode:6379

So far so good: via the Ray dashboard I can see that both nodes are talking to each other, the GPUs are detected properly, and so on.

Where I am getting confused:

I can add a bind mount such as /myNFS/code:/code so that every node sees a common file system.
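To make that concrete, here is a rough sketch of the variant I have in mind. It is the same docker command as above with only the bind mount changed. The function name ray_docker_cmd and the NFS_CODE_DIR / IMAGE variables are just placeholders for illustration, and it assumes /myNFS/code is mounted at the same path on every host. It prints the command rather than running it, so it can be checked first.

```shell
#!/bin/sh
# Sketch only: builds the docker run line with the NFS-backed directory bound to /code.
# Assumption: /myNFS/code is an NFS mount available at the same path on each host.
NFS_CODE_DIR="${NFS_CODE_DIR:-/myNFS/code}"
IMAGE="${IMAGE:-dagutman/dagray}"

# Hypothetical helper: print the command instead of executing it.
ray_docker_cmd() {
    printf 'docker run --shm-size=12147483648 -t -i --gpus all --name dagray --network=host -v %s:/code %s\n' \
        "$NFS_CODE_DIR" "$IMAGE"
}

ray_docker_cmd
```

Running the printed line on the head node and on each worker should give every container the same /code contents.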

However, in the Ray image I created, I noticed two things:
a) There are files installed in /home/ray (including anaconda3).
b) Even before I tried to start a job on the cluster, it pulled some small files into /home/ray/.cache.

So while I could set things up so that the /code directory is shared, I am hesitant to mount /home/ray into the container in the same way, since the default Ray Docker image also seems to put files there.

I just pulled the latest rayproject/ray-ml:latest-gpu, and there are numerous likely-important files in /home/ray in that container.
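For reference, this is roughly how I looked at what a bind mount onto /home/ray would hide (a bind mount over an existing directory masks whatever the image put there). The helper name list_shadowed is made up for this post, and it is meant to be run inside the container.

```shell
#!/bin/sh
# Sketch only: list the top-level entries of a directory that a bind mount
# onto that directory would hide. Defaults to /home/ray.
# Hypothetical helper name; run it inside the container.
list_shadowed() {
    target="${1:-/home/ray}"
    if [ ! -d "$target" ]; then
        echo "not a directory: $target" >&2
        return 1
    fi
    # -A includes dotfiles such as .cache but skips . and ..
    ls -A "$target"
}
```

Calling list_shadowed with no argument inside the container lists the entries under /home/ray, including anaconda3 and .cache in my image.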

I assume I am just missing something obvious here.