- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I am trying to set up a Ray cluster on bare metal in my lab. I have gotten a docker image working without much headache, but I am a bit confused about how to properly set up a “cluster” across my machines.
Head node provisioning:
docker run --shm-size=12147483648 -t -i --gpus all --name dagray --network=host -v $(pwd)/code:/code dagutman/dagray
ray start --dashboard-ip=0.0.0.0 --head
Result: works great; I can also spin up a Jupyter notebook with jupyter notebook --ip=0.0.0.0
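As a quick sanity check (not required, just how I have been verifying things), running ray status inside the head container at this point should list the single head node and its CPUs/GPUs:
ray status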
- Run the same docker command on another node, and
ray start --address=headnode:6379
So far so good: via the Ray dashboard I can see that both nodes are talking to each other, the GPUs are detected properly, etc.
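Just for completeness, another check I can run from inside the head container (the dashboard already shows the same information) is to ask Ray for the pooled resources, which should now include both nodes:
python -c "import ray; ray.init(address='auto'); print(ray.cluster_resources())"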
- Where I am getting confused:
I can add a bind mount, say /myNFS/code:/code, so that every node sees a common file system…
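Concretely, that would just mean changing the bind mount in the docker run command from above, e.g. (same flags as before, with /myNFS standing in for wherever the NFS export is mounted on each host):
docker run --shm-size=12147483648 -t -i --gpus all --name dagray --network=host -v /myNFS/code:/code dagutman/dagray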
However, in the ray image I created, I note two things:
a) there are files installed in /home/ray (including anaconda3)
b) when I tried to start a job on the cluster, it pulled some small files into /home/ray/.cache
So while I could set things up so that the /code directory is shared, I am hesitant to also mount the “/home/ray” directory in a similar fashion in the docker container. It seems like the default Ray docker image also puts things in there.
I just pulled the latest rayproject/ray-ml:latest-gpu and there are numerous likely-important files in the /home/ray directory in that container…
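(I was just eyeballing that with something along the lines of: docker run --rm rayproject/ray-ml:latest-gpu ls -la /home/ray )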
I am assuming I am just missing something kind of obvious here…