- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I am trying to set up a Ray cluster on bare metal in my lab. I have gotten a docker image working without much headache, but I am a bit confused about how to properly set up a “cluster” across my machines.
Head node provisioning:
docker run --shm-size=12147483648 -t -i --gpus all --name dagray --network=host -v $(pwd)/code:/code dagutman/dagray
ray start --dashboard-ip=0.0.0.0 --head
Result: works great; I can also spin up a Jupyter notebook with jupyter notebook --ip=0.0.0.0
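As a quick sanity check (not required, just how I have been verifying things), running ray status inside the head container at this point should list the single head node and its CPUs/GPUs:
ray status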
- Run the same docker command on another node, and
ray start --address=headnode:6379
So far so good: via the Ray dashboard I can see that both nodes are talking to each other, the GPUs are detected properly, etc.
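Just for completeness, another check I can run from inside the head container (the dashboard already shows the same information) is to ask Ray for the pooled resources, which should now include both nodes:
python -c "import ray; ray.init(address='auto'); print(ray.cluster_resources())"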
- Where I am getting confused:
I can add a bind mount, say /myNFS/code:/code, so that every node sees a common file system…
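Concretely, that would just mean changing the bind mount in the docker run command from above, e.g. (same flags as before, with /myNFS standing in for wherever the NFS export is mounted on each host):
docker run --shm-size=12147483648 -t -i --gpus all --name dagray --network=host -v /myNFS/code:/code dagutman/dagray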
However, in the ray image I created, I note two things:
a) there are files installed in /home/ray (including anaconda3)
b) when I tried to start a job on the cluster, it pulled some small files into /home/ray/.cache
So while I could set things up so that the /code directory is shared, I am hesitant to also mount the “/home/ray” directory in a similar fashion in the docker container. It seems like the default Ray docker image also puts things in there.
I just pulled the latest rayproject/ray-ml:latest-gpu and there are numerous likely-important files in the /home/ray directory in that container…
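(I was just eyeballing that with something along the lines of: docker run --rm rayproject/ray-ml:latest-gpu ls -la /home/ray )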
I am assuming I am just missing something kind of obvious here…