How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
Hi, I’m trying to set up a virtual Ray cluster to learn and demonstrate various features of Ray. I could do this in stand-alone mode or set up actual virtual machines, but using Docker would be so much more convenient (if it worked).
Here is what I’m doing:
I have a very simple docker image. I just installed a couple of utilities to make debugging easier:
FROM continuumio/miniconda3
RUN apt update && apt install -y iputils-ping iproute2
RUN pip install "ray[all]"
I create my own network, but please note that I have also tried this without creating the network:
docker network create simulated-cluster
I start the head node via this:
docker run \
-dit \
--network simulated-cluster \
-p 6379:6379 -p 8265:8265 -p 10001:10001 -p 10002:10002 \
--name ray-head \
test-ray-image \
ray start --head --node-ip-address=0.0.0.0 --dashboard-host=0.0.0.0 --disable-usage-stats --block
I confirm that this works and get its internal IP, which is always 172.18.0.2
I then start worker nodes:
!docker run -dit --network simulated-cluster --name ray-worker1 test-ray-image ray start --address=ray-head:6379 --block
!docker run -dit --network simulated-cluster --name ray-worker2 test-ray-image ray start --address=ray-head:6379 --block
!docker run -dit --network simulated-cluster --name ray-worker3 test-ray-image ray start --address=ray-head:6379 --block
If I’m not using the simulated network, the worker nodes point to the IP I retrieved via ip addr.
Logs seem to show that worker nodes are running ok as well. What’s more, I can access the dashboard via http://localhost:8265!
It seems to me that the network is up! I then tried to run some computation on it. Note that the next bit of code is run on the host machine, not inside any of the Docker containers:
import ray

ray.init("ray://localhost:6379")

@ray.remote
def test():
    return "Hello from Ray!"
For the host name, I have tried localhost, the IP from ip addr, the port 10001, and countless other permutations, but I just can’t get it to work. Sometimes this code just sits there forever; in this case, I have been getting connection timeouts.
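To narrow down where the timeouts come from, a plain TCP check from the host against the ports published by the head container may help. This is only a rough sketch: the helper name port_is_open is made up for illustration and is not part of Ray, and a successful connect only shows that something on the host side accepts the connection, not that Ray is answering behind it.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it is not part of Ray.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Ports published with -p in the docker run command for ray-head.
    for port in (6379, 8265, 10001, 10002):
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"localhost:{port} is {state}")
```

If a port shows as not reachable here, the problem is probably at the Docker port-mapping level rather than in the ray.init call.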
Any ideas what I could be doing wrong?