Silent Connection Failure (Ray on Docker)

I’m trying to follow the manual Ray cluster setup instructions, but with no luck. I’d appreciate any help here! Due to the security regime I cannot use ray up or the Kubernetes approach.

Each node is a separate AWS EC2 instance, where the security group has been configured to allow incoming traffic on port 6379 and all outgoing traffic.

Head Node:

(base) [root@1eb0c9fea2e0 beta]# ray start --head --port=6379
Local node IP: 172.17.0.2
2021-02-04 20:27:43,916 INFO services.py:1171 -- View the Ray dashboard at http://localhost:8265
2021-02-04 20:27:43,918 WARNING services.py:1632 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='172.17.0.2:6379' --redis-password='5241590000000000'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop

Worker Node:

(base) [root@e6be84eb8fae /]# ray start --address="10.251.66.9:6379" --redis-password='5241590000000000'
Local node IP: 172.17.0.2
2021-02-04 20:34:39,647 WARNING services.py:1632 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.

--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop

(base) [root@e6be84eb8fae /]# ray timeline
Traceback (most recent call last):
  File "/opt/conda/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1504, in main
    return cli()
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1346, in timeline
    address = services.get_ray_address_to_use_or_die()
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/services.py", line 221, in get_ray_address_to_use_or_die
    return find_redis_address_or_die()
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/services.py", line 233, in find_redis_address_or_die
    raise ConnectionError(
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting `address`.

The worker node claims to have connected to the head, but in reality it has not. Note that in the worker’s ray start command above, the address has been changed to the head node’s actual private IP (rather than the Docker-internal 172.17.0.2 printed by the head).

Any tips?

BR,
Ryan

Can you run a driver on the head node and run the following?

import ray
ray.init(address='auto')
print(ray.nodes())

Solved! I disabled firewalld and also ran the Docker containers in host network mode (e.g. docker run --net=host). The docs helped: Using Ray and Docker on a Cluster (EXPERIMENTAL) — Ray 0.01 documentation.
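
For reference, here is roughly what I ran on each EC2 host. Treat it as a sketch rather than exact copy-paste: the image name is my own custom Ray image and <head-private-ip> is a placeholder for the head instance’s private address.

# On CentOS/RHEL hosts, stop firewalld so it cannot block Ray's ports
systemctl stop firewalld
systemctl disable firewalld

# Run the Ray container on the host network so it shares the host's IP and ports
docker run -d -t --name xray --net=host --shm-size=4G docker_repo/ray_app:latest

# On the head node, start the Ray head inside the container
docker exec -it xray ray start --head --port=6379

# On each worker node, point Ray at the head's private IP and the password it printed
docker exec -it xray ray start --address='<head-private-ip>:6379' --redis-password='5241590000000000'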

(base) [root@... beta]# python
Python 3.8.3 (default, May 19 2020, 18:47:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import ray

>>> ray.init(address='auto')
2021-02-05 13:23:03,185 INFO worker.py:656 -- Connecting to existing Ray cluster at address: ...:6379
{'node_ip_address': 'redacted', 'raylet_ip_address': 'redacted', 'redis_address': 'redacted', 'object_store_address': '/tmp/ray/session_2021-02-05_13-20-31_075554_59/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-02-05_13-20-31_075554_59/sockets/raylet', 'webui_url': 'localhost:8265', 'session_dir': '/tmp/ray/session_2021-02-05_13-20-31_075554_59', 'metrics_export_port': 51294, 'node_id': '842803b921f30442ab20899547aac005a71db281'}

>>> ray.nodes()
[{'NodeID': '6d58398ad5f8882e80f4295701785d5518803bd5', 'Alive': True, 'NodeManagerAddress': 'redacted', 'NodeManagerHostname': 'redacted', 'NodeManagerPort': 61801, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-05_13-20-31_075554_59/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-05_13-20-31_075554_59/sockets/raylet', 'MetricsExportPort': 62247, 'alive': True, 'Resources': {'node:redacted': 1.0, 'memory': 213.0, 'object_store_memory': 63.0, 'CPU': 8.0}}, {'NodeID': '842803b921f30442ab20899547aac005a71db281', 'Alive': True, 'NodeManagerAddress': 'redacted', 'NodeManagerHostname': 'redacted', 'NodeManagerPort': 47287, 'ObjectManagerPort': 8076, 'ObjectStoreSocketName': '/tmp/ray/session_2021-02-05_13-20-31_075554_59/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-02-05_13-20-31_075554_59/sockets/raylet', 'MetricsExportPort': 46034, 'alive': True, 'Resources': {'memory': 183.0, 'node:10.251.66.9': 1.0, 'CPU': 8.0, 'object_store_memory': 63.0}}]

Awesome! Happy that it was solved! It is interesting that you were looking at a really, really old document, haha. Is there anything that we are missing from this doc (Installing Ray — Ray v1.1.0)? @ijrsvt We should probably consider adding it.

I would consider adding the following, given that the user is interested in running Ray in cluster mode as Docker containers on hosts, such that each host (e.g. an AWS EC2 instance) runs a single Ray Docker container.

The ports used must be forwarded from the host into the Docker container, e.g. docker run -p 6379:6379 -p 8076:8076 ... (a rough sketch of this approach is given below). Failing that, the Docker image can be run in host network mode, e.g. docker run --net=host ..., such that the Docker container has the same IP address and ports as the host machine. I also tend to run in detached mode, so docker run -d -t .... For my Ray Docker image, which is a custom CentOS7 Miniconda image (similar Debian image here), the winning command is similar to:
docker run -d -t --name xray --net=host --shm-size=4G docker_repo/ray_app:latest
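
If host networking is not an option, the port-mapping alternative could look roughly like the following. This is only a sketch: the node-manager port (8077 here) is an arbitrary example, <host-private-ip> is a placeholder, the flag names are per the Ray 1.x ray start --help (double-check for your version), and additional ports (worker port range, dashboard, etc.) may also need to be published depending on your setup.

# Publish the Ray ports from the host into the container
docker run -d -t --name xray \
    -p 6379:6379 -p 8076:8076 -p 8077:8077 \
    --shm-size=4G docker_repo/ray_app:latest

# Inside the container, pin Ray to the same ports that were published;
# --node-ip-address may also be needed so Ray advertises the host's
# address rather than the container-internal one (e.g. 172.17.0.2)
docker exec -it xray ray start --head --port=6379 \
    --object-manager-port=8076 \
    --node-manager-port=8077 \
    --node-ip-address=<host-private-ip>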

It would also be good to suggest that a custom Dockerfile should expose the Ray ports, and to list all critical Ray ports (e.g. 6379, 8076; are there others?).
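
As a quick way to build such a list, one can inspect which ports a running Ray head actually listens on inside the container (this assumes the container is named xray as above and that ss or netstat is available in the image):

# Show TCP listeners inside the Ray container (redis-server, raylet, gcs_server, dashboard, etc.)
docker exec xray ss -ltnp
# or, if ss is not installed in the image:
docker exec xray netstat -ltnp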

If on AWS or another cloud service, the user should confirm that the security group allows incoming and outgoing TCP traffic on critical Ray ports.
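
On AWS this can be checked or added with the AWS CLI, roughly as below (the group ID and CIDR are placeholders; scoping the rule to the cluster’s own subnet or security group is preferable to 0.0.0.0/0):

# Allow inbound TCP on the Ray cluster port from the cluster's subnet
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 6379 \
    --cidr 10.251.66.0/24
# Repeat for the other Ray ports in use (e.g. 8076)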

And finally, when troubleshooting on a CentOS or RHEL image, it’s worthwhile to check whether firewalld is blocking Ray.
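
For the firewalld check, a rough sketch (run on the CentOS/RHEL host; either open the Ray ports, or, if your security policy allows it, disable firewalld entirely as I did):

# Inspect the current firewalld rules
firewall-cmd --list-all

# Option 1: open the Ray ports
firewall-cmd --permanent --add-port=6379/tcp
firewall-cmd --permanent --add-port=8076/tcp
firewall-cmd --reload

# Option 2: disable firewalld entirely
systemctl stop firewalld
systemctl disable firewalld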

Cheers!

Edit: the aforementioned suggestions work for Docker 19.03.

Hey @ijrsvt, here’s great feedback regarding how we should improve the Docker setup document. I have personally seen lots of people try to figure this out and fail. Let me post a GitHub issue here.

Hey @sangcho, I’d like to comment that running Docker containers in host mode is not an ideal solution. In host mode, only one Docker container is allowed to run on each host. For example, it prevents me from deploying multiple Ray Docker containers running on separate ports (e.g. 2 different Docker containers running on 3 EC2 instances, for a total of 6 containers). Such a setup may be desirable when each container is an app that can be called (perhaps as an API) and each app requires different, conflicting libraries (e.g. llvmlite 0.35 vs 0.36), but I don’t want to create two separate EC2 clusters to host the two apps.

Is there a better solution? Can we tell Ray to listen on all network interfaces (e.g. listen on 0.0.0.0), such that I need not run the Ray Docker containers in host mode?

@ryan-chien What do you mean by it not being possible to run multiple containers in host mode? I think it should be fine if you are using separate Redis addresses or different clusters?

Yep, I was wrong here: it is possible to run multiple containers on the same host in host mode. Sorry for the bad info!

So, the action item here is probably just improving the documentation, right?

Yes, that would be great.


Hey @ijrsvt, will you have some time to update the doc (if you are not too busy)?