Ray in Docker: init fails because the network address does not exist

Greetings.

I was trying to play with Ray in a docker-swarm network.

I build the Ray head image from the following Dockerfile:

FROM python:3.6.9

RUN pip3 install ray[default]==1.3.0

RUN mkdir /opt/ray
WORKDIR /opt/ray

COPY entrypoint.sh ./

EXPOSE 28501
EXPOSE 28502
EXPOSE 28503
EXPOSE 28504
EXPOSE 28505
EXPOSE 28506

CMD bash /opt/ray/entrypoint.sh

With the entrypoint being:

ray start --block --head --node-ip-address="ray" --webui-host="ray" --port 28501 --redis-shard-ports 28502 --gcs-server-port 28503 --dashboard-port 28504 --node-manager-port 28505 --object-manager-port 28506 --redis-password="5241590000000000"

The ray hostname is set up in my docker-compose file as an alias on the customnet Docker network, which is itself an overlay swarm network:

  ray-head:
    image: ray-head
    ports:
      - 28501:28501
      - 28502:28502
      - 28503:28503
      - 28504:28504
      - 28505:28505
      - 28506:28506
    networks:
      customnet:
        aliases:
          - ray
    deploy:
      placement:
        constraints: [node.labels.ray == true]
      mode: replicated
      replicas: 1
      restart_policy:
        condition: any
        delay: 8s

====================================================

After the whole stack was up on a set of machines running Ubuntu 20.04.1 LTS, I attached to my ray-head container and tried running the following in the Python shell:

>>> import ray
>>> ray.init(address="ray:28501", _redis_password='5241590000000000')

It first shows

2021-06-01 17:00:59,082	INFO worker.py:641 -- Connecting to existing Ray cluster at address: 10.103.0.10:28501
{'node_ip_address': '172.18.0.13', 'raylet_ip_address': '172.18.0.13', 'redis_address': '10.103.0.10:28501', 'object_store_address': '/tmp/ray/session_2021-06-01_16-53-14_510667_8/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-06-01_16-53-14_510667_8/sockets/raylet', 'webui_url': '127.0.0.1:28504', 'session_dir': '/tmp/ray/session_2021-06-01_16-53-14_510667_8', 'metrics_export_port': 54641, 'node_id': '213ad33f80f916d30f664ca12c9c9fc448a3ed737b035790132a2937'}

Then this error is printed to the screen repeatedly:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 1062, in create_server
    sock.bind(sa)
OSError: [Errno 99] Cannot assign requested address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/ray/new_dashboard/agent.py", line 326, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/site-packages/ray/new_dashboard/agent.py", line 161, in run
    await site.start()
  File "/usr/local/lib/python3.6/site-packages/aiohttp/web_runner.py", line 128, in start
    reuse_port=self._reuse_port,
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 1066, in create_server
    % (sa, err.strerror.lower()))
OSError: [Errno 99] error while attempting to bind on address ('10.103.0.10', 0): cannot assign requested address

Weirdly, 10.103.0.10 is not the right address. It is actually supposed to be 10.103.0.11, as shown by sudo docker network inspect customnet (certain elements are hidden because I don’t know what they are):

            "*********": {
                "Name": "ray-head.1.euqdmrkmseg27t4z0e9m4zizq",
                "EndpointID": "*********",
                "MacAddress": "*********",
                "IPv4Address": "10.103.0.11/16",
                "IPv6Address": ""
            },

Also, sudo docker network inspect customnet | grep 10.103.0.10 returns nothing.
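
On Linux, Errno 99 (EADDRNOTAVAIL) from bind means the requested address is not assigned to any interface on the machine doing the bind. The traceback shows the dashboard agent trying to bind to 10.103.0.10, so one way to narrow this down is to test which addresses the container can actually bind to. The following is a minimal sketch using only the standard library; the helper name can_bind is hypothetical and not part of Ray.

```python
import errno
import socket


def can_bind(ip: str) -> bool:
    """Return True if a TCP socket can bind to `ip` on an ephemeral port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, 0))  # port 0 lets the OS pick any free port
            return True
        except OSError as exc:
            # EADDRNOTAVAIL is the "Cannot assign requested address" case.
            if exc.errno == errno.EADDRNOTAVAIL:
                return False
            raise


if __name__ == "__main__":
    # Addresses taken from the logs above; run this inside the ray-head container.
    for ip in ("127.0.0.1", "10.103.0.10", "10.103.0.11"):
        status = "bindable" if can_bind(ip) else "not assigned to this host"
        print(ip, status)
```

If 10.103.0.10 is reported as not assigned while 10.103.0.11 is bindable, that would be consistent with 10.103.0.10 being an address that routes to the container without belonging to it.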

However, from inside the container, 10.103.0.10, 10.103.0.11, and ray all seem to reach the Redis port correctly:

root@e0a1be16da2a:/opt/ray# curl -X GET 10.103.0.10:28501
-ERR wrong number of arguments for 'get' command
-NOAUTH Authentication required.
-ERR unknown command `User-Agent:`, with args beginning with: `curl/7.64.0`,
-ERR unknown command `Accept:`, with args beginning with: `*/*`,
^C
root@e0a1be16da2a:/opt/ray# ^C
root@e0a1be16da2a:/opt/ray# curl -X GET 10.103.0.11:28501
-ERR wrong number of arguments for 'get' command
-NOAUTH Authentication required.
-ERR unknown command `User-Agent:`, with args beginning with: `curl/7.64.0`,
-ERR unknown command `Accept:`, with args beginning with: `*/*`,
^C
root@e0a1be16da2a:/opt/ray# ^C
root@e0a1be16da2a:/opt/ray# curl -X GET ray:28501
-ERR wrong number of arguments for 'get' command
-NOAUTH Authentication required.
-ERR unknown command `User-Agent:`, with args beginning with: `curl/7.64.0`,
-ERR unknown command `Accept:`, with args beginning with: `*/*`,
^C
root@e0a1be16da2a:/opt/ray# e

====================================================

I also tried to access this container from another container, the one where I intend to run my Ray driver code.

This container is derived from tensorflow/tensorflow:2.2.1, with the ray library and some other dependencies installed manually, but it does not have a Ray head running.

Here’s how I tried the access:

>>> import ray
>>> ray.init(address="ray:28501", _redis_password='5241590000000000')

It failed with:

2021-06-01 17:12:12,932 INFO worker.py:641 -- Connecting to existing Ray cluster at address: 10.103.0.10:2850
Aborted (core dumped)

Oh, also: if I do not connect to ray or 10.103.0.10 but explicitly to 10.103.0.11, I get this:

>>> ray.init(address="10.103.0.11:28501", _redis_password='5241590000000000')
2021-06-01 18:00:43,188	INFO worker.py:641 -- Connecting to existing Ray cluster at address: 10.103.0.11:28501
2021-06-01 18:00:43,195	WARNING services.py:311 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2021-06-01 18:00:44,201	WARNING services.py:311 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2021-06-01 18:00:45,208	WARNING services.py:311 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2021-06-01 18:00:46,214	WARNING services.py:311 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2021-06-01 18:00:47,220	WARNING services.py:311 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/ray/worker.py", line 741, in init
    connect_only=True)
  File "/usr/local/lib/python3.6/site-packages/ray/node.py", line 190, in __init__
    redis_password=self.redis_password))
  File "/usr/local/lib/python3.6/site-packages/ray/_private/services.py", line 305, in get_address_info_from_redis
    redis_address, node_ip_address, redis_password=redis_password)
  File "/usr/local/lib/python3.6/site-packages/ray/_private/services.py", line 279, in get_address_info_from_redis_helper
    f"This node has an IP address of {node_ip_address}, and Ray "
RuntimeError: This node has an IP address of 172.18.0.13, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.103.0.11:28501 and found raylets at 10.103.0.10 but none of these match this node's IP 172.18.0.13. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node
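
The error above reports this node's IP as 172.18.0.13, which is on neither 10.103.0.x address. A container attached to several Docker networks has several addresses, and a common way to guess "the" local IP is to ask the OS which source address it would use for outbound traffic. The sketch below shows that technique with the standard library. It is an assumption, not something confirmed in this thread, that Ray 1.3.0 picks its default node IP in a similar way; the function name outbound_ip is hypothetical.

```python
import socket


def outbound_ip(probe=("8.8.8.8", 53)) -> str:
    """Return the source IPv4 address the OS would use to reach `probe`.

    Connecting a UDP socket sends no packets; it only asks the kernel to
    choose a route, which fixes the local address of the socket.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.connect(probe)
            return sock.getsockname()[0]
        except OSError:
            # No usable route (for example, no network): fall back to loopback.
            return "127.0.0.1"


if __name__ == "__main__":
    print("outbound address:", outbound_ip())
```

If this prints 172.18.0.13 inside the container, the default route leaves through a network other than the overlay, which would explain why that address shows up instead of 10.103.0.11.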

I think I figured out the cause.

Instead of using

--node-ip-address="ray"

use

--node-ip-address="tasks.ray"

I’m not exactly sure where this difference comes from.
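
One possible explanation, not confirmed in this thread: in Docker swarm's default VIP endpoint mode, a service name resolves to a virtual IP that load-balances to the service's tasks, while tasks.<service-name> resolves to the IPs of the task containers themselves. That would make 10.103.0.10 the virtual IP and 10.103.0.11 the container's own address. Whether the network alias ray follows the same rule as the service name is an assumption here. A minimal sketch to compare what each name resolves to from inside a container (the helper resolve_ipv4 is hypothetical):

```python
import socket


def resolve_ipv4(hostname: str) -> list:
    """Return the sorted, de-duplicated IPv4 addresses for `hostname`."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, socket.AF_INET, socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []  # the name does not resolve
    # Each entry is (family, type, proto, canonname, (ip, port)).
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Names taken from the compose file and the workaround above.
    for name in ("ray", "tasks.ray"):
        print(name, "->", resolve_ipv4(name))
```

If ray prints 10.103.0.10 and tasks.ray prints 10.103.0.11, the workaround makes sense: --node-ip-address needs an address the container can bind to, and only the task IP qualifies.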

Is there an expert who can chime in on the best way to use docker compose and networks to start up an ecosystem of ray workers and head?