Greetings.
I have been experimenting with Ray in a Docker Swarm network.
I build the Ray head image from the following Dockerfile:
FROM python:3.6.9
RUN pip3 install ray[default]==1.3.0
RUN mkdir /opt/ray
WORKDIR /opt/ray
COPY entrypoint.sh ./
EXPOSE 28501
EXPOSE 28502
EXPOSE 28503
EXPOSE 28504
EXPOSE 28505
EXPOSE 28506
CMD bash /opt/ray/entrypoint.sh
With the entrypoint being:
ray start --block --head --node-ip-address="ray" --webui-host="ray" --port 28501 --redis-shard-ports 28502 --gcs-server-port 28503 --dashboard-port 28504 --node-manager-port 28505 --object-manager-port 28506 --redis-password="5241590000000000"
The ray address is set up in my docker-compose file as an alias on the customnet Docker network, which is itself an overlay swarm network:
ray-head:
  image: ray-head
  ports:
    - 28501:28501
    - 28502:28502
    - 28503:28503
    - 28504:28504
    - 28505:28505
    - 28506:28506
  networks:
    customnet:
      aliases:
        - ray
  deploy:
    placement:
      constraints: [node.labels.ray == true]
    mode: replicated
    replicas: 1
    restart_policy:
      condition: any
      delay: 8s
====================================================
After the whole stack is up on a set of machines running Ubuntu 20.04.1 LTS, I attached to my ray-head container and tried running the following inside the Python shell:
>>> import ray
>>> ray.init(address="ray:28501", _redis_password='5241590000000000')
It first shows
2021-06-01 17:00:59,082 INFO worker.py:641 -- Connecting to existing Ray cluster at address: 10.103.0.10:28501
{'node_ip_address': '172.18.0.13', 'raylet_ip_address': '172.18.0.13', 'redis_address': '10.103.0.10:28501', 'object_store_address': '/tmp/ray/session_2021-06-01_16-53-14_510667_8/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-06-01_16-53-14_510667_8/sockets/raylet', 'webui_url': '127.0.0.1:28504', 'session_dir': '/tmp/ray/session_2021-06-01_16-53-14_510667_8', 'metrics_export_port': 54641, 'node_id': '213ad33f80f916d30f664ca12c9c9fc448a3ed737b035790132a2937'}
Then I get this error printed to the screen continuously:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 1062, in create_server
    sock.bind(sa)
OSError: [Errno 99] Cannot assign requested address
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/ray/new_dashboard/agent.py", line 326, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/site-packages/ray/new_dashboard/agent.py", line 161, in run
    await site.start()
  File "/usr/local/lib/python3.6/site-packages/aiohttp/web_runner.py", line 128, in start
    reuse_port=self._reuse_port,
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 1066, in create_server
    % (sa, err.strerror.lower()))
OSError: [Errno 99] error while attempting to bind on address ('10.103.0.10', 0): cannot assign requested address
Weirdly, 10.103.0.10 is not the right address. It is actually supposed to be 10.103.0.11, as shown by sudo docker network inspect customnet (certain elements are hidden because I don’t know what they are):
"*********": {
    "Name": "ray-head.1.euqdmrkmseg27t4z0e9m4zizq",
    "EndpointID": "*********",
    "MacAddress": "*********",
    "IPv4Address": "10.103.0.11/16",
    "IPv6Address": ""
},
Also, the output of sudo docker network inspect customnet | grep 10.103.0.10 is empty.
Yet within the container, 10.103.0.10, 10.103.0.11, and ray all seem to reach the Redis port correctly:
root@e0a1be16da2a:/opt/ray# curl -X GET 10.103.0.10:28501
-ERR wrong number of arguments for 'get' command
-NOAUTH Authentication required.
-ERR unknown command `User-Agent:`, with args beginning with: `curl/7.64.0`,
-ERR unknown command `Accept:`, with args beginning with: `*/*`,
^C
root@e0a1be16da2a:/opt/ray# ^C
root@e0a1be16da2a:/opt/ray# curl -X GET 10.103.0.11:28501
-ERR wrong number of arguments for 'get' command
-NOAUTH Authentication required.
-ERR unknown command `User-Agent:`, with args beginning with: `curl/7.64.0`,
-ERR unknown command `Accept:`, with args beginning with: `*/*`,
^C
root@e0a1be16da2a:/opt/ray# ^C
root@e0a1be16da2a:/opt/ray# curl -X GET ray:28501
-ERR wrong number of arguments for 'get' command
-NOAUTH Authentication required.
-ERR unknown command `User-Agent:`, with args beginning with: `curl/7.64.0`,
-ERR unknown command `Accept:`, with args beginning with: `*/*`,
^C
root@e0a1be16da2a:/opt/ray# e
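Errno 99 (EADDRNOTAVAIL) means a process tried to bind to an address that is not assigned to any interface in the container, so an address can accept connections (as the curl output shows) without being bindable locally. Below is a minimal standard-library sketch for checking this from inside the ray-head container. The helper names resolve_ipv4, can_bind and report are made up for illustration and are not part of Ray.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses that hostname resolves to ([] on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (ip, port)).
    return sorted({info[4][0] for info in infos})


def can_bind(ip, port=0):
    """Return True if a TCP socket can bind to (ip, port) in this container.

    Port 0 asks the OS for any free port, which matches the failing
    bind on ('10.103.0.10', 0) in the traceback.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, port))
            return True
        except OSError:
            return False


def report(hostname):
    """Map every IPv4 address behind hostname to whether it is bindable here."""
    return {ip: can_bind(ip) for ip in resolve_ipv4(hostname)}
```

Calling report("ray") inside the ray-head container would show whether the address behind the ray alias can be bound locally. If it comes back False while can_bind("10.103.0.11") returns True, the alias probably points at an address that is not the container's own, for example a swarm service virtual IP. That last part is an assumption on my side; nothing in the logs confirms it.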
====================================================
I also tried to access this container from another container, which is where I intend to run my Ray driver code.
This container is derived from tensorflow/tensorflow:2.2.1, with the ray library and some other dependencies installed manually, but it does not have a Ray head running.
Here’s how I tried to connect:
>>> import ray
>>> ray.init(address="ray:28501", _redis_password='5241590000000000')
It failed while showing:
2021-06-01 17:12:12,932 INFO worker.py:641 -- Connecting to existing Ray cluster at address: 10.103.0.10:2850
Aborted (core dumped)
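To rule out plain connectivity problems from the driver container, here is a minimal sketch that checks which of the ports passed to ray start accept a TCP connection. The function name reachable_ports and the RAY_PORTS list are hypothetical helpers for this post, not part of Ray. A successful connection only shows that something is listening on the port; it says nothing about whether the Ray handshake itself would succeed.

```python
import socket

# Ports taken from the ray start command earlier in this post.
RAY_PORTS = [28501, 28502, 28503, 28504, 28505, 28506]


def reachable_ports(host, ports, timeout=2.0):
    """Return the subset of ports on host that accept a TCP connection."""
    reachable = []
    for port in ports:
        try:
            # create_connection resolves the host name and connects;
            # resolution errors, refusals and timeouts are all OSError.
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            pass
    return reachable
```

From the driver container this would be called as reachable_ports("ray", RAY_PORTS), and the result can be compared with the same call made inside the ray-head container.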