How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I’m trying to set up Ray in a Docker container, but I can’t connect to it when it’s running inside the container.
First, I start a Redis-compatible server (Valkey) like this:
❯ docker run -d --name redis-server -p 6379:6379 valkey/valkey:7.2
bf911dd78602ac421a39a5150903811d2e92e068d70700f8e7c58e012371a670
❯ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bf911dd78602 valkey/valkey:7.2 "docker-entrypoint.s…" 6 seconds ago Up 5 seconds 0.0.0.0:6379->6379/tcp redis-server
When I run:
ray start --head --verbose --node-ip-address=0.0.0.0 --disable-usage-stats --port=6023 --include-dashboard=True --dashboard-host=0.0.0.0 --dashboard-port=8266 --ray-client-server-port=10001 --resources='{"num-cpus": 0, "num-gpus": 0}' --block
and I try to connect to it, I get:
❯ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:34:54) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(address="127.0.0.1:6023", logging_level="debug")
2024-05-23 07:52:20,589 DEBUG worker.py:1491 -- Automatically increasing RLIMIT_NOFILE to max value of 9223372036854775807
2024-05-23 07:52:20,590 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 127.0.0.1:6023...
2024-05-23 07:52:20,603 INFO worker.py:1740 -- Connected to Ray cluster. View the dashboard at http://0.0.0.0:8266
RayContext(dashboard_url='0.0.0.0:8266', python_version='3.11.9', ray_version='2.20.0', ray_commit='5708e75978413e46c703e44f43fd89769f3c148b')
Everything works fine.
However, if I run the same command inside a Docker container like this:
❯ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bf911dd78602 valkey/valkey:7.2 "docker-entrypoint.s…" 6 minutes ago Up 6 minutes 0.0.0.0:6379->6379/tcp redis-server
❯ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' bf911dd78602
172.17.0.2
❯ docker run -d --cpus="4" --memory="2G" --memory-reservation="1G" --shm-size="4gb" -e RAY_REDIS_ADDRESS=172.17.0.2:6379 -p 6023:6023 -p 8266:8266 -p 10001:10001 ray_head ray start --head --verbose --node-ip-address=0.0.0.0 --disable-usage-stats --port=6023 --include-dashboard=True --dashboard-host=0.0.0.0 --dashboard-port=8266 --ray-client-server-port=10001 --resources='{"num-cpus": 0, "num-gpus": 0}' --block
❯ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
843b1de30422 ray_head "/app/bin/entrypoint…" 4 seconds ago Up 3 seconds 0.0.0.0:6023->6023/tcp, 0.0.0.0:8266->8266/tcp, 0.0.0.0:10001->10001/tcp sharp_clarke
bf911dd78602 valkey/valkey:7.2 "docker-entrypoint.s…" 8 minutes ago Up 8 minutes 0.0.0.0:6379->6379/tcp redis-server
❯ docker logs 843b1de30422
Running ray head in the ray_env virtualenv
2024-05-22 23:29:14,619 INFO usage_lib.py:443 -- Usage stats collection is disabled.
2024-05-22 23:29:14,619 INFO scripts.py:764 -- Local node IP: 0.0.0.0
2024-05-22 23:29:16,724 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
[2024-05-22 23:29:16,761 W 9 9] (ray_init) global_state_accessor.cc:437: Retrying to get node with node ID bf9ba50a43bb4f2cfabc348d1b201f24e1fc173c0c4f5735ce31ed7d
[2024-05-22 23:29:17,764 W 9 9] (ray_init) global_state_accessor.cc:437: Retrying to get node with node ID bf9ba50a43bb4f2cfabc348d1b201f24e1fc173c0c4f5735ce31ed7d
2024-05-22 23:29:18,772 SUCC scripts.py:801 -- --------------------
2024-05-22 23:29:18,773 SUCC scripts.py:802 -- Ray runtime started.
2024-05-22 23:29:18,773 SUCC scripts.py:803 -- --------------------
Then I get stuck on ray.init:
❯ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
843b1de30422 ray_head "/app/bin/entrypoint…" 3 minutes ago Up 3 minutes 0.0.0.0:6023->6023/tcp, 0.0.0.0:8266->8266/tcp, 0.0.0.0:10001->10001/tcp sharp_clarke
bf911dd78602 valkey/valkey:7.2 "docker-entrypoint.s…" 11 minutes ago Up 11 minutes 0.0.0.0:6379->6379/tcp redis-server
❯ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 843b1de30422
172.17.0.3
❯ export RAY_REDIS_ADDRESS=172.17.0.2:6379
❯ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:34:54) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(address="172.17.0.3:6023", logging_level="debug")
2024-05-23 08:35:04,296 DEBUG worker.py:1491 -- Automatically increasing RLIMIT_NOFILE to max value of 9223372036854775807
2024-05-23 08:35:04,296 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 172.17.0.3:6023...
2024-05-23 08:35:16,344 DEBUG node.py:744 -- Connecting to GCS: Traceback (most recent call last):
File "/Users/ekami/Programs/micromamba/envs/scribble/lib/python3.11/site-packages/ray/_private/node.py", line 728, in _init_gcs_client
client = GcsClient(
^^^^^^^^^^
File "python/ray/_raylet.pyx", line 2709, in ray._raylet.GcsClient.__cinit__
File "python/ray/_raylet.pyx", line 2719, in ray._raylet.GcsClient._connect
File "python/ray/_raylet.pyx", line 590, in ray._raylet.check_status
ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNAVAILABLE: ipv4:172.17.0.3:6023: Failed to connect to remote host: FD shutdown; RPC Error details:
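The error above is a plain connection failure to 172.17.0.3:6023, so one way to narrow it down is to check whether that port is reachable at the TCP level at all, independent of Ray. Below is a minimal sketch using only the Python standard library; the helper name `can_connect` is made up for illustration and is not part of Ray. Since the client runs on macOS ("on darwin" in the banner), it may be that the container's bridge IP is not routable from the host and only the published port on 127.0.0.1 is, but that is an assumption about this setup, not something the logs confirm.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for diagnosis only: it says nothing about whether
    Ray itself is healthy, just whether the port accepts connections.
    """
    try:
        # create_connection resolves the host and opens a TCP socket,
        # raising OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (addresses taken from the post above):
# print(can_connect("172.17.0.3", 6023))  # container bridge IP
# print(can_connect("127.0.0.1", 6023))   # port published with -p 6023:6023
```

If the bridge IP fails while 127.0.0.1 succeeds, the problem is likely in how the host reaches the container rather than in Ray's own startup.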
Any idea what the issue might be? Thanks!