Ray head only connects outside Docker (simplified)

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m trying to set up Ray in a Docker container, but I can’t connect to the head node when it runs inside the container.
First, I start a Redis-compatible server (Valkey) like this:

❯ docker run -d --name redis-server -p 6379:6379 valkey/valkey:7.2
bf911dd78602ac421a39a5150903811d2e92e068d70700f8e7c58e012371a670
❯ docker ps
CONTAINER ID   IMAGE               COMMAND                  CREATED         STATUS         PORTS                    NAMES
bf911dd78602   valkey/valkey:7.2   "docker-entrypoint.s…"   6 seconds ago   Up 5 seconds   0.0.0.0:6379->6379/tcp   redis-server
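
To rule out the Redis side, the check below is a minimal sketch that sends an inline PING to the published port and looks for +PONG. The helper name redis_ping is made up for illustration, and it assumes the server has no password or TLS configured, which matches the plain docker run above.

```python
import socket


def redis_ping(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a Redis-protocol server at host:port answers PING with +PONG.

    Hypothetical helper; assumes no authentication and no TLS.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # Redis and Valkey accept inline commands terminated by CRLF.
            sock.sendall(b"PING\r\n")
            reply = sock.recv(64)
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
    return reply.startswith(b"+PONG")
```

Calling redis_ping("127.0.0.1", 6379) from the host should return True if the port mapping above works. It says nothing about whether another container can reach the same server.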

When I run the head node directly on the host (outside Docker):

ray start --head --verbose --node-ip-address=0.0.0.0 --disable-usage-stats --port=6023 --include-dashboard=True --dashboard-host=0.0.0.0 --dashboard-port=8266 --ray-client-server-port=10001 --resources='{"num-cpus": 0, "num-gpus": 0}' --block

and then connect to it from Python, I get:

❯ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:34:54) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(address="127.0.0.1:6023", logging_level="debug")
2024-05-23 07:52:20,589 DEBUG worker.py:1491 -- Automatically increasing RLIMIT_NOFILE to max value of 9223372036854775807
2024-05-23 07:52:20,590 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 127.0.0.1:6023...
2024-05-23 07:52:20,603 INFO worker.py:1740 -- Connected to Ray cluster. View the dashboard at http://0.0.0.0:8266
RayContext(dashboard_url='0.0.0.0:8266', python_version='3.11.9', ray_version='2.20.0', ray_commit='5708e75978413e46c703e44f43fd89769f3c148b')

Everything works fine.
However, when I run the same command inside a Docker container like this:

❯ docker ps
CONTAINER ID   IMAGE               COMMAND                  CREATED         STATUS         PORTS                    NAMES
bf911dd78602   valkey/valkey:7.2   "docker-entrypoint.s…"   6 minutes ago   Up 6 minutes   0.0.0.0:6379->6379/tcp   redis-server
❯ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' bf911dd78602
172.17.0.2
❯ docker run -d --cpus="4" --memory="2G" --memory-reservation="1G" --shm-size="4gb" -e RAY_REDIS_ADDRESS=172.17.0.2:6379 -p 6023:6023 -p 8266:8266 -p 10001:10001 ray_head ray start --head --verbose --node-ip-address=0.0.0.0 --disable-usage-stats --port=6023 --include-dashboard=True --dashboard-host=0.0.0.0 --dashboard-port=8266 --ray-client-server-port=10001 --resources='{"num-cpus": 0, "num-gpus": 0}' --block
❯ docker ps
CONTAINER ID   IMAGE               COMMAND                  CREATED         STATUS         PORTS                                                                      NAMES
843b1de30422   ray_head            "/app/bin/entrypoint…"   4 seconds ago   Up 3 seconds   0.0.0.0:6023->6023/tcp, 0.0.0.0:8266->8266/tcp, 0.0.0.0:10001->10001/tcp   sharp_clarke
bf911dd78602   valkey/valkey:7.2   "docker-entrypoint.s…"   8 minutes ago   Up 8 minutes   0.0.0.0:6379->6379/tcp                                                     redis-server
❯ docker logs 843b1de30422
Running ray head in the ray_env virtualenv
2024-05-22 23:29:14,619 INFO usage_lib.py:443 -- Usage stats collection is disabled.
2024-05-22 23:29:14,619 INFO scripts.py:764 -- Local node IP: 0.0.0.0
2024-05-22 23:29:16,724 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
[2024-05-22 23:29:16,761 W 9 9] (ray_init) global_state_accessor.cc:437: Retrying to get node with node ID bf9ba50a43bb4f2cfabc348d1b201f24e1fc173c0c4f5735ce31ed7d
[2024-05-22 23:29:17,764 W 9 9] (ray_init) global_state_accessor.cc:437: Retrying to get node with node ID bf9ba50a43bb4f2cfabc348d1b201f24e1fc173c0c4f5735ce31ed7d
2024-05-22 23:29:18,772 SUCC scripts.py:801 -- --------------------
2024-05-22 23:29:18,773 SUCC scripts.py:802 -- Ray runtime started.
2024-05-22 23:29:18,773 SUCC scripts.py:803 -- --------------------
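
For reference, the PORTS column in the docker ps output above lists the host endpoints that Docker forwards into the head container. The sketch below pulls those mappings out of the column text. parse_published_ports and PortMapping are hypothetical names of my own, and the pattern only handles single IPv4 mappings in the format shown in this output (no port ranges, no IPv6 entries).

```python
import re
from typing import List, NamedTuple


class PortMapping(NamedTuple):
    host_ip: str
    host_port: int
    container_port: int
    protocol: str


# Matches entries such as "0.0.0.0:6023->6023/tcp".
# The lookbehind keeps the match from starting in the middle of another token.
_MAPPING = re.compile(
    r"(?<![\w:.\]])(?P<ip>[\d.]+):(?P<hport>\d+)->(?P<cport>\d+)/(?P<proto>\w+)"
)


def parse_published_ports(ports_column: str) -> List[PortMapping]:
    """Extract IPv4 host-to-container port mappings from a `docker ps` PORTS value."""
    return [
        PortMapping(m["ip"], int(m["hport"]), int(m["cport"]), m["proto"])
        for m in _MAPPING.finditer(ports_column)
    ]
```

Fed the PORTS value of the ray_head row, this yields host ports 6023, 8266, and 10001, all bound on 0.0.0.0 of the host rather than on the container's own IP.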

Then ray.init hangs for about 12 seconds and fails with a GCS connection error:

❯ docker ps
CONTAINER ID   IMAGE               COMMAND                  CREATED          STATUS          PORTS                                                                      NAMES
843b1de30422   ray_head            "/app/bin/entrypoint…"   3 minutes ago    Up 3 minutes    0.0.0.0:6023->6023/tcp, 0.0.0.0:8266->8266/tcp, 0.0.0.0:10001->10001/tcp   sharp_clarke
bf911dd78602   valkey/valkey:7.2   "docker-entrypoint.s…"   11 minutes ago   Up 11 minutes   0.0.0.0:6379->6379/tcp                                                     redis-server
❯ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 843b1de30422
172.17.0.3
❯ export RAY_REDIS_ADDRESS=172.17.0.2:6379
❯ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:34:54) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(address="172.17.0.3:6023", logging_level="debug")
2024-05-23 08:35:04,296 DEBUG worker.py:1491 -- Automatically increasing RLIMIT_NOFILE to max value of 9223372036854775807
2024-05-23 08:35:04,296 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 172.17.0.3:6023...
2024-05-23 08:35:16,344 DEBUG node.py:744 -- Connecting to GCS: Traceback (most recent call last):
  File "/Users/ekami/Programs/micromamba/envs/scribble/lib/python3.11/site-packages/ray/_private/node.py", line 728, in _init_gcs_client
    client = GcsClient(
             ^^^^^^^^^^
  File "python/ray/_raylet.pyx", line 2709, in ray._raylet.GcsClient.__cinit__
  File "python/ray/_raylet.pyx", line 2719, in ray._raylet.GcsClient._connect
  File "python/ray/_raylet.pyx", line 590, in ray._raylet.check_status
ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNAVAILABLE: ipv4:172.17.0.3:6023: Failed to connect to remote host: FD shutdown; RPC Error details:
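
To separate a plain networking problem from a Ray problem, here is a minimal sketch that only tests whether a TCP connection can be opened to each candidate endpoint. tcp_reachable and check_endpoints are hypothetical helper names, and a successful connect to a published port only shows that something on the host accepted the connection, not that Ray's GCS is healthy behind it.

```python
import socket
from typing import Dict, Iterable, Tuple


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or no route to the host.
        return False


def check_endpoints(
    endpoints: Iterable[Tuple[str, int]], timeout: float = 3.0
) -> Dict[str, bool]:
    """Map each 'host:port' string to whether a TCP connect succeeded."""
    return {f"{h}:{p}": tcp_reachable(h, p, timeout) for h, p in endpoints}
```

For this setup I would call check_endpoints([("127.0.0.1", 6023), ("172.17.0.3", 6023)]) from the host. If only the 172.17.0.3 entry comes back False, the container's bridge IP is not reachable from the host at all, which, as far as I understand, is the usual situation with Docker Desktop on macOS and would not be specific to Ray.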

Any idea what the issue might be? Thanks!