Ray head connects only outside Docker

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m trying to set up Ray in a Docker container, but I’m struggling to connect to it when it’s running inside the container.
When I run:

ray start --head \
    --disable-usage-stats \
    --port=6026 \
    --include-dashboard=True \
    --dashboard-host=0.0.0.0 \
    --dashboard-port=8265 \
    --resources='{"num-cpus": 0, "num-gpus": 0}' \
    --node-ip-address=127.0.0.1 \
    --ray-client-server-port=10001 \
    --dashboard-port=8266 \
    --block

I get:

Usage stats collection is disabled.

Local node IP: 127.0.0.1

--------------------
Ray runtime started.
--------------------

Next steps

  To connect to this Ray cluster:
    import ray
    ray.init(_node_ip_address='127.0.0.1')

  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://127.0.0.1:8266' ray job submit --working-dir . -- python my_script.py

  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
  for more information on submitting Ray jobs to the Ray cluster.

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  To monitor and debug Ray, view the dashboard at
    127.0.0.1:8266

  If connection to the dashboard fails, check your firewall settings and network configuration.

--block
  This command will now block forever until terminated by a signal.
  Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

And when I connect to it:

Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:34:54) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(address="127.0.0.1:6026", logging_level="debug")
2024-05-20 08:02:18,164 DEBUG worker.py:1491 -- Automatically increasing RLIMIT_NOFILE to max value of 9223372036854775807
2024-05-20 08:02:18,165 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 127.0.0.1:6026...
2024-05-20 08:02:18,181 INFO worker.py:1740 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8266
RayContext(dashboard_url='127.0.0.1:8266', python_version='3.11.9', ray_version='2.20.0', ray_commit='5708e75978413e46c703e44f43fd89769f3c148b')

Everything works fine.
However, if I use the following script to start the Ray head inside my Docker container (the script does essentially the same thing):

#!/bin/bash

set -ef -o pipefail

echo "Running ray head in the ray_env virtualenv"
# num-cpus=0 to avoid having the head node schedule tasks on itself
mamba run -n ray_env ray start \
    --head \
    --verbose \
    --node-ip-address=0.0.0.0 \
    --disable-usage-stats \
    --port=6023 \
    --include-dashboard=True \
    --dashboard-host=0.0.0.0 \
    --dashboard-port=8265 \
    --ray-client-server-port=10001 \
    --resources='{"num-cpus": 0, "num-gpus": 0}' \
    --block
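
As a sanity check from the host, a minimal probe of whether the published ports accept TCP connections at all would look something like this (a sketch assuming the 6023/8265/10001 mappings in the compose file below):

# Sketch: probe the ports Docker publishes on localhost. The port list
# assumes the 6023:6023, 8265:8265, and 10001:10001 mappings below.
import socket

for port in (6023, 8265, 10001):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(3)
        try:
            s.connect(("127.0.0.1", port))
            print(f"port {port}: open")
        except OSError as exc:
            print(f"port {port}: {exc}")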

and then I start the container with Docker Compose like this:

networks:
  scribble:
    name: scribble
    driver: bridge

services:
  ray-head:
    image: ray_head
    restart: unless-stopped
    build:
      context: ./ray_head
      dockerfile: Dockerfile
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 2G
        reservations:
          memory: 1G
    shm_size: '4gb'
    environment:
      - RAY_HEAD=1
      - RAY_REDIS_ADDRESS=valkey:6379
    ports:
      - 6023:6023
      - "0.0.0.0:8265:8265"
      - 10001:10001
    depends_on:
      valkey:
        condition: service_healthy
    networks:
      - scribble

The Ray head runs fine in the container, as seen below:

❯ docker logs aa679e2e2cac
Running ray head in the ray_env virtualenv
Node IP address: 192.168.0.7
2024-05-19 22:53:47,744 INFO usage_lib.py:443 -- Usage stats collection is disabled.
2024-05-19 22:53:47,744 INFO scripts.py:764 -- Local node IP: 192.168.0.7
2024-05-19 22:53:48,759 INFO node.py:377 -- Could not retrieve session key from storage.
2024-05-19 22:53:50,415 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-05-19 22:53:50,535 SUCC scripts.py:801 -- --------------------
2024-05-19 22:53:50,535 SUCC scripts.py:802 -- Ray runtime started.
2024-05-19 22:53:50,535 SUCC scripts.py:803 -- --------------------
2024-05-19 22:53:50,535 INFO scripts.py:805 -- Next steps
2024-05-19 22:53:50,535 INFO scripts.py:808 -- To add another node to this Ray cluster, run
2024-05-19 22:53:50,535 INFO scripts.py:811 --   ray start --address='192.168.0.7:6023'
2024-05-19 22:53:50,535 INFO scripts.py:820 -- To connect to this Ray cluster:
2024-05-19 22:53:50,535 INFO scripts.py:822 -- import ray
2024-05-19 22:53:50,535 INFO scripts.py:823 -- ray.init()
2024-05-19 22:53:50,535 INFO scripts.py:835 -- To submit a Ray job using the Ray Jobs CLI:
2024-05-19 22:53:50,535 INFO scripts.py:836 --   RAY_ADDRESS='http://192.168.0.7:8265' ray job submit --working-dir . -- python my_script.py
2024-05-19 22:53:50,535 INFO scripts.py:845 -- See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
2024-05-19 22:53:50,535 INFO scripts.py:849 -- for more information on submitting Ray jobs to the Ray cluster.
2024-05-19 22:53:50,535 INFO scripts.py:854 -- To terminate the Ray runtime, run
2024-05-19 22:53:50,535 INFO scripts.py:855 --   ray stop
2024-05-19 22:53:50,535 INFO scripts.py:858 -- To view the status of the cluster, use
2024-05-19 22:53:50,535 INFO scripts.py:859 --   ray status
2024-05-19 22:53:50,535 INFO scripts.py:863 -- To monitor and debug Ray, view the dashboard at
2024-05-19 22:53:50,535 INFO scripts.py:864 --   192.168.0.7:8265
2024-05-19 22:53:50,535 INFO scripts.py:871 -- If connection to the dashboard fails, check your firewall settings and network configuration.
2024-05-19 22:53:50,536 INFO scripts.py:972 -- --block
2024-05-19 22:53:50,536 INFO scripts.py:973 -- This command will now block forever until terminated by a signal.
2024-05-19 22:53:50,536 INFO scripts.py:976 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

But whenever I try to connect to it from the host, it doesn’t work:

❯ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:34:54) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(address="192.168.32.6:6023", logging_level="debug")
2024-05-20 08:30:36,603 DEBUG worker.py:1491 -- Automatically increasing RLIMIT_NOFILE to max value of 9223372036854775807
2024-05-20 08:30:36,605 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 192.168.32.6:6023...
2024-05-20 08:30:48,649 DEBUG node.py:744 -- Connecting to GCS: Traceback (most recent call last):
  File "/Users/ekami/Programs/micromamba/envs/scribble/lib/python3.11/site-packages/ray/_private/node.py", line 728, in _init_gcs_client
    client = GcsClient(
             ^^^^^^^^^^
  File "python/ray/_raylet.pyx", line 2709, in ray._raylet.GcsClient.__cinit__
  File "python/ray/_raylet.pyx", line 2719, in ray._raylet.GcsClient._connect
  File "python/ray/_raylet.pyx", line 590, in ray._raylet.check_status
ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNAVAILABLE: ipv4:192.168.32.6:6023: Failed to connect to remote host: FD shutdown; RPC Error details:
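
Note that 192.168.32.6 is the container’s address on the Docker bridge network; since the client runs on macOS (Docker Desktop), my understanding is that bridge IPs are not routable from the host, so the ports published on 127.0.0.1 may be the only reachable endpoints. For comparison, the equivalent attempt through the published GCS port would be (hypothetical; the same driver-mode caveats would apply):

import ray

# Hypothetical alternative: target the GCS port that Docker publishes on
# localhost rather than the container's bridge-network IP.
ray.init(address="127.0.0.1:6023", logging_level="debug")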

Any idea what the issue might be? Thanks!