[Serve] `ray start --head --node-ip-address <ip>` does not work correctly in Docker, and it's unclear which ports need to be open

I'm trying to connect worker nodes deployed via Docker to the head node. I ran into a number of problems, which I was able to solve locally by setting network_mode: host on the containers. But the servers run a firewall, so that workaround is not an option there and I don't see how to solve the problem. I can't find any logs explaining why the node connection is dropped about 10 seconds after connecting. Also, I can't bring up the head node when specifying --node-ip-address x.x.x.x: I get the error "RuntimeError: Failed to start GCS. Last 0 lines of error files:".

x.x.x.x is the external IP.

# docker-compose

version: "3.7"

services:
  node:
    image: <my-image>
    env_file:
      - .env
    environment:
      - RAY_num_heartbeats_timeout=300
      - RAY_CONFIG_CREATING_NODE=--head --metrics-export-port 9088 --dashboard-agent-listen-port 8266 --dashboard-agent-grpc-port 9266  --min-worker-port 10002 --max-worker-port 10010 --dashboard-host=0.0.0.0 --port 6378 --redis-shard-ports 6099 --dashboard-grpc-port 9265  --num-cpus=5 --node-ip-address x.x.x.x
    volumes:
      - ./data:/data
    runtime: nvidia
    restart: unless-stopped
    privileged: true
    # network_mode: "host"
    ports:
      - 8265:8265
      - 9122:9122
      - 6378:6378
      - 8099:8099
      - 9099:9099
      - 9265:9265
      - 6099:6099
      - 9266:9266
      - 8266:8266
      - 9088:9088
      - 10001-10010:10001-10010

Then, using the same docker-compose file, I connect another node, changing only RAY_CONFIG_CREATING_NODE="--address x.x.x.x:6378 --dashboard-grpc-port 9265 --metrics-export-port 9088 --dashboard-agent-listen-port 8266 --dashboard-agent-grpc-port 9266 --min-worker-port 10002 --max-worker-port 10010 --num-cpus=5 --node-ip-address y.y.y.y" (a sketch of the resulting worker service is below).
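To make the diff explicit, this is roughly what the worker node's service looks like with that change. The image and entrypoint are my own, so treat the exact wiring of RAY_CONFIG_CREATING_NODE into `ray start` as specific to my setup; the remaining keys stay the same as in the head compose above:

# docker-compose (worker node; only the environment differs from the head file)
services:
  node:
    image: <my-image>
    env_file:
      - .env
    environment:
      - RAY_num_heartbeats_timeout=300
      - RAY_CONFIG_CREATING_NODE=--address x.x.x.x:6378 --dashboard-grpc-port 9265 --metrics-export-port 9088 --dashboard-agent-listen-port 8266 --dashboard-agent-grpc-port 9266 --min-worker-port 10002 --max-worker-port 10010 --num-cpus=5 --node-ip-address y.y.y.y
    # volumes, runtime, restart, privileged and ports are identical to the head service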

After about 15 seconds, the connection between the nodes breaks.
The node with node id: <id> and address: y.y.y.y and node name: y.y.y.y has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload.

docker compose exec node serve start --proxy-location EveryNode \
        --http-host 0.0.0.0 --http-port 8099 --grpc-port 9099 \
        --grpc-servicer-functions dto.test_pb2_grpc.add_TestServicer_to_server

2023-12-20 19:14:33,573 INFO worker.py:1489 -- Connecting to existing Ray cluster at address: x.x.x.x:6378...
2023-12-20 19:14:33,591 INFO worker.py:1664 -- Connected to Ray cluster. View the dashboard at http://192.168.208.2:8265 
[2023-12-20 19:14:42,602 E 188 264] core_worker_process.cc:216: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:y.y.y.y:40985: Failed to connect to remote host: Connection refused; RPC Error details:  .Please see `raylet.out` for more details.

I’m guessing it’s a port problem. Which ports should be opened? How do I remove the randomness and configure the exact port values?

Hi @psydok ,

Can you try exposing port 6379?

Btw, you can specify the port number in the startup CLI with "--port xxx". (ref: Cluster Management CLI — Ray 2.9.0)
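For example, assuming a Ray 2.x CLI where all of these flags are available (the port numbers are placeholders, not a verified config), pinning every port instead of letting Ray pick them at random could look roughly like this:

# head node: pin every port Ray would otherwise choose randomly
ray start --head \
    --port 6378 \
    --node-manager-port 10101 \
    --object-manager-port 10102 \
    --ray-client-server-port 10001 \
    --dashboard-port 8265 \
    --dashboard-agent-listen-port 8266 \
    --metrics-export-port 9088 \
    --min-worker-port 10002 --max-worker-port 10010 \
    --node-ip-address x.x.x.x

# worker node: same idea, pointing at the head's GCS port
ray start --address x.x.x.x:6378 \
    --node-manager-port 10101 \
    --object-manager-port 10102 \
    --metrics-export-port 9088 \
    --min-worker-port 10002 --max-worker-port 10010 \
    --node-ip-address y.y.y.y

Every port pinned this way also has to be published from the container and allowed through the firewall on both machines.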

Thanks for the reply!
Yes, the head node is already exposed on port 6378 (set via --port 6378, as you suggested). I can't use 6379 because another service on the server already occupies that port…

Hello, I'm attempting to manage a GPU cluster using Ray and I've run into exactly the same issue.
My head node and workers are deployed on different servers, with the worker nodes running inside Docker containers. The workers disconnect a few seconds after they connect to the head node, exactly the behaviour you describe.
Have you found out how to resolve this?


I haven't solved it yet. But I'm thinking of trying to add an iptables rule for requests coming from the Docker network (something like the sketch below).
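Something along these lines is what I have in mind. 192.168.208.0/20 is only a guess at the compose network's subnet, based on the dashboard address above; whatever `docker network inspect` reports would need to go there instead, and this is an untested sketch rather than a verified fix:

# allow traffic originating from the compose network through the host firewall
sudo iptables -I INPUT -s 192.168.208.0/20 -j ACCEPT
sudo iptables -I FORWARD -s 192.168.208.0/20 -j ACCEPT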

Did you solve this problem? I'm facing the same issue.

I'm also facing this problem… is there an issue tracking it on the Ray GitHub?

Edit: I opened up ports 10000-10099, since on a Docker-less setup with 3 workers the controller took over roughly ports 10000-10054. This solved my issue for a small Ray cluster of 3 (I also increased --max-worker-port to 10099). Maybe the heartbeat evaluation happens over another port that falls outside the small 10000-10010 range.
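Roughly what the change looked like on my side; the 10000-10099 range is just what covered my 3-node setup, not a magic number:

# docker-compose: publish the widened worker-port range
    ports:
      - 10000-10099:10000-10099

# ray start flags: widen the worker port range to match
ray start ... --max-worker-port 10099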