I’m trying to connect Ray nodes deployed via Docker to the master (head) node. I ran into a number of problems, which I was able to solve locally by setting network_mode: host on the containers. But the servers run a firewall, and I don’t understand how to solve the problem there. I can’t find any logs explaining why the node connection breaks about 10 seconds after it is established. Also, I can’t bring up the head node when I specify --node-ip-address x.x.x.x (where x.x.x.x is the external IP). I get the error “RuntimeError: Failed to start GCS. Last 0 lines of error files:”
# docker-compose
version: "3.7"
services:
  node:
    image: <my-image>
    env_file:
      - .env
    environment:
      - RAY_num_heartbeats_timeout=300
      - RAY_CONFIG_CREATING_NODE=--head --metrics-export-port 9088 --dashboard-agent-listen-port 8266 --dashboard-agent-grpc-port 9266 --min-worker-port 10002 --max-worker-port 10010 --dashboard-host=0.0.0.0 --port 6378 --redis-shard-ports 6099 --dashboard-grpc-port 9265 --num-cpus=5 --node-ip-address x.x.x.x
    volumes:
      - ./data:/data
    runtime: nvidia
    restart: unless-stopped
    privileged: true
    # network_mode: "host"
    ports:
      - 8265:8265
      - 9122:9122
      - 6378:6378
      - 8099:8099
      - 9099:9099
      - 9265:9265
      - 6099:6099
      - 9266:9266
      - 8266:8266
      - 9088:9088
      - 10001-10010:10001-10010
Then with the same docker-compose I connect another node by changing only RAY_CONFIG_CREATING_NODE="--address x.x.x.x:6378 --dashboard-grpc-port 9265 --metrics-export-port 9088 --dashboard-agent-listen-port 8266 --dashboard-agent-grpc-port 9266 --min-worker-port 10002 --max-worker-port 10010 --num-cpus=5 --node-ip-address y.y.y.y"
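As a sanity check on the configuration above, a short script can compare the ports fixed in the flags against the ports published in the compose file. This is only a sketch: the names `HEAD_FLAGS`, `PUBLISHED`, `ports_from_flags`, and `unpublished` are hypothetical helpers, and the regular expression assumes every port flag is written as `--<name>-port N`, `--<name>-port=N`, or `--port N`, as in the flags shown here. The worker's flag string can be checked the same way.

```python
import re

# Port-related flags copied from the head node's RAY_CONFIG_CREATING_NODE
# (the --node-ip-address part is left out because it carries no port).
HEAD_FLAGS = (
    "--head --metrics-export-port 9088 --dashboard-agent-listen-port 8266 "
    "--dashboard-agent-grpc-port 9266 --min-worker-port 10002 "
    "--max-worker-port 10010 --dashboard-host=0.0.0.0 --port 6378 "
    "--redis-shard-ports 6099 --dashboard-grpc-port 9265 --num-cpus=5"
)

# Host ports published in the `ports:` section of the compose file.
PUBLISHED = {
    8265, 9122, 6378, 8099, 9099, 9265, 6099, 9266, 8266, 9088,
    *range(10001, 10011),  # 10001-10010 inclusive
}


def ports_from_flags(flags: str) -> dict[str, int]:
    """Return {flag_name: port} for every flag whose name ends in port/ports."""
    pattern = r"--([\w-]*ports?)[ =](\d+)"
    return {name: int(value) for name, value in re.findall(pattern, flags)}


def unpublished(flags: str, published: set[int]) -> dict[str, int]:
    """Return the flags whose port is not in the published set."""
    return {
        name: port
        for name, port in ports_from_flags(flags).items()
        if port not in published
    }


if __name__ == "__main__":
    # An empty dict means every port fixed in the flags is also published.
    print(unpublished(HEAD_FLAGS, PUBLISHED))
```

This only covers ports that are set explicitly. Any port that is left unset and picked at random at startup will not show up in this comparison.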
After about 15 seconds, the connection between the nodes breaks.
The node with node id: <id> and address: y.y.y.y and node name: y.y.y.y has been marked dead because the detector has missed too many heartbeats from it. This can happen when a
(1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
docker compose exec node serve start --proxy-location EveryNode \
--http-host 0.0.0.0 --http-port 8099 --grpc-port 9099 \
--grpc-servicer-functions dto.test_pb2_grpc.add_TestServicer_to_server
2023-12-20 19:14:33,573 INFO worker.py:1489 -- Connecting to existing Ray cluster at address: x.x.x.x:6378...
2023-12-20 19:14:33,591 INFO worker.py:1664 -- Connected to Ray cluster. View the dashboard at http://192.168.208.2:8265
[2023-12-20 19:14:42,602 E 188 264] core_worker_process.cc:216: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:y.y.y.y:40985: Failed to connect to remote host: Connection refused; RPC Error details: .Please see `raylet.out` for more details.
I’m guessing it’s a port problem. Which ports should be opened? How do I remove the randomness and configure the exact port values?
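Port 40985 in the log above is not among the ports fixed in the flags or published in the compose file, so it was presumably chosen at random. `ray start` also accepts `--node-manager-port` and `--object-manager-port`, neither of which is set in the flags shown; whether one of them accounts for 40985 is an assumption, not something the logs confirm. To see which ports are actually reachable through the firewall, a plain TCP probe run from one host against the other may help. The sketch below uses only the Python standard library; `is_port_open` and `closed_ports` are hypothetical helper names, and the host and port list in the usage comment are placeholders taken from this post.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def closed_ports(host: str, ports: list[int], timeout: float = 2.0) -> list[int]:
    """Return the subset of ports on host that did not accept a connection."""
    return [p for p in ports if not is_port_open(host, p, timeout)]


# Example usage (run on the head node, replacing y.y.y.y with the worker's IP):
#   fixed = [8266, 9266, 9088, *range(10002, 10011)]
#   print(closed_ports("y.y.y.y", fixed))
```

A port that nothing is listening on looks the same as a port blocked by the firewall in this test, so it is most useful while the Ray processes are running.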