I have two Windows servers (192.168.1.11 and 192.168.1.12) and am trying to run a Ray Docker container (image tag = 2.35.0-py312-gpu) on each server.
Steps
- I run these two commands to start the Ray process. I confirm 192.168.1.11:8265 (the dashboard) shows the worker node (192.168.1.12).
# Run this in 192.168.1.11
$ ray start --head --dashboard-host=0.0.0.0
# Run this in 192.168.1.12
$ ray start --address=192.168.1.11:6379 --node-ip-address=192.168.1.12
- However, about 30 seconds after I complete Step 1, the status of the worker node becomes DEAD.
- I find that gcs_server.out has these lines below. It seems that the head node fails to access 192.168.1.12:39091.
[2024-09-13 04:23:52,090 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 4, status 4, response status 0, status message Deadline Exceeded, status details
[2024-09-13 04:23:57,115 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 3, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:00,115 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 2, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:03,116 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 1, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:06,116 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 0, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
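To confirm from inside the head container that the port in the log is actually unreachable, a plain TCP connection test is enough. The following is only a minimal sketch: `can_connect` is a hypothetical helper name, not part of Ray, and it checks TCP reachability only, not whether the Ray health check itself would pass.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling `can_connect("192.168.1.12", 39091)` on the head node while the worker is still running would show whether the firewall or Docker port mapping blocks that port. The port value has to be taken from the current gcs_server.out, since it differs on every run.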
Problem
The problem is that the port number (39091 in 192.168.1.12:39091) changes every time, and I can't find any way to specify this port in the documentation (Configuring Ray — Ray 2.38.0). I need to know in advance which port will be used in order to set up Windows Defender Firewall and Docker's -p option.
Is there a good way to solve this problem?