How can I specify the port number used for health checks?

I have two Windows servers (192.168.1.11 and 192.168.1.12) and am trying to run a Ray Docker container (image tag = 2.35.0-py312-gpu) on each server.

Steps

  1. I run these two commands to start the Ray process. I confirm 192.168.1.11:8265 (the dashboard) shows the worker node (192.168.1.12).
# Run this in 192.168.1.11
$ ray start --head --dashboard-host=0.0.0.0
# Run this in 192.168.1.12
$ ray start --address=192.168.1.11:6379 --node-ip-address=192.168.1.12
  2. However, about 30 seconds after I complete Step 1, the status of the worker node becomes DEAD.

  3. I find that gcs_server.out contains the lines below. It seems that the head node fails to connect to 192.168.1.12:39091.

[2024-09-13 04:23:52,090 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 4, status 4, response status 0, status message Deadline Exceeded, status details
[2024-09-13 04:23:57,115 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 3, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:00,115 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 2, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:03,116 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 1, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details
[2024-09-13 04:24:06,116 W 2925 2925] (gcs_server) gcs_health_check_manager.cc:108: Health check failed for node f7d09b9af5a7100e0376fad74db65ce7189372757f494c4525d6f147, remaining checks 0, status 14, response status 0, status message failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.1.12:39091: Failed to connect to remote host: FD Shutdown, status details

Problem

The problem is that the port number (39091 in 192.168.1.12:39091) changes every time, and I can’t find any way to specify this port in the documentation (the "Configuring Ray" page for Ray 2.52.0). I need to know in advance which port will be used in order to set up Windows Defender Firewall and Docker’s -p option.

Is there a good way to solve this problem?

IIUC, you need to specify the raylet port for 192.168.1.12? I think you can use --node-manager-port for that.

https://docs.ray.io/en/latest/ray-core/configure.html#ports-configurations
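As a rough sketch of this suggestion, the worker's ports can be pinned so that the same numbers go into the Windows Defender Firewall rules and Docker's -p option. The port numbers below (6380, 6381, 10002-10020) are arbitrary placeholders, not Ray defaults, and the list is not exhaustive; the ports configuration page linked above lists the other ports a node opens. The script only prints the resulting flags and command so they can be reviewed before running.

```shell
# Hypothetical fixed ports; any free ports work as long as the firewall
# rules and Docker's -p mappings use the same numbers.
NODE_MANAGER_PORT=6380     # raylet port that the GCS health check connects to
OBJECT_MANAGER_PORT=6381   # object transfers between nodes
MIN_WORKER_PORT=10002      # start of the worker process port range
MAX_WORKER_PORT=10020      # end of the worker process port range

# Docker publish flags matching the ports above (host port = container port).
DOCKER_PORTS="-p ${NODE_MANAGER_PORT}:${NODE_MANAGER_PORT}"
DOCKER_PORTS="${DOCKER_PORTS} -p ${OBJECT_MANAGER_PORT}:${OBJECT_MANAGER_PORT}"
DOCKER_PORTS="${DOCKER_PORTS} -p ${MIN_WORKER_PORT}-${MAX_WORKER_PORT}:${MIN_WORKER_PORT}-${MAX_WORKER_PORT}"

# Worker-side command from Step 1, extended with explicit port flags.
RAY_START="ray start --address=192.168.1.11:6379 --node-ip-address=192.168.1.12"
RAY_START="${RAY_START} --node-manager-port=${NODE_MANAGER_PORT}"
RAY_START="${RAY_START} --object-manager-port=${OBJECT_MANAGER_PORT}"
RAY_START="${RAY_START} --min-worker-port=${MIN_WORKER_PORT}"
RAY_START="${RAY_START} --max-worker-port=${MAX_WORKER_PORT}"

# Print for review instead of executing.
echo "docker run flags on 192.168.1.12: ${DOCKER_PORTS}"
echo "inside the container: ${RAY_START}"
```

With the ports fixed this way, the head node's health check should target 192.168.1.12:6380 instead of a random port such as 39091.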

That does not seem to do the trick. Health checks still fail for me, and then the node dies.

If health checks are still failing and your Ray node dies, the node is most likely failing consecutive health checks from the GCS (Global Control Service), which runs on the head node. This can happen if the node is overloaded, the network is slow or unreliable, or a firewall blocks the connection. The GCS marks a node as dead after a configurable number of consecutive failed checks (5 by default, which matches the countdown from "remaining checks 4" to "remaining checks 0" in the log above); the threshold can be adjusted via environment variables or Ray’s system config. However, persistent network or resource issues will still cause node removal regardless of these settings. See test_abnormal_termination and GcsHealthCheckManager for details.
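A minimal sketch of the environment-variable route, assuming the settings are exposed as RAY_-prefixed variables named after Ray's internal health check config entries. The variable names and the values below are assumptions for illustration; verify them against the Ray version in use. They would be set on the head node (192.168.1.11) before starting Ray, since the GCS is the side that performs the checks.

```shell
# Assumed variable names (RAY_ prefix + internal config name); the values
# are illustrative, not recommendations.
export RAY_health_check_failure_threshold=10   # failed checks before the node is marked DEAD
export RAY_health_check_period_ms=5000         # pause between checks, in milliseconds
export RAY_health_check_timeout_ms=20000       # per-check deadline, in milliseconds
export RAY_health_check_initial_delay_ms=10000 # wait before the first check, in milliseconds

# Show what will be inherited by the Ray processes.
env | grep '^RAY_health_check_' | sort

# Then start the head node as in Step 1:
# ray start --head --dashboard-host=0.0.0.0
```

With Docker, the same variables can be passed to the container with -e. Loosening these values only buys time: if the head node cannot connect to the worker's raylet port at all, the node will still be marked DEAD.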

If you have already increased the heartbeat thresholds and timeouts but the problem persists, the root cause is likely environmental (e.g., network/firewall, VM resource limits, or OS-level issues). For example, users have reported similar issues on WSL2/Windows 11 due to networking/firewall problems, and on clusters with overloaded nodes or slow interconnects. See this discussion and this issue for more troubleshooting tips. Would you like more detail on how to adjust the heartbeat parameters or debug your environment?
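To narrow down whether the cause is the network or firewall, one option is to test from the head node whether the worker's raylet port accepts TCP connections at all. The sketch below assumes bash and the coreutils timeout command are available in the container; check_port is a hypothetical helper name, and 39091 is the port from the log above, to be replaced with whatever port the current run reports or was passed to --node-manager-port.

```shell
# check_port HOST PORT: succeeds only if a TCP connection opens within 3 seconds.
check_port() {
    if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "reachable: $1:$2"
    else
        echo "NOT reachable: $1:$2"
        return 1
    fi
}

# Run on the head node (192.168.1.11) against the worker's raylet port.
check_port 192.168.1.12 39091 || echo "check the firewall rules and Docker's -p mappings"
```

If the port is not reachable while the worker's raylet is running, the health check failures are a connectivity problem rather than an overloaded node, and tuning thresholds will not help.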
