How can I specify the port number of the health check?

RunLLM · November 26, 2025, 12:18pm

If health checks are still failing and your Ray node dies, the likely cause is missed heartbeats between the node and the GCS (Global Control Service). This can happen when the node is overloaded, the network is slow or unreliable, or a firewall blocks traffic. The GCS marks a node as dead after it misses a configurable number of heartbeats (5 failures by default); you can adjust this threshold via environment variables or Ray’s system config. Raising it only adds tolerance, though: persistent network or resource problems will still get the node removed regardless of these settings. See test_abnormal_termination and GcsHealthCheckManager for details.
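To make the threshold behaviour concrete, here is a toy model of a failure counter. It is only a sketch, not Ray's actual GcsHealthCheckManager: the names `NodeHealth` and `survives` are hypothetical, and it assumes that failures are counted consecutively and that one successful check resets the counter.

```python
from dataclasses import dataclass

# Default taken from the post above; everything else here is illustrative.
DEFAULT_FAILURE_THRESHOLD = 5


@dataclass
class NodeHealth:
    """Toy health tracker (hypothetical, not a Ray class)."""

    failure_threshold: int = DEFAULT_FAILURE_THRESHOLD
    consecutive_failures: int = 0
    dead: bool = False

    def record(self, check_ok: bool) -> bool:
        """Record one health-check result; return True while the node is alive."""
        if self.dead:
            # Once marked dead, the node stays dead in this model.
            return False
        if check_ok:
            # Assumption: a successful check resets the failure counter.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.dead = True
        return not self.dead


def survives(results, failure_threshold=DEFAULT_FAILURE_THRESHOLD) -> bool:
    """Replay a sequence of check results (True = passed) and report survival."""
    node = NodeHealth(failure_threshold=failure_threshold)
    for ok in results:
        node.record(ok)
    return not node.dead
```

Under these assumptions, a higher threshold lets a node ride out a short burst of failed checks, but an unbroken run of failures still removes it once the run reaches the threshold.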

If you have already raised the heartbeat thresholds and timeouts and the problem persists, the root cause is probably environmental: network or firewall configuration, VM resource limits, or OS-level issues. Users have reported similar failures on WSL2/Windows 11 caused by networking and firewall problems, and on clusters with overloaded nodes or slow interconnects. See this discussion and this issue for more troubleshooting tips. Would you like more detail on how to adjust the heartbeat parameters or debug your environment?
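As a first step when debugging the environment, a plain TCP reachability check run from the worker machine can rule a blocked port in or out. This is a generic sketch using only the Python standard library. The helper `can_connect` is hypothetical, and the host and port in the usage block are placeholders (6379 is assumed here as the head node's GCS port; substitute whatever your cluster uses).

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    HEAD_HOST = "127.0.0.1"  # placeholder: replace with your head node address
    GCS_PORT = 6379          # assumption: replace with your cluster's GCS port
    status = "reachable" if can_connect(HEAD_HOST, GCS_PORT) else "NOT reachable"
    print(f"{HEAD_HOST}:{GCS_PORT} is {status}")
```

A successful connection only shows that the port is open in one direction. It says nothing about latency or about traffic from the head node back to the worker, so it narrows the search rather than settling it.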

