If health checks keep failing and your Ray node dies, the likely cause is missed heartbeats between the node and the GCS (Global Control Store). This can happen when the node is overloaded, the network is slow or unreliable, or a firewall blocks traffic. The GCS marks a node as dead after a configurable number of missed heartbeats (5 by default); you can change this threshold through environment variables or Ray’s system config. Raising it only delays removal, though: persistent network or resource problems will still get the node removed. See test_abnormal_termination and GcsHealthCheckManager for details.
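As a rough illustration of the environment-variable route, the sketch below builds `RAY_<key>` overrides and estimates how long the GCS would wait before declaring a node dead. The key names (`health_check_failure_threshold`, `health_check_period_ms`, `health_check_timeout_ms`) and the example values are assumptions; check them against the config definitions for your Ray version before relying on them. The detection-time formula is only an approximation, not Ray's exact logic.

```python
import os

# Assumed Ray system-config keys and illustrative values (not recommendations).
# Verify the names against your Ray version before using them.
HEALTH_CHECK_SETTINGS = {
    "health_check_failure_threshold": 10,  # missed checks before a node is marked dead
    "health_check_period_ms": 5000,        # delay between checks
    "health_check_timeout_ms": 20000,      # how long a single check may take
}


def ray_env_overrides(settings):
    """Map system-config keys to RAY_<key> environment variable strings."""
    return {f"RAY_{key}": str(value) for key, value in settings.items()}


def worst_case_detection_seconds(settings):
    """Rough upper bound on the time before a silent node is marked dead.

    Assumes every check runs to its full timeout and the next one starts
    one period later. This is an estimate, not Ray's exact behavior.
    """
    per_check_ms = settings["health_check_period_ms"] + settings["health_check_timeout_ms"]
    return settings["health_check_failure_threshold"] * per_check_ms / 1000.0


if __name__ == "__main__":
    overrides = ray_env_overrides(HEALTH_CHECK_SETTINGS)
    # The variables must be in the environment of the head node process
    # before Ray starts (e.g. before `ray start --head`).
    os.environ.update(overrides)
    for name, value in overrides.items():
        print(f"export {name}={value}")
    print(f"approx. worst-case detection: {worst_case_detection_seconds(HEALTH_CHECK_SETTINGS):.0f} s")
```

Larger values make the cluster more tolerant of short stalls but also slower to notice nodes that are really gone, so it is worth changing one setting at a time.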
If you have already increased the heartbeat thresholds and timeouts and the problem persists, the root cause is most likely environmental: network or firewall configuration, VM resource limits, or OS-level issues. For example, users have reported the same symptoms on WSL2/Windows 11 because of networking and firewall problems, and on clusters with overloaded nodes or slow interconnects. See the discussion and the issue listed under Sources for more troubleshooting tips. Would you like more detail on how to adjust the heartbeat parameters or debug your environment?
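A quick first check on the environmental side is whether the nodes can open TCP connections to each other at all, and how long that takes. The sketch below is a generic socket probe, not a Ray API. The host address is a placeholder, and port 6379 is assumed to be the head node's GCS port (Ray's usual default); substitute the values from your own cluster.

```python
import socket
import time


def probe_tcp(host, port, timeout_s=3.0):
    """Return the TCP connect latency in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return None


if __name__ == "__main__":
    HEAD_IP = "10.0.0.1"  # hypothetical head node address
    GCS_PORT = 6379       # assumed default head port; adjust if you changed it

    # Probe several times: intermittent failures or large latency spikes
    # point at the network or firewall rather than at Ray itself.
    for attempt in range(5):
        latency = probe_tcp(HEAD_IP, GCS_PORT)
        if latency is None:
            print(f"attempt {attempt}: connect failed")
        else:
            print(f"attempt {attempt}: connected in {latency * 1000:.1f} ms")
        time.sleep(1)
```

Running the probe from a worker toward the head, and from the head toward the worker's node manager port, covers both directions. Pinning that port with `ray start --node-manager-port` makes it easier to open in a firewall.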
Sources:
- test_abnormal_termination
- GcsHealthCheckManager
- health check failed errors (Ray Discuss)
- Exiting because this node manager has mistakenly been marked as dead by the GCS