Frequent "node marked dead" errors after accumulating many succeeded tasks and dead actors in Ray Train

This issue is likely due to Ray’s internal metadata (dead nodes, dead actors, finished tasks, etc.) accumulating in the GCS. In principle this is benign: Ray broadcasts node status changes (including node deaths) to all live nodes, and dead-node metadata is garbage-collected periodically, so it should cost little more than some GCS memory overhead. In practice, though, if the GCS or a Raylet becomes overloaded (e.g., by too many actors, tasks, or dead-node records), heartbeats can be delayed and healthy nodes can be falsely marked as dead, which matches the pattern in your logs and reports from others running long-lived clusters with high metadata churn. Limiting the size of the GCS metadata caches (see the sketch below) can help, but it may not fully resolve the issue if the GCS remains overloaded or if dead-node cleanup cannot keep up with cluster churn. There is no user-exposed command to force immediate GC of dead node or actor metadata.
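If you want to experiment with capping those caches, the relevant knobs live in Ray’s internal system config. A minimal sketch is below; it assumes the config keys `maximum_gcs_destroyed_actor_cached_count` and `maximum_gcs_dead_node_cached_count`, which come from Ray’s system config defaults and may differ across Ray versions, so verify them against the version you are running rather than treating this as a guaranteed fix.

```python
import ray

# Sketch: cap how much dead-actor / dead-node metadata the GCS retains.
# NOTE: these config keys are taken from Ray's internal system config and
# may change between releases -- check them against your Ray version.
# _system_config only takes effect on the process that starts the cluster
# (i.e., the head node), not when connecting to an existing cluster.
ray.init(
    _system_config={
        # Keep at most this many destroyed-actor records in the GCS cache.
        "maximum_gcs_destroyed_actor_cached_count": 10_000,
        # Keep at most this many dead-node records in the GCS cache.
        "maximum_gcs_dead_node_cached_count": 100,
    }
)
```

The same overrides can usually be applied via environment variables on the head node before `ray start` (e.g., `RAY_maximum_gcs_destroyed_actor_cached_count=10000`), again assuming the key names match your Ray version.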

If you want more detail, I can explain the relevant Ray internals and possible workarounds.

Hint: Mention @RunLLM in the post for followups.