Frequent "node marked dead" errors after accumulating many succeeded tasks and dead actors in Ray Train

This issue is likely due to Ray’s internal metadata (dead nodes, dead actors, finished tasks, etc.) accumulating in the GCS. In principle this is benign: Ray broadcasts node status changes (including node deaths) to all live nodes, and dead-node metadata is garbage-collected periodically, so it should cost little more than some GCS memory overhead. In practice, though, if the GCS or a Raylet becomes overloaded (e.g., by too many actors, tasks, or dead-node records), heartbeats can be delayed and healthy nodes can be falsely marked as dead, which matches the pattern in your logs and reports from others running long-lived clusters with high metadata churn. Limiting the size of the GCS metadata caches (see the sketch below) can help, but it may not fully resolve the issue if the GCS remains overloaded or if dead-node cleanup cannot keep up with cluster churn. There is no user-exposed command to force immediate GC of dead node or actor metadata.
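If you want to experiment with capping those caches, the relevant knobs live in Ray’s internal system config. A minimal sketch is below; it assumes the config keys `maximum_gcs_destroyed_actor_cached_count` and `maximum_gcs_dead_node_cached_count`, which come from Ray’s system config defaults and may differ across Ray versions, so verify them against the version you are running rather than treating this as a guaranteed fix.

```python
import ray

# Sketch: cap how much dead-actor / dead-node metadata the GCS retains.
# NOTE: these config keys are taken from Ray's internal system config and
# may change between releases -- check them against your Ray version.
# _system_config only takes effect on the process that starts the cluster
# (i.e., the head node), not when connecting to an existing cluster.
ray.init(
    _system_config={
        # Keep at most this many destroyed-actor records in the GCS cache.
        "maximum_gcs_destroyed_actor_cached_count": 10_000,
        # Keep at most this many dead-node records in the GCS cache.
        "maximum_gcs_dead_node_cached_count": 100,
    }
)
```

The same overrides can usually be applied via environment variables on the head node before `ray start` (e.g., `RAY_maximum_gcs_destroyed_actor_cached_count=10000`), again assuming the key names match your Ray version.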

If you want more detail, I can explain the relevant Ray internals and possible workarounds.

Hint: Mention @RunLLM in the post for followups.