Frequent "node marked dead" errors after accumulating many succeeded tasks and dead actors in Ray Train

I’m encountering repeated warnings that nodes are being marked dead due to missed heartbeats, even though the underlying infrastructure appears healthy:

The node with node id: 5d0da9d7f80164c7d61df8addb6a5d80a9d36ec3d8f988c0eef6b545 and address: 192.168.10.175 and node name: 192.168.10.175 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.


Environment & Reproduction:

- The network has been verified to be stable, with no latency or packet loss.
- The issue occurs in the following usage pattern (a minimal sketch of the pattern follows this list):
  - ray.train runs the same training function (train_func) every 5 minutes.
  - Each invocation spawns a new worker on the Ray cluster to execute train_func.
  - Initially, everything runs smoothly for many iterations (hundreds of successful runs).
  - However, once the cluster has accumulated significant metadata (e.g., hundreds of SUCCEEDED tasks, thousands of actors, and hundreds of dead nodes), the “node marked dead” error starts appearing.
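
For concreteness, here is a minimal sketch of that pattern. The Trainer class, worker count, and the way the driver connects to the cluster are assumptions for illustration; the real train_func and its configuration are whatever the job actually uses.

```python
import time

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer  # assumed; use whichever Trainer your job uses

ray.init(address="auto")  # connect to the existing KubeRay cluster


def train_func():
    # placeholder for the real training logic
    ...


while True:
    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=1),  # assumed worker count
    )
    trainer.fit()    # each call schedules a fresh set of Train worker actors
    time.sleep(300)  # wait 5 minutes before the next run
```

Each trainer.fit() call creates new worker actors, so over hundreds of runs the GCS accumulates records for all of the finished tasks and dead actors.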
Observed Behavior:

- During job startup, Ray appears to attempt reconnecting to old IP addresses (e.g., 192.168.10.175) that belonged to workers from early runs and have long since been terminated and marked as dead.
- This suggests that internal metadata (e.g., dead node or actor records) is not being properly cleaned up and may interfere with heartbeat monitoring or node discovery.
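
One way to check whether these dead node and actor records are in fact piling up in the GCS is Ray’s state API (available in Ray 2.x; the field names below are assumptions based on that API and worth verifying against your version):

```python
import ray
from ray.util.state import list_actors, list_nodes

# Connect to the running cluster rather than starting a new one.
ray.init(address="auto")

# Node records still tracked by the GCS, split by liveness.
nodes = list_nodes(limit=10_000)
dead_nodes = [n for n in nodes if n.state == "DEAD"]
print(f"node records: {len(nodes)} total, {len(dead_nodes)} dead")

# Actor records, including destroyed actors kept in the GCS cache.
actors = list_actors(limit=10_000)
dead_actors = [a for a in actors if a.state == "DEAD"]
print(f"actor records: {len(actors)} total, {len(dead_actors)} dead")
```

If the dead counts keep growing across runs as the warnings become more frequent, that supports the metadata-accumulation theory.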
Mitigation Attempts:

I suspected that excessive caching of destroyed actors was contributing to the issue, so I set:
RAY_maximum_gcs_destroyed_actor_cached_count=1000
RAY_DASHBOARD_MAX_ACTORS_TO_CACHE=1000


However, the problem persists even with these limits in place.
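
One thing worth double-checking is where these variables are set: RAY_maximum_gcs_destroyed_actor_cached_count is read by the GCS on the head node and RAY_DASHBOARD_MAX_ACTORS_TO_CACHE by the dashboard process, so on KubeRay they need to be in the head pod’s container env rather than only in the driver or worker environments. The sketch below shows the local/dev equivalent, where ray.init() launches the head from the same process; the dead-node cache variable is an assumption based on Ray’s system-config naming and should be verified against your Ray version.

```python
import os

import ray

# These overrides only take effect in the process that launches the GCS and
# dashboard (the head). On KubeRay, set them in the head pod's env; this
# local sketch only applies when ray.init() starts the head itself.
os.environ["RAY_maximum_gcs_destroyed_actor_cached_count"] = "1000"
os.environ["RAY_DASHBOARD_MAX_ACTORS_TO_CACHE"] = "1000"

# Assumed analogue for dead *node* records (maximum_gcs_dead_node_cached_count
# in Ray's system config); verify the name exists in your Ray version.
os.environ["RAY_maximum_gcs_dead_node_cached_count"] = "100"

ray.init()  # starts a local head that inherits the overrides above
```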

kuberay-operator: v1.4.0
ray: 2.40.0
python: 3.10

This issue is likely due to Ray’s internal metadata (dead nodes, actors, etc.) accumulating in the GCS, which can degrade performance and cause nodes to be mistakenly marked as dead even when the infrastructure is healthy.

Ray broadcasts node status changes (including node deaths) to all live nodes, but dead-node metadata is garbage-collected periodically and should only add some GCS memory overhead rather than hurting cluster performance. However, if the GCS or a raylet becomes overloaded (e.g., by too many actors, tasks, or dead-node records), heartbeats can lag and nodes can be falsely detected as dead, which matches your logs and has been reported by others running long-lived clusters with high metadata churn.

Limiting the actor/metadata cache size helps, but it may not fully resolve the issue if the GCS remains overloaded or if dead-node cleanup cannot keep up with cluster churn. There is no user-exposed command to force immediate garbage collection of dead node or actor records.
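
If the root cause is a temporarily overloaded GCS or raylet rather than genuinely dead nodes, one possible mitigation (an assumption to verify, not something confirmed above) is to make node-death detection more tolerant via Ray’s health-check settings. The names below come from Ray 2.x’s system config and, like the cache limits, must reach the head before the GCS starts (the head pod env on KubeRay):

```python
import os

import ray

# Assumed Ray 2.x system-config knobs for GCS -> raylet health checks (the
# mechanism behind "missed too many heartbeats"). Defaults are roughly a
# 10 s timeout, a 3 s period, and 5 allowed failures; raising them gives an
# overloaded GCS/raylet more slack before a node is declared dead. Verify
# the names against your Ray version before relying on them.
os.environ["RAY_health_check_timeout_ms"] = "30000"
os.environ["RAY_health_check_period_ms"] = "5000"
os.environ["RAY_health_check_failure_threshold"] = "10"

ray.init()  # local/dev head; on KubeRay put these in the head pod's env
```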


What else should I check, and which logs should I provide to help identify the problem?