Repeated log: Dropping sync message with stale version

1. Severity of the issue: (select one)

Medium: Significantly affects my productivity but can find a workaround.

2. Environment:

  • Ray version: 2.52.1
  • Python version: 3.12
  • OS: Linux (Ubuntu 22)
  • Cloud/Infrastructure: none, on physical servers
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected:
  • Actual:

I keep getting the following log (thousands of times in a matter of minutes):

Dropping sync message with stale version. latest version: X, dropped message version: X

The version number is always the same in "latest version" and "dropped message version".

I am guessing it comes from NodeState::ConsumeSyncMessage.

I saw it in a cluster of 16 Ray workers.

  1. Is this valid behavior?
  2. If not, what could cause it?
  3. If it is valid behavior, why is the log at INFO level?

Thanks

Yes, this is valid behavior. The log “Dropping sync message with stale version. latest version: X, dropped message version: X” is emitted when Ray receives a sync message whose version is not newer than the version already stored for that node and message type. Because “not newer” includes an equal version, both numbers in the log are identical when a message with an already-seen version arrives again. This deduplication is intentional and avoids redundant processing and network traffic. The check is implemented in NodeState::ConsumeSyncMessage and the related syncer logic (node_state.cc, ray_syncer_bidi_reactor_base.h).
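
For illustration, the version check described above can be modeled roughly as follows. This is a minimal Python sketch, not Ray's actual C++ code: the class `NodeStateSketch`, its method name, and the node and message-type labels are all hypothetical, and it assumes versions are tracked per (node, message type) pair as described.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStateSketch:
    """Toy model of per-node sync state; NOT Ray's actual implementation."""

    # Maps (node_id, message_type) -> latest version accepted so far.
    latest_versions: dict = field(default_factory=dict)

    def consume_sync_message(self, node_id: str, message_type: str, version: int) -> bool:
        """Return True if the message is accepted, False if it is dropped as stale."""
        key = (node_id, message_type)
        latest = self.latest_versions.get(key)
        # A message is stale when its version is not strictly newer than the
        # stored one, so an equal version (X vs. X) is dropped as well.
        if latest is not None and version <= latest:
            return False
        self.latest_versions[key] = version
        return True
```

In this model, delivering the same versioned message twice accepts the first copy and drops the second, which matches the pattern of identical numbers in the log line.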

The log is at INFO level to provide visibility into dropped (duplicate or stale) messages. Seeing thousands within minutes, however, may indicate excessive redundant sync traffic in your cluster, for example from frequent state broadcasts, network retries, or misconfigured components repeatedly sending the same state. The log level and deduplication logic are by design (node_state.cc), but the volume may warrant investigating your cluster’s sync message patterns.

Would you like more detail on how to reduce this log volume or debug the root cause?

Hint: Mention @RunLLM in the post for followups.

@RunLLM

Yes, how should I debug the root cause for the large volume of this log?

It seems likely that a high volume of “Dropping sync message with stale version” logs indicates excessive or redundant sync messages being sent between nodes. My understanding is that this can be caused by frequent state changes, network instability, or a misbehaving node repeatedly sending the same versioned message (node_state.cc). To debug, you could:

  • Check for network issues or instability that might cause repeated message delivery.
  • Review your cluster for nodes that are rapidly joining/leaving or restarting, as this can trigger sync floods.
  • Enable more detailed logging (e.g., DEBUG level) for the syncer components to trace which node(s) are sending the redundant messages.
  • Use Ray’s dashboard or CLI tools to inspect node health and recent events.
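
One way to start on the steps above is to quantify the drops from the log files before changing any settings. The sketch below tallies the message by its (latest, dropped) version pair. The regular expression is built from the log text quoted in this thread, and the function name `count_stale_drops` is made up for this example. Which log file contains the message is an assumption you would need to verify on your nodes.

```python
import re
from collections import Counter

# Pattern built from the log line quoted in this thread; the surrounding
# prefix (timestamp, level, source file) is ignored.
STALE_DROP = re.compile(
    r"Dropping sync message with stale version\. "
    r"latest version: (-?\d+), dropped message version: (-?\d+)"
)


def count_stale_drops(lines):
    """Count stale-drop log lines, keyed by (latest_version, dropped_version)."""
    counts = Counter()
    for line in lines:
        match = STALE_DROP.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts
```

You could pass it an open log file from each node (by default Ray writes logs under /tmp/ray/session_latest/logs) and compare the totals across nodes to see where the drops concentrate. A few version pairs with very high counts would suggest the same message being re-sent, while many distinct pairs would point to rapid state changes.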

Would you like a step-by-step guide for any of these debugging approaches?
