A high volume of “Dropping sync message with stale version” logs likely means that nodes are exchanging excessive or redundant sync messages. Possible causes include frequent state changes, network instability, or a misbehaving node that keeps resending the same versioned message (see node_state.cc). To debug, you could:
- Check for network issues or instability that might cause repeated message delivery.
- Review your cluster for nodes that are rapidly joining/leaving or restarting, as this can trigger sync floods.
- Enable more detailed logging (e.g., DEBUG level) for the syncer components to trace which node(s) are sending the redundant messages.
- Use Ray’s dashboard or CLI tools to inspect node health and recent events.
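
Before turning on more verbose logging, it may help to measure how often the message appears and whether it comes in bursts. The following is a minimal sketch, not an official Ray tool. It assumes the raylet log is at the default `/tmp/ray/session_latest/logs/raylet.out` and that each line starts with a bracketed timestamp such as `[2024-05-01 12:34:56,789 ...`. Both are assumptions, so adjust the path and the regex to match your setup. The helper name `count_stale_drops` is hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

STALE_MARKER = "Dropping sync message with stale version"

# Assumed timestamp prefix "[YYYY-MM-DD HH:MM:SS,mmm ..."; captures up to the minute.
# Adjust this pattern if your log lines are formatted differently.
MINUTE_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")


def count_stale_drops(lines):
    """Count stale-version drop messages, in total and per minute.

    Lines without a recognizable timestamp are grouped under "unknown".
    """
    total = 0
    per_minute = Counter()
    for line in lines:
        if STALE_MARKER not in line:
            continue
        total += 1
        match = MINUTE_RE.search(line)
        per_minute[match.group(1) if match else "unknown"] += 1
    return total, per_minute


if __name__ == "__main__":
    # Assumed default log location on the node being inspected.
    log_path = Path("/tmp/ray/session_latest/logs/raylet.out")
    if log_path.exists():
        with log_path.open(errors="replace") as f:
            total, per_minute = count_stale_drops(f)
        print(f"{total} stale-version drops in {log_path}")
        for minute, count in sorted(per_minute.items()):
            print(minute, count)
    else:
        print(f"Log file not found: {log_path}")
```

Running this on each node and comparing the per-minute counts can show whether the drops line up with node restarts or network interruptions, or whether they occur at a steady rate.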
Would you like a step-by-step guide for any of these debugging approaches?
Sources:
- https://discuss.ray.io/t/some-confusion-about-raylet-out/14386
- https://github.com/ray-project/ray/issues/53727