Unexpected slowdown in one worker while another worker is calling get_policy() on a ParamServer actor

Environment:

  • Ray version: 2.40.0
  • Python version: 3.11.2
  • OS: Windows 11

Hi everyone,

I’m currently running a Ray-based distributed training setup with one head node and two worker nodes (each on a different physical machine).

Each worker pulls policy parameters from a centralized ParamServer actor running on the head node by calling its get_policy() method, like this:

policy_info = ray.get(param_server.get_policy.remote(client_id))

Here’s the strange behavior I’m observing:

  • Worker A calls get_policy() on the ParamServer (so it is communicating with the head node).
  • Worker B is not calling this method at that time; it is only running a local rollout, with no known communication with the head node.
  • Worker B nevertheless slows down noticeably: its rollout FPS drops significantly, even though nothing should be blocking it.
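
To put numbers on the FPS drop, something like the following could run inside Worker B's rollout loop. This is only a sketch, not my actual code: FpsMeter, its window parameter, and the injectable clock are hypothetical names and are not part of Ray.

```python
import time
from typing import Callable, List, Optional, Tuple


class FpsMeter:
    """Counts rollout steps and reports FPS once per fixed time window."""

    def __init__(self, window: float = 1.0,
                 clock: Callable[[], float] = time.time):
        self.window = window            # window length in seconds
        self.clock = clock              # injectable so it can be faked in tests
        self.window_start = clock()
        self.frames = 0
        # One (window start timestamp, fps) pair per closed window.
        self.samples: List[Tuple[float, float]] = []

    def tick(self, n: int = 1) -> Optional[float]:
        """Record n steps; return the window's FPS if it just closed, else None."""
        self.frames += n
        now = self.clock()
        elapsed = now - self.window_start
        if elapsed < self.window:
            return None
        fps = self.frames / elapsed
        self.samples.append((self.window_start, fps))
        # Start a fresh window at the current time.
        self.window_start = now
        self.frames = 0
        return fps
```

Calling meter.tick() once per environment step and logging meter.samples gives timestamped FPS values that can be lined up against the times Worker A talks to the head node.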

Some additional details:

  • Both workers are on separate machines.
  • The ParamServer actor is currently synchronous (using def, not async def), and does not use max_concurrency or concurrency_groups.
  • I confirmed that Worker B is not calling get_policy() at the same time, and it should not be sending metrics or buffers at that moment either.
  • However, the slowdown on Worker B seems to occur exactly when Worker A is communicating with the head node.
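
One way to check that timing claim more rigorously would be to record when each get_policy() call starts and ends on Worker A and test whether Worker B's slow periods overlap those intervals. The sketch below is an assumption about how that could be done: record_interval and overlaps are hypothetical helpers, and comparing timestamps across machines assumes the clocks are reasonably synchronized (for example via NTP).

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple

Interval = Tuple[float, float]  # (start, end) wall-clock timestamps


@contextmanager
def record_interval(log: List[Interval],
                    clock: Callable[[], float] = time.time) -> Iterator[None]:
    """Append the (start, end) timestamps of the wrapped block to log."""
    start = clock()
    try:
        yield
    finally:
        # Recorded even if the wrapped call raises.
        log.append((start, clock()))


def overlaps(window: Interval, calls: List[Interval]) -> bool:
    """Return True if window intersects any recorded call interval."""
    w_start, w_end = window
    return any(c_start < w_end and w_start < c_end
               for c_start, c_end in calls)
```

On Worker A the existing call would be wrapped as `with record_interval(calls): policy_info = ray.get(param_server.get_policy.remote(client_id))`, and each slow window measured on Worker B would then be passed to overlaps() together with the collected calls list.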

I have a few questions:

  1. Could there be some implicit communication or background thread (e.g., heartbeat, object reference syncing) that causes Worker B to be affected by Worker A’s RPC?
  2. Are there any ways to avoid this slowdown?