Environment:
- Ray version: 2.40.0
- Python version: 3.11.2
- OS: Win11
Hi everyone,
I’m currently running a Ray-based distributed training setup with one head node and two worker nodes (each on a different physical machine).
Each worker pulls policy parameters from a centralized `ParamServer` actor running on the head node via a `get_latest_policy()` method, like this:

```python
policy_info = ray.get(param_server.get_policy.remote(client_id))
```
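For context, here is a minimal sketch of the server side of this setup. The names (`ParamServerImpl`, `update_policy`, the placeholder weights) are illustrative, not my actual code; in the real cluster the class is deployed as a Ray actor on the head node (e.g. `ray.remote(ParamServerImpl).remote()`):

```python
# Minimal sketch of the parameter server (illustrative names and shapes;
# the real weights are policy tensors, and the class is wrapped with
# `ray.remote` and instantiated on the head node).

class ParamServerImpl:
    def __init__(self):
        self._weights = {"w": [0.0]}  # placeholder for policy parameters
        self._version = 0             # bumped on every trainer update

    def update_policy(self, weights):
        self._weights = weights
        self._version += 1

    def get_policy(self, client_id):
        # Synchronous method: with a plain `def` actor, Ray serves one
        # call at a time, so concurrent callers queue up here.
        return {"version": self._version, "weights": self._weights}
```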
Here’s the strange behavior I’m observing:
- When Worker A calls `get_latest_policy()` (communicating with the head node),
- Worker B, which is not calling this method and is just running a local rollout (no known communication with the head node), begins to slow down noticeably: its rollout FPS drops significantly, even though it is not supposed to be blocked.
Some additional details:
- Both workers are on separate machines.
- The `ParamServer` actor is currently synchronous (using `def`, not `async def`), and does not use `max_concurrency` or `concurrency_groups`.
- I confirmed that Worker B is not calling `get_policy()` at the same time, nor should it be sending metrics or buffers at that moment.
- However, the slowdown on Worker B occurs exactly when Worker A is communicating with the head node.
I have a few questions:
- Could there be some implicit communication or background thread (e.g., heartbeat, object reference syncing) that causes Worker B to be affected by Worker A’s RPC?
- Are there any ways to avoid or mitigate this slowdown?