Yep! I think so for the most part - you can read more about it here: ray/doc/source/rllib/external-envs.rst at releases/2.47.1 · ray-project/ray · GitHub
Regarding weight versioning and off-policy data: RLlib’s external environment setup (via the RLlink protocol) supports both on-policy and off-policy data collection. RLlib can train on off-policy samples, though on-policy algorithms like PPO may see some degradation if the policy lag (how many weight updates old the collecting policy is relative to the learner) grows large.
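Here is a minimal sketch of the weight-versioning idea: stamp each trajectory with the version of the weights it was collected under, and let the learner decide how much policy lag it will train on. The names (`weights_version`, `select_trainable`, `MAX_LAG`) are illustrative, not part of the RLlink protocol or the RLlib API.

```python
# Illustrative only: each trajectory carries the version of the weights that
# produced it, so the learner can filter or down-weight stale data.
MAX_LAG = 2  # for PPO-style on-policy training, keep the tolerated lag small


def select_trainable(trajectories, current_version, max_lag=MAX_LAG):
    """Split trajectories into fresh (within lag tolerance) and stale."""
    fresh, stale = [], []
    for traj in trajectories:
        lag = current_version - traj["weights_version"]
        (fresh if lag <= max_lag else stale).append(traj)
    return fresh, stale
```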
For high scalability (100+ games), batching trajectories on the client and only periodically refreshing the client’s copy of the weights is standard practice, and RLlib is designed to handle that kind of asynchronous, parallel data ingestion (this discussion might be helpful even if it is a bit old). A rough sketch of the client-side pattern is below.
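This is only a sketch of the loop described above, not RLlib code: it runs many games, buffers finished episodes, ships them in batches, and refreshes weights on a timer. `Game`, `send_batch`, and `fetch_weights` are hypothetical stand-ins for whatever your RLlink client / game harness provides.

```python
import time

NUM_GAMES = 128          # e.g. 100+ parallel game instances
BATCH_SIZE = 32          # episodes per upload
WEIGHT_REFRESH_S = 30.0  # how often to pull new weights from the server


def run_client(games, policy, send_batch, fetch_weights):
    buffer = []
    last_refresh = time.monotonic()
    weights_version = 0

    while True:
        # Step every game with the current (possibly slightly stale) policy.
        for game in games:
            episode = game.step(policy)   # returns a finished episode or None
            if episode is not None:
                episode["weights_version"] = weights_version
                buffer.append(episode)

        # Ship episodes in batches instead of one at a time.
        if len(buffer) >= BATCH_SIZE:
            send_batch(buffer)
            buffer.clear()

        # Only occasionally sync weights; training tolerates some off-policy lag.
        if time.monotonic() - last_refresh > WEIGHT_REFRESH_S:
            policy, weights_version = fetch_weights()
            last_refresh = time.monotonic()
```

The key design choice is that weight pulls happen on a timer (or per N episodes) rather than per step, which keeps network traffic low while bounding how stale the acting policy can get.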