Severity of the issue: None (I'm just curious or want clarification).
1. Context & Goal

I'm building an RL agent for a complex turn-based card game. The current setup:

- The game engine sends game states (including the possible actions) to a FastAPI server.
- The FastAPI server uses a rule-based system to select actions and returns them to the game engine.

Goal: Replace the rule-based system with an RLlib-trained policy.
2. Proposed Architecture

Based on `env_connecting_to_rllib_w_tcp_client.py`, I plan to:

- Use FastAPI as the "Client" (replacing `_dummy_client`).
- Maintain an RLModule instance in FastAPI for low-latency inference.
- Use the RLlib Trainer for centralized training.
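To make the intended data flow concrete, here is a rough sketch of the inference side as I currently imagine it. The `/act` route, the `GameState` schema, the checkpoint path, and the greedy action selection are my own placeholders, and I'm assuming the new API stack's `RLModule.from_checkpoint()` / `forward_inference()` interface:

```python
# Rough sketch of the FastAPI inference side (new API stack assumed).
# Route name, request schema, checkpoint path, and greedy action selection
# are placeholders, not anything prescribed by RLlib.
import numpy as np
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.rl_module import RLModule

app = FastAPI()

# Loaded once at startup; path is a placeholder.
rl_module = RLModule.from_checkpoint("/path/to/rl_module_checkpoint")

class GameState(BaseModel):
    # Hypothetical payload sent by the game engine.
    game_id: str
    observation: list[float]
    legal_actions: list[int]

@app.post("/act")
def act(state: GameState) -> dict:
    obs = torch.tensor([state.observation], dtype=torch.float32)
    with torch.no_grad():
        out = rl_module.forward_inference({Columns.OBS: obs})
    logits = out[Columns.ACTION_DIST_INPUTS][0].numpy()
    # Mask out illegal actions before choosing one (greedy for simplicity).
    masked = np.full_like(logits, -np.inf)
    masked[state.legal_actions] = logits[state.legal_actions]
    return {"action": int(np.argmax(masked))}
```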
3. Doubts & Questions

Doubt 1: RLModule Sync Workflow

I understand I need:

- An RLModule instance in FastAPI (for inference).
- An RLModule in the RLlib Trainer (for training).

Sync workflow:

1. FastAPI uses `weights_seq_no=v1` for all inferences in a game.
2. At game end, it sends the trajectory (states/actions/rewards) to the RLlib Trainer.
3. The RLlib Trainer trains on the trajectory and pushes updated weights (`weights_seq_no=v2`) to FastAPI.

Is this correct?
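To make step 3 concrete, this is roughly how I picture the receiving end of the weight push, continuing the FastAPI sketch above. The `/update_weights` route, the payload layout, and the (elided) serialization are my own assumptions, not part of RLlink; I'm assuming `RLModule.set_state()` from the new API stack:

```python
# Continues the FastAPI sketch above (reuses `app` and `rl_module`).
# Serialization/transport of the state dict is deliberately left out.
import threading

weights_lock = threading.Lock()
active_seq_no = 0  # weights_seq_no currently used for inference

@app.post("/update_weights")
def update_weights(payload: dict) -> dict:
    global active_seq_no
    new_seq_no = payload["weights_seq_no"]
    with weights_lock:
        if new_seq_no > active_seq_no:  # ignore stale or out-of-order pushes
            rl_module.set_state(payload["rl_module_state"])
            active_seq_no = new_seq_no
    return {"active_weights_seq_no": active_seq_no}
```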
Doubt 2: Weight Versioning with Parallel Games

Imagine two parallel games:

- Both start with `weights_seq_no=v1`.
- Game 1 finishes first → the Trainer updates to `v2` → pushes to FastAPI.
- Game 2 (still running) uses `v1` until completion.

Question:

- Is the RLlib Trainer with the External Environment smart enough to handle off-policy trajectories (generated with older weights)?
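If the Trainer isn't fully robust to that, my fallback idea is to record the `weights_seq_no` each trajectory was generated with, so the training side can drop or down-weight trajectories that are too stale. A sketch of that bookkeeping (all names are my own, not RLlib API):

```python
# My own bookkeeping sketch: tag each trajectory with the weights version
# it was collected under, and filter by staleness on the training side.
from dataclasses import dataclass, field

@dataclass
class GameTrajectory:
    game_id: str
    weights_seq_no: int  # version the actions were sampled with
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    rewards: list = field(default_factory=list)

MAX_STALENESS = 2  # e.g. allow trajectories at most 2 versions behind

def usable_for_training(traj: GameTrajectory, trainer_seq_no: int) -> bool:
    """Is this off-policy trajectory still fresh enough to train on?"""
    return trainer_seq_no - traj.weights_seq_no <= MAX_STALENESS
```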
4. Key Concerns

- Scalability: handling 100+ parallel games with weight versioning. The game engine and FastAPI server may each be handling multiple games in parallel, and I'd like to use trajectories from all of them for training (rough bookkeeping sketch below).
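For the multi-game case, I currently picture simple per-game accumulation keyed by `game_id`, continuing the `GameTrajectory` sketch above; the transport back to the Trainer is stubbed out because I don't yet know how that should look with RLlink:

```python
# Continues the GameTrajectory sketch above: accumulate steps per game_id so
# many parallel games can feed one trainer. Transport to the Trainer is stubbed.
in_flight: dict[str, GameTrajectory] = {}

def send_to_trainer(traj: GameTrajectory) -> None:
    # Placeholder for however the finished trajectory is shipped to RLlib
    # (e.g. over the RLlink connection or a separate queue).
    pass

def on_step(game_id: str, obs, action, reward, seq_no: int) -> None:
    traj = in_flight.setdefault(game_id, GameTrajectory(game_id, seq_no))
    traj.observations.append(obs)
    traj.actions.append(action)
    traj.rewards.append(reward)

def on_game_end(game_id: str) -> None:
    send_to_trainer(in_flight.pop(game_id))
```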
Request

Could you validate this architecture? Specifically:

- Is the RLlink workflow correct?
- How should I handle weight versioning for parallel off-policy trajectories?
Thank you for your incredible work!