RLlib external environment setup for a turn-based game

Severity of the issue: None (I’m just curious or want clarification.)

1. Context & Goal

I’m building an RL agent for a complex turn-based card game. The current setup:

  • Game engine sends game states (including possible actions) to a FastAPI server.

  • The FastAPI server uses a rule-based system to select actions and returns them to the game engine.

Goal: Replace the rule-based system with an RLlib-trained policy.
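
For reference, the current action-selection endpoint looks roughly like this; the endpoint name, payload fields, and rule_based_policy are placeholders rather than my actual code:

```python
# Rough sketch of the current rule-based setup (names are illustrative).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GameState(BaseModel):
    game_id: str
    observation: list[float]   # encoded game state
    legal_actions: list[int]   # action indices the engine currently allows

def rule_based_policy(state: GameState) -> int:
    # Stand-in for the existing hand-written rules.
    return state.legal_actions[0]

@app.post("/act")
def act(state: GameState) -> dict:
    # The game engine POSTs the current state and receives an action back.
    return {"game_id": state.game_id, "action": rule_based_policy(state)}
```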


2. Proposed Architecture

Based on env_connecting_to_rllib_w_tcp_client.py, I plan to:

  1. Use FastAPI as the “Client” (replacing _dummy_client).

  2. Maintain an RLModule instance in FastAPI for low-latency inference (rough sketch after this list).

  3. Use RLlib Trainer for centralized training.
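
Here is a minimal sketch of step 2, serving actions from an RLModule held inside the FastAPI process. It assumes a torch-based module with a discrete action space, that RLModule.from_checkpoint() is available in the Ray version used, and that forward_inference() returns logits under Columns.ACTION_DIST_INPUTS (the common default); the checkpoint path is a placeholder:

```python
# Sketch only: replace rule_based_policy() above with RLModule inference.
import torch
from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.rl_module import RLModule

# Placeholder path: point this at the single-agent RLModule inside an RLlib checkpoint.
rl_module = RLModule.from_checkpoint(
    "/path/to/checkpoint/learner_group/learner/rl_module/default_policy"
)

def select_action(observation: list[float], legal_actions: list[int]) -> int:
    obs = torch.tensor([observation], dtype=torch.float32)  # batch of size 1
    with torch.no_grad():
        out = rl_module.forward_inference({Columns.OBS: obs})
    logits = out[Columns.ACTION_DIST_INPUTS][0]
    # Mask out illegal actions client-side, then act greedily.
    mask = torch.full_like(logits, float("-inf"))
    mask[legal_actions] = 0.0
    return int(torch.argmax(logits + mask).item())
```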


3. Doubts & Questions

Doubt 1: RLModule Sync Workflow

I understand I need:

  • An RLModule instance in FastAPI (for inference).

  • An RLModule in RLlib Trainer (for training).

Sync Workflow:

  1. FastAPI uses weights_seq_no=v1 for all inferences in a game.

  2. At game end, it sends the trajectory (states/actions/rewards) to RLlib Trainer.

  3. RLlib Trainer trains on the trajectory and pushes updated weights (weights_seq_no=v2) to FastAPI.

Is this correct?
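
To make Doubt 1 concrete, this is roughly how I picture the FastAPI side; only RLModule.set_state() is actual RLlib API, while the class, payload shape, and weights_seq_no bookkeeping are my own illustration rather than the RLlink protocol:

```python
# Illustration only: track which weights_seq_no each game was played with,
# so every trajectory sent to the trainer carries its version.
class WeightVersionTracker:
    def __init__(self, rl_module, initial_seq_no: int = 1):
        self.rl_module = rl_module            # the inference RLModule from above
        self.seq_no = initial_seq_no          # version currently used for inference
        self.game_seq_no = {}                 # game_id -> version the game started on

    def on_weights_push(self, rl_module_state: dict, seq_no: int) -> None:
        # RLModule.set_state() loads the pushed weights; the payload format
        # (state dict + integer seq_no) is my own assumption.
        self.rl_module.set_state(rl_module_state)
        self.seq_no = seq_no

    def on_game_start(self, game_id: str) -> None:
        self.game_seq_no[game_id] = self.seq_no

    def tag_episode(self, game_id: str, steps: list) -> dict:
        # Attach the version so the trainer can judge how off-policy this episode is.
        return {"weights_seq_no": self.game_seq_no.pop(game_id), "steps": steps}
```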

Doubt 2: Weight Versioning with Parallel Games

Imagine two parallel games:

  • Both start with weights_seq_no=v1.

  • Game 1 finishes first → Trainer updates to v2 → pushes to FastAPI.

  • Game 2 (still running) uses v1 until completion.

Question:

  1. Is RLlib Trainer with the External Environment smart enough to handle off-policy trajectories (generated with older weights)?

4. Key Concerns

  • Scalability: Handling 100+ parallel games with weight versioning. The game engine and FastAPI server may run many games in parallel, and I’d like to use trajectories from all of them for training.

Request

Could you validate this architecture? Specifically:

  1. Is the RLlink workflow correct?

  2. How should I handle weight versioning for parallel off-policy trajectories?

Thank you for your incredible work! 🙏

Yep! I think so for the most part - you can read more about it here: ray/doc/source/rllib/external-envs.rst at releases/2.47.1 · ray-project/ray · GitHub

Regarding weight versioning and off-policy data: RLlib’s external environment setup (via RLlink) supports both on-policy and off-policy data collection. RLlib can train on these off-policy samples, though on-policy algorithms like PPO may see some degradation if the lag is large.
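
If the lag does become a problem for an on-policy algorithm, a simple guard is to drop episodes collected too many weight versions behind the learner before training on them; a rough sketch using the weights_seq_no tagging from the question (the lag threshold is an arbitrary choice, not an RLlib setting):

```python
# Sketch of a staleness filter applied before a training update.
MAX_LAG = 2  # arbitrary: tolerate episodes up to 2 weight versions old

def filter_stale_episodes(episodes: list[dict], learner_seq_no: int) -> list[dict]:
    """Keep only episodes whose weights_seq_no is within MAX_LAG of the learner's."""
    return [
        ep for ep in episodes
        if learner_seq_no - ep["weights_seq_no"] <= MAX_LAG
    ]
```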

For high scalability (100+ games), batching trajectories and occasionally updating weights on the client is standard practice, and RLlib is designed to handle such asynchronous, parallel data ingestion (this discussion might be helpful even if it is a bit old).
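
A rough sketch of that batching pattern on the trainer side; the queue, batch size, and the two callbacks (train_on_episodes, push_weights_to_client) are placeholders for however you feed external episodes into RLlib and ship weights back, not RLlib APIs:

```python
# Illustration of batching episodes from many parallel games, training once a
# batch is full, then pushing the new weights (with a bumped seq_no) to the client.
import queue

episode_queue: "queue.Queue[dict]" = queue.Queue()  # filled by the endpoint receiving finished episodes
TRAIN_BATCH_EPISODES = 64                           # arbitrary batch size

def training_loop(train_on_episodes, push_weights_to_client) -> None:
    seq_no = 1
    batch: list[dict] = []
    while True:
        batch.append(episode_queue.get())           # blocks until an episode arrives
        if len(batch) >= TRAIN_BATCH_EPISODES:
            train_on_episodes(batch)                # one training update on the batch
            batch.clear()
            seq_no += 1
            push_weights_to_client(seq_no)          # new games start on the new weights
```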