Training RL agents on real games (single instance) - synchronous blocking issue

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity, but I can find a workaround.

2. Environment:

  • Ray version: 2.53.0
  • Python version: 3.10.11
  • OS: Windows 10/11
  • Cloud/Infrastructure: Local machine (no cloud)
  • Other libs/tools (if relevant): gymnasium, mss (screen capture), pymem (memory reading), pydirectinput

3. What happened vs. what you expected:

  • Expected: Continuous gameplay during training, where the agent keeps interacting with the game without visible pauses.
  • Actual: Due to PPO’s synchronous architecture (Sampling → Training → Sampling), the game receives no inputs during the training phase, causing the character to stand still/die.

Context & Architecture

I’m training a PPO agent on SuperTux (a 2D platformer) using RLlib’s new API stack. The environment interacts with a live running game via:

  • Screen capture: MSS library captures game frames → grayscale → resized
  • Memory reading: pymem reads player position, coins, time, etc. directly from game memory
  • Input injection: pydirectinput sends keyboard inputs to the game window

Key Architecture Constraints

┌─────────────────┐     Screen Capture (MSS)      ┌──────────────────┐
│   SuperTux.exe  │ ─────────────────────────────▶│  SuperTuxEnv     │
│   (real game)   │                               │  (Gymnasium)     │
│                 │◀───────────────────────────── │                  │
└─────────────────┘     pydirectinput (keys)      └──────────────────┘
         │                                                  │
         │  Memory Read (pymem)                             │
         └──────────────────────────────────────────────────┘
  • Only ONE game instance can run at a time (no vectorized envs across processes)
  • The game runs in real-time - it doesn’t pause or wait for the agent
  • Using num_env_runners=0 (single process on Windows)

Environment Configuration (EnvCfg)

from dataclasses import dataclass

@dataclass
class EnvCfg:
    width: int = 64
    height: int = 64
    frame_stack: int = 4
    frame_skip: int = 4
    time_cap_s: float = 120.0
    
    # Stagnation detection (agent standing still)
    stagnation_dx_epsilon_px: float = 10.0
    stagnation_window_s: float = 4.0
    
    # Death detection via X-teleport (respawn = sudden X drop)
    enable_x_teleport_death_terminate: bool = True
    x_teleport_rel_drop: float = 0.5
    x_teleport_abs_drop_px: float = 200.0
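For context, the stagnation and x-teleport thresholds above are applied roughly like this. This is a minimal sketch with illustrative helper names (not the actual env code); the real env would convert `stagnation_window_s` to a step count via the capture rate:

```python
def is_stagnant(x_history, window_steps, dx_epsilon_px=10.0):
    """Agent counts as stagnant if its x-range over the last window is tiny."""
    if len(x_history) < window_steps:
        return False
    window = x_history[-window_steps:]
    return max(window) - min(window) < dx_epsilon_px

def is_x_teleport_death(prev_x, cur_x, rel_drop=0.5, abs_drop_px=200.0):
    """A respawn shows up as a sudden large backwards jump in x."""
    drop = prev_x - cur_x
    return drop > abs_drop_px and drop > rel_drop * max(prev_x, 1.0)

# Example: Tux at x=1200 suddenly back at x=50 -> respawn detected.
```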

Training Configuration (TrainCfg)

@dataclass
class TrainCfg:
    algo: str = 'PPO'
    n_steps: int = 2048              # rollout_fragment_length
    train_batch_size: int = 4096
    batch_size: int = 128            # sgd_minibatch_size
    n_epochs: int = 5                # num_sgd_iter
    learning_rate: float = 2.5e-4
    gamma: float = 0.99
    gae_lambda: float = 0.95
    ent_coef: float = 0.05

PPO Config (RLlib new API stack)

config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment("supertux")
    .framework("torch")
    .learners(num_gpus_per_learner=0)  # new-stack replacement for resources(num_gpus=0)
    .rl_module(
        rl_module_spec=RLModuleSpec(
            module_class=SuperTuxRLModule,  # Custom CNN+MLP for Dict obs
            model_config={...},
        ),
    )
    # Single-runner stable defaults
    .env_runners(
        num_env_runners=0,  # Windows single-process
        rollout_fragment_length=2048,
    )
    .training(
        train_batch_size=4096,
        minibatch_size=128,  # new-stack name for sgd_minibatch_size
        num_epochs=5,        # new-stack name for num_sgd_iter
    )
)

The Problem in Detail

The training loop is:

while timesteps < total_timesteps:
    result = algo.train()  # <-- BLOCKING: no env.step() during backprop!
    timesteps = result.get("num_env_steps_sampled_lifetime")

During algo.train():

  1. Rollout phase: env.step() is called 2048 times → Tux runs, game progresses ✓
  2. Training phase: SGD iterations on the batch → NO env.step() calls!
    • Game continues running in real-time
    • No inputs sent → Tux stands still
    • Enemies kill Tux, or he falls into pits
    • Timer runs out

This is fundamentally different from training on emulators (Atari/NES) where the game state is frozen when not stepping.
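To put rough numbers on the gap (all figures illustrative, not measured): if a 2048-step rollout at an effective ~30 steps/s takes about 68 s of wall clock and the SGD phase takes, say, 10 s, the game runs uncontrolled for roughly 13% of every cycle:

```python
def uncontrolled_fraction(rollout_steps, steps_per_sec, train_time_s):
    """Fraction of wall-clock time the game receives no agent input,
    assuming env.step() is paced by the real-time game loop."""
    rollout_time_s = rollout_steps / steps_per_sec
    return train_time_s / (rollout_time_s + train_time_s)

# Illustrative numbers only: 2048 steps at 30 steps/s, 10 s of SGD.
frac = uncontrolled_fraction(2048, 30.0, 10.0)  # ~0.13
```

With `frame_skip=4` the effective step rate is lower, so the rollout takes longer and the frozen fraction shrinks, but the absolute freeze per cycle stays the full training time.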

Symptoms

  • Tux runs for a few seconds (rollout)
  • Tux visibly stops (training phase)
  • CPU/GPU shows activity (backprop)
  • After training: Tux often already dead (x-teleported to start) or timeout
  • Episodes are artificially shortened by the “frozen input” periods

Questions

  1. What is the recommended approach for training on real-time games with a single instance?

    • Is APPO/IMPALA suitable even with just one environment?
    • Are there architectures that decouple sampling from training?
    • Any examples of RLlib used with real games (not emulators)?
  2. Would DreamerV3 have the same problem?

    • Since DreamerV3 trains on imagined trajectories from a world model, does it also block the environment during training?
    • Or is the “world model training” decoupled from “policy acting”?
  3. Possible Solutions I’m Considering

    • Separate inference process: Actor runs continuously, puts transitions in a queue, Learner trains asynchronously (like IMPALA)
    • Smaller rollouts: rollout_fragment_length=128 → shorter pauses, but more frequent
    • Model-based RL: DreamerV3? But unclear if it helps with this specific issue
    • Accept and mitigate: Add “death zone penalty” for repeated deaths at same location (already implemented), but doesn’t solve the root cause
  4. Is there a way to make PPO “semi-asynchronous”?

    • E.g., actor keeps stepping env in background thread while learner trains?
    • I understand this violates strict on-policy semantics, but for real-time games it might be acceptable
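The semi-asynchronous idea in question 4 can be sketched outside RLlib with a plain background thread and a queue. Everything here (`DummyEnv`, `actor_loop`, the timings) is an illustrative stand-in, not an RLlib API:

```python
import queue
import threading
import time

class DummyEnv:
    """Stand-in for the real-time SuperTux env."""
    def __init__(self):
        self.t = 0
    def step(self, action):
        time.sleep(0.001)  # the real game would pace this
        self.t += 1
        return self.t      # fake transition

def actor_loop(env, transitions, stop):
    """Keeps stepping the env so the game never goes input-less."""
    while not stop.is_set():
        transitions.put(env.step(action=0))

transitions = queue.Queue()
stop = threading.Event()
thread = threading.Thread(
    target=actor_loop, args=(DummyEnv(), transitions, stop), daemon=True
)
thread.start()

time.sleep(0.05)  # stand-in for the blocking SGD phase
stop.set()
thread.join()

collected = transitions.qsize()  # transitions gathered *during* "training"
```

In a real setup the learner would drain the queue into a buffer and apply an off-policy correction (e.g. V-trace, as IMPALA does), since the acting policy lags behind the trained one.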

Current Mitigations (not solutions)

I’ve implemented several workarounds to reduce impact:

# Auto-pause when no input for too long
pause_after_idle_s: float = 1.0

# Death penalty with zone tracking (higher penalty for repeated deaths at same spot)
enable_death_zone_penalty: bool = True
death_zone_radius_px: float = 150.0

# Milestone rewards (credit assignment even if agent dies later)
enable_milestone_rewards: bool = True
milestone_interval_px: float = 500.0

But these don’t solve the fundamental issue of the game continuing while training blocks.
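For reference, the death-zone penalty mitigation can be sketched like this (my own minimal 1-D version, not the actual implementation):

```python
def death_zone_penalty(death_xs, new_death_x, radius_px=150.0,
                       base_penalty=-1.0, per_repeat=-0.5):
    """Scale the death penalty by how many prior deaths occurred
    within radius_px of this spot (simple 1-D zone tracking)."""
    repeats = sum(1 for x in death_xs if abs(x - new_death_x) <= radius_px)
    return base_penalty + per_repeat * repeats

# First death at x=500 -> -1.0; third death near the same spot -> -2.0.
```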


Any guidance from the community would be greatly appreciated!

PPO in RLlib is fundamentally synchronous: during the training phase, no actions are sent to the environment, so your real-time game continues running without agent input, causing the agent to “freeze” and die. This is a known limitation for real-time, non-emulated environments where the environment cannot be paused between rollouts and training steps. PPO’s architecture (Sampling → Training → Sampling) is not designed for continuous, asynchronous interaction with a live game loop, and there is no built-in way to make PPO “semi-asynchronous” in RLlib’s current implementation. Reducing rollout_fragment_length only shortens the freeze, but does not eliminate it. According to RLlib PPO documentation, PPO is synchronous by design.

Algorithms like IMPALA and APPO are designed for asynchronous sampling and training, decoupling environment stepping from learning. However, they are intended for distributed setups with multiple environments and may not provide benefit with only a single environment instance, as in your case. DreamerV3, while model-based and able to train on imagined trajectories, still requires real environment interaction for world model updates and is not guaranteed to solve the real-time blocking issue in a single-instance, real-game setup. There are no official RLlib examples for real-time, single-instance games (as opposed to emulators) that avoid this problem. The recommended approach is to implement a custom actor-learner architecture (as in IMPALA) where the actor process steps the environment and queues transitions, while the learner trains asynchronously—this requires significant custom engineering outside of PPO’s standard workflow. See RLlib IMPALA/APPO documentation for more on asynchronous architectures.

Would you like a step-by-step breakdown of how to implement a custom asynchronous actor-learner setup for your use case?


@RunLLM

Thank you for the detailed explanation! Before implementing a full custom actor-learner architecture, I’d like to propose a simpler solution and get feedback on whether this approach is viable:

Proposed Solution: Pause the Game During Training Phase

Since SuperTux (and many other games) support a pause function, I could:

  1. on_sample_end callback: Send the pause key (P) → Game freezes
  2. Training phase: Backprop runs while game is paused (no time passes in-game)
  3. on_train_result callback: Send unpause key (P) → Game resumes
  4. Next rollout: Agent continues from exactly where it left off

Implementation (RLlib Callbacks)

import time

import pydirectinput
from ray.rllib.callbacks.callbacks import RLlibCallback


class TrainingPauseCallback(RLlibCallback):
    """Pauses the game during the training phase."""
    
    def __init__(self):
        super().__init__()
        self._is_paused = False
    
    def on_sample_end(self, *, env_runner, samples, **kwargs):
        """After rollout, BEFORE training -> PAUSE game."""
        if not self._is_paused:
            pydirectinput.press("p")  # Pause key
            self._is_paused = True
            time.sleep(0.1)
    
    def on_train_result(self, *, algorithm, result, **kwargs):
        """After training, BEFORE next rollout -> UNPAUSE game."""
        if self._is_paused:
            pydirectinput.press("p")  # Unpause key
            self._is_paused = False
            time.sleep(0.1)
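Wiring it up would presumably go through the standard `callbacks()` setter on the config (a sketch of the registration, nothing more):

```python
# Register the callback class on the PPOConfig built earlier.
config = config.callbacks(TrainingPauseCallback)
```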

Flow Diagram

[Rollout: Agent plays, game runs] 
    → on_sample_end → Press P → Game PAUSED
    → [Training: SGD iterations, game frozen]
    → on_train_result → Press P → Game UNPAUSED
    → [Next Rollout: Agent continues exactly where it stopped]
    → ...

Questions

  1. Is this approach valid?

    • Does RLlib guarantee that on_sample_end fires after all env.step() calls but before any training computation?
    • Does on_train_result fire after training but before the next sampling begins?
  2. Are there timing edge cases?

    • Could there be a race condition where the game receives the pause key too late (after training already started)?
    • Is there a more precise hook point, like on_before_train or similar?
  3. Any simpler built-in solution?

    • Is there an RLlib config option to insert a callback between sampling and training that I missed?
    • Could sample_async=True or similar help in single-env scenarios?
  4. Alternative: Smaller rollouts + accept freezes?

    • If the pause approach doesn’t work reliably, would you recommend just using very small rollout_fragment_length (e.g., 128) to minimize freeze duration, even if it means more frequent (but shorter) pauses?

This seems much simpler than implementing a full actor-learner architecture. The game already has a pause mechanism, so leveraging it feels like the path of least resistance.

Looking forward to your feedback!

Your “pause during training” approach looks viable in principle: RLlib’s on_sample_end callback fires after all environment steps of a rollout and before training, and on_train_result fires after training and before the next sampling phase. In theory, your callback should therefore pause the game before training starts and unpause it before the next rollout, as you intend. That said, there may be timing edge cases (slight delays in keypress delivery or thread scheduling, for example) that let training begin just before the pause lands, and RLlib does not provide a more granular “on_before_train” hook in the documented callback API. There is also no built-in config option for inserting a callback exactly between sampling and training beyond what you are already using. The sample_async=True option is not documented as a solution for this scenario and is generally not recommended, especially for single-env setups, per the RLlib advanced API docs.

If the pause approach proves unreliable, it might be that using a smaller rollout_fragment_length is the simplest workaround, as you suggested, even though it increases the frequency of pauses. There does not appear to be a simpler built-in solution or a more precise callback for your use case in the current RLlib API. For reference, see the callback structure in RLlib’s callbacks.py and the discussion of rollout/training phases in RLlib advanced API documentation.

Would you like a step-by-step breakdown of how to implement and test this pause/unpause callback approach, or more detail on possible edge cases?


@christina

Can you tell me how I can get around this problem?