1. Severity of the issue: (select one)
Medium: Significantly affects my productivity, but I can find a workaround.
2. Environment:
- Ray version: 2.53.0
- Python version: 3.10.11
- OS: Windows 10/11
- Cloud/Infrastructure: Local machine (no cloud)
- Other libs/tools (if relevant): gymnasium, mss (screen capture), pymem (memory reading), pydirectinput
3. What happened vs. what you expected:
- Expected: Continuous gameplay during training, where the agent keeps interacting with the game without visible pauses.
- Actual: Due to PPO’s synchronous architecture (Sampling → Training → Sampling), the game receives no inputs during the training phase, causing the character to stand still/die.
Context & Architecture
I’m training a PPO agent on SuperTux (a 2D platformer) using RLlib’s new API stack. The environment interacts with a live running game via:
- Screen capture: MSS library captures game frames → grayscale → resized
- Memory reading: pymem reads player position, coins, time, etc. directly from game memory
- Input injection: pydirectinput sends keyboard inputs to the game window
Key Architecture Constraints
```
┌─────────────────┐ Screen Capture (MSS)          ┌──────────────────┐
│  SuperTux.exe   │ ─────────────────────────────▶│   SuperTuxEnv    │
│  (real game)    │                               │   (Gymnasium)    │
│                 │◀───────────────────────────── │                  │
└─────────────────┘ pydirectinput (keys)          └──────────────────┘
        │                                                  │
        │              Memory Read (pymem)                 │
        └──────────────────────────────────────────────────┘
```
- Only ONE game instance can run at a time (no vectorized envs across processes)
- The game runs in real-time - it doesn’t pause or wait for the agent
- Using `num_env_runners=0` (single process on Windows)
Environment Configuration (EnvCfg)
```python
@dataclass
class EnvCfg:
    width: int = 64
    height: int = 64
    frame_stack: int = 4
    frame_skip: int = 4
    time_cap_s: float = 120.0

    # Stagnation detection (agent standing still)
    stagnation_dx_epsilon_px: float = 10.0
    stagnation_window_s: float = 4.0

    # Death detection via X-teleport (respawn = sudden X drop)
    enable_x_teleport_death_terminate: bool = True
    x_teleport_rel_drop: float = 0.5
    x_teleport_abs_drop_px: float = 200.0
```
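For context, here is a guess at what the X-teleport death check does with these thresholds (a sketch; `is_respawn_teleport` is a hypothetical helper, and the exact semantics in my env code may differ):

```python
def is_respawn_teleport(prev_x: float, x: float,
                        rel_drop: float = 0.5,
                        abs_drop_px: float = 200.0) -> bool:
    """Heuristic: a sudden large backwards jump in X means Tux respawned.

    Assumed semantics (hypothetical, not the actual env code): trigger when
    the X drop exceeds the absolute pixel threshold, or exceeds the given
    fraction of the previous X position.
    """
    drop = prev_x - x
    if drop <= 0:
        return False  # moved forward or stood still
    if drop >= abs_drop_px:
        return True
    return prev_x > 0 and (drop / prev_x) >= rel_drop
```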
Training Configuration (TrainCfg)
```python
@dataclass
class TrainCfg:
    algo: str = 'PPO'
    n_steps: int = 2048           # rollout_fragment_length
    train_batch_size: int = 4096
    batch_size: int = 128         # sgd_minibatch_size
    n_epochs: int = 5             # num_sgd_iter
    learning_rate: float = 2.5e-4
    gamma: float = 0.99
    gae_lambda: float = 0.95
    ent_coef: float = 0.05
```
PPO Config (RLlib new API stack)
```python
config = PPOConfig()
config = (
    config
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment("supertux")
    .framework("torch")
    .resources(num_gpus=0)
    .rl_module(
        rl_module_spec=RLModuleSpec(
            module_class=SuperTuxRLModule,  # Custom CNN+MLP for Dict obs
            model_config={...},
        ),
    )
)

# Single-runner stable defaults
config.num_env_runners = 0            # Windows single-process
config.rollout_fragment_length = 2048
config.train_batch_size = 4096
config.sgd_minibatch_size = 128
config.num_sgd_iter = 5
```
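For scale, this config implies the following per-iteration numbers (simple arithmetic; the 60 FPS game speed is an illustrative assumption, not measured):

```python
n_steps = 2048           # rollout_fragment_length
train_batch_size = 4096
minibatch_size = 128
n_epochs = 5
frame_skip = 4
fps = 60.0               # assumed game frame rate (illustrative)

# Real game time covered by one rollout fragment:
rollout_game_time_s = n_steps * frame_skip / fps  # ~136.5 s

# SGD minibatch updates per training phase (the frozen-input window):
updates_per_train_phase = n_epochs * (train_batch_size // minibatch_size)  # 160
```

So every training phase runs 160 minibatch updates while the game keeps playing with no inputs at all.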
The Problem in Detail
The training loop is:
```python
while timesteps < total_timesteps:
    result = algo.train()  # <-- BLOCKING: no env.step() during backprop!
    timesteps = result.get("num_env_steps_sampled_lifetime")
```
During `algo.train()`:
- Rollout phase: `env.step()` is called 2048 times → Tux runs, game progresses ✓
- Training phase: SGD iterations on the batch → NO `env.step()` calls!
  - Game continues running in real-time
  - No inputs sent → Tux stands still
  - Enemies kill Tux, or he falls into pits
  - Timer runs out
This is fundamentally different from training on emulators (Atari/NES), where the game state is frozen whenever the agent isn't stepping.
Symptoms
- Tux runs for a few seconds (rollout)
- Tux visibly stops (training phase)
- CPU/GPU shows activity (backprop)
- After training: Tux often already dead (x-teleported to start) or timeout
- Episodes are artificially shortened by the “frozen input” periods
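These frozen periods can be quantified with a thin wrapper that timestamps each `step()` call and records any gap above a threshold (a sketch; `StepGapMonitor` is a hypothetical helper that only assumes the standard Gymnasium `reset`/`step` interface):

```python
import time

class StepGapMonitor:
    """Wraps any env-like object and records wall-clock gaps between
    consecutive step() calls, to measure how long the game runs
    without inputs (e.g. during PPO's training phase)."""

    def __init__(self, env, gap_threshold_s: float = 0.5):
        self.env = env
        self.gap_threshold_s = gap_threshold_s
        self.last_step_time = None
        self.gaps = []  # wall-clock gaps that exceeded the threshold

    def reset(self, **kwargs):
        self.last_step_time = None
        return self.env.reset(**kwargs)

    def step(self, action):
        now = time.monotonic()
        if self.last_step_time is not None:
            gap = now - self.last_step_time
            if gap > self.gap_threshold_s:
                self.gaps.append(gap)
        self.last_step_time = time.monotonic()
        return self.env.step(action)
```

Logging `monitor.gaps` per episode would show exactly how much real game time is lost to each training phase.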
Questions
1. What is the recommended approach for training on real-time games with a single instance?
   - Is APPO/IMPALA suitable even with just one environment?
   - Are there architectures that decouple sampling from training?
   - Any examples of RLlib used with real games (not emulators)?
2. Would DreamerV3 have the same problem?
   - Since DreamerV3 trains on imagined trajectories from a world model, does it also block the environment during training?
   - Or is the "world model training" decoupled from "policy acting"?
3. Possible solutions I'm considering:
   - Separate inference process: actor runs continuously, puts transitions in a queue, learner trains asynchronously (like IMPALA)
   - Smaller rollouts: `rollout_fragment_length=128` → shorter pauses, but more frequent
   - Model-based RL: DreamerV3? But it's unclear if it helps with this specific issue
   - Accept and mitigate: add a "death zone penalty" for repeated deaths at the same location (already implemented), but this doesn't solve the root cause
4. Is there a way to make PPO "semi-asynchronous"?
   - E.g., the actor keeps stepping the env in a background thread while the learner trains?
   - I understand this violates strict on-policy semantics, but for real-time games it might be acceptable.
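The semi-asynchronous pattern from the last question could be sketched with stdlib threads and a bounded queue (a toy sketch with hypothetical names, not RLlib's API; as noted, the learner would then consume slightly stale, off-policy data):

```python
import queue
import threading

def actor_loop(env, policy, transition_queue, stop_event):
    """Steps the real-time env continuously so the game never freezes,
    regardless of what the learner is doing."""
    obs, _ = env.reset()
    while not stop_event.is_set():
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        try:
            # Never stall the real-time game: drop the transition if the
            # learner is too far behind to keep up.
            transition_queue.put((obs, action, reward, next_obs), timeout=0.01)
        except queue.Full:
            pass
        if terminated or truncated:
            obs, _ = env.reset()
        else:
            obs = next_obs

def learner_loop(transition_queue, train_fn, batch_size, num_batches, stop_event):
    """Pulls transitions and trains; sampling keeps running meanwhile."""
    for _ in range(num_batches):
        batch = [transition_queue.get() for _ in range(batch_size)]
        train_fn(batch)  # data may be slightly stale (off-policy drift)
    stop_event.set()
```

This is essentially a single-env version of the IMPALA actor/learner split; a proper implementation would need importance-sampling corrections (e.g. V-trace) for the policy lag.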
Current Mitigations (not solutions)
I’ve implemented several workarounds to reduce impact:
```python
# Auto-pause when no input for too long
pause_after_idle_s: float = 1.0

# Death penalty with zone tracking (higher penalty for repeated deaths at the same spot)
enable_death_zone_penalty: bool = True
death_zone_radius_px: float = 150.0

# Milestone rewards (credit assignment even if the agent dies later)
enable_milestone_rewards: bool = True
milestone_interval_px: float = 500.0
```
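The auto-pause mitigation could be driven by a small watchdog thread like this (a sketch; `IdleAutoPause` and `send_pause` are hypothetical names, with the actual pause key press left to pydirectinput):

```python
import threading
import time

class IdleAutoPause:
    """Watchdog sketch: if no input has been sent for `pause_after_idle_s`,
    call `send_pause` (e.g. a lambda pressing Esc via pydirectinput).
    `notify_input()` must be called on every injected key."""

    def __init__(self, send_pause, pause_after_idle_s=1.0, poll_s=0.05):
        self.send_pause = send_pause
        self.pause_after_idle_s = pause_after_idle_s
        self.poll_s = poll_s
        self._last_input = time.monotonic()
        self._paused = False
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._watch, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def notify_input(self):
        self._last_input = time.monotonic()
        self._paused = False

    def _watch(self):
        while not self._stop.is_set():
            idle = time.monotonic() - self._last_input
            if idle >= self.pause_after_idle_s and not self._paused:
                self._paused = True  # fire once per idle period
                self.send_pause()
            time.sleep(self.poll_s)
```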
But these don’t solve the fundamental issue: the game keeps running while `algo.train()` blocks.
Any guidance from the community would be greatly appreciated!