Atari wrapper doesn't properly reset _elapsed_steps for the TimeLimit wrapper

1. Severity of the issue:
[x] Medium: Significantly affects my productivity, but I can find a workaround. (Recreating the algo with a fresh env after collecting some episodes should solve this problem for offline data creation, I assume, but I am unsure whether using it causes further problems, e.g. in RLlib's algorithms.)

2. Environment:

  • Ray version: 2.44.1
  • Python version: 3.12.9
  • OS: Linux
  • Cloud/Infrastructure: Slurm Cluster
  • Other libs/tools (if relevant): gymnasium 1.0.0, ale_py: 0.10.2

3. What happened vs. what you expected:

  • Expected: the “wrap_atari_for_new_api_stack” wrapper to reset the elapsed steps whenever reset() is called, so that TimeLimit wrappers function normally (see the short CartPole sketch after these bullets for the contract I expect).

  • Actual: It seems not to reset the elapsed steps when reset() is called, which leads to episodes being truncated indefinitely once the max_episode_steps of a TimeLimit wrapper (the default one has max_episode_steps=108000) has been reached. So one of the wrappers that “wrap_atari_for_new_api_stack” adds seems to be broken (the code below lists them all). What I ask myself is whether this problem also manifests during training of RLlib algorithms with envs created by this wrapper, or whether envs are regularly recreated for workers internally, so that max_episode_steps is never/rarely reached.
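
For reference, here is the contract I expect from gymnasium's TimeLimit, shown as a minimal CartPole sketch (my own illustration, not part of the repro; it pokes at the private _elapsed_steps attribute just like the debug script further down):

import gymnasium as gym
from gymnasium.wrappers import TimeLimit

# Plain gymnasium: TimeLimit zeroes its private step counter on every reset().
env = TimeLimit(gym.make("CartPole-v1"), max_episode_steps=50)

obs, info = env.reset()
for _ in range(3):
    env.step(env.action_space.sample())
print(env._elapsed_steps)  # 3

obs, info = env.reset()
print(env._elapsed_steps)  # 0 -- this is what should survive the Atari wrapper chain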

I used the “ale_py:ALE/Pong-v5” env with “wrap_atari_for_new_api_stack” from ray.rllib.env.wrappers.atari_wrappers to collect offline SingleAgentEpisodes from a pretrained PPO algorithm. However, I ran into big problems with envs truncating after one step after some time, which I thought originated from me restoring the checkpoint of the trained agent in a wrong way and therefore somehow having an incompatible env creator. After losing several days to intensive testing and questioning my sanity, I have come to the conclusion that the error originated from something else: “wrap_atari_for_new_api_stack” not correctly resetting the internal _elapsed_steps at reset(). This manifested very early for me because I passed the arg “max_episode_steps=10000” to the gym env, creating a second TimeLimit wrapper (the first, default one from the Atari wrapper has a limit of 108000, so such problems won't show up early). Maybe there is even another underlying mechanism that prevents max_episode_steps from being reached; I didn't test that because I didn't think it was likely. When the 10000 steps were reached after a few episodes, each following episode would terminate instantly, and I would end up with SingleAgentEpisodes of len=1 and return=0.

The code below shows that, with or without an additional “max_episode_steps=n” kwarg in the gym env, the elapsed steps of the env don't seem to be properly reset. While I saw that this leads to early truncation in the env with two TimeLimit wrappers, I can only assume it does the same for the default TimeLimit wrapper that “wrap_atari_for_new_api_stack” adds; I didn't test that. As I thought that maybe two TimeLimit wrappers are the problem, the third test creates an env with two plain TimeLimit wrappers, and there you can see that the elapsed steps are properly reset every time reset() is called. Note that the elapsed steps also grow during reset(), which I assume is caused by “NoopResetEnv”. The small illustration below shows one way such a failure mode can arise in general; the full debug script follows after it.
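
To make that failure mode concrete, here is a hypothetical, deliberately broken wrapper (my own illustration, not RLlib code): if any wrapper sitting above a TimeLimit steps the wrapped env inside its own reset() without forwarding reset() down the chain, the TimeLimit below keeps accumulating _elapsed_steps across episodes, which matches the numbers in the output further down.

import gymnasium as gym
from gymnasium.wrappers import TimeLimit


class SoftResetWrapper(gym.Wrapper):
    """Hypothetical wrapper whose reset() never reaches the wrapped env's reset()."""

    def reset(self, **kwargs):
        # Deliberately only steps the env instead of calling self.env.reset().
        obs, _, _, _, info = self.env.step(self.env.action_space.sample())
        return obs, info


inner = TimeLimit(gym.make("CartPole-v1"), max_episode_steps=300)
env = SoftResetWrapper(inner)

inner.reset()  # one real reset so that stepping is allowed
for _ in range(3):
    env.reset()                  # these "resets" never reach inner.reset()
    print(inner._elapsed_steps)  # prints 1, 2, 3 ... instead of going back to 0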

#!/usr/bin/env python3
"""
Debug script to understand TimeLimit wrapper reset behavior.
"""

import gymnasium as gym
from ray.rllib.env.wrappers.atari_wrappers import wrap_atari_for_new_api_stack
from gymnasium.wrappers import TimeLimit

def analyze_wrapper_chain(env, name="Environment"):
    """Analyze the wrapper chain and TimeLimit wrappers."""
    print(f"\n=== {name} Wrapper Chain Analysis ===")
    current = env
    depth = 0
    timelimit_wrappers = []
    
    while hasattr(current, 'env'):
        wrapper_type = type(current).__name__
        print(f"Depth {depth}: {wrapper_type}")
        
        if 'TimeLimit' in wrapper_type:
            timelimit_wrappers.append(current)
            print(f"  → TimeLimit wrapper found!")
            if hasattr(current, '_max_episode_steps'):
                print(f"    _max_episode_steps: {current._max_episode_steps}")
            if hasattr(current, '_elapsed_steps'):
                print(f"    _elapsed_steps: {current._elapsed_steps}")
        
        current = current.env
        depth += 1
    
    print(f"Final env at depth {depth}: {type(current).__name__}")
    return timelimit_wrappers

def test_reset_behavior():
    """Test TimeLimit wrapper reset behavior with different configurations."""
    
    print("🔬 TESTING TIMELIMIT WRAPPER RESET BEHAVIOR")
    
    # Test 1: Environment without max_episode_steps
    print("\n" + "="*60)
    print("TEST 1: Environment WITHOUT max_episode_steps")
    print("="*60)
    
    env1_config = {
        "frameskip": 1,
        "full_action_space": False,
        "repeat_action_probability": 0.0,
        "difficulty": 0,
    }
    
    env1 = wrap_atari_for_new_api_stack(
        gym.make("ale_py:ALE/Pong-v5", **env1_config), 
        frameskip=4, 
        framestack=3
    )
    
    wrappers1 = analyze_wrapper_chain(env1, "ENV1 (no max_episode_steps)")
    
    # Test 2: Environment with max_episode_steps
    print("\n" + "="*60)
    print("TEST 2: Environment WITH max_episode_steps")
    print("="*60)
    
    env2_config = {
        "frameskip": 1,
        "full_action_space": False,
        "repeat_action_probability": 0.0,
        "difficulty": 0,
        "max_episode_steps": 300,  # This will add a TimeLimit wrapper
    }
    
    env2 = wrap_atari_for_new_api_stack(
        gym.make("ale_py:ALE/Pong-v5", **env2_config), 
        frameskip=4, 
        framestack=3
    )
    
    wrappers2 = analyze_wrapper_chain(env2, "ENV2 (with max_episode_steps=300)")
    
    # Test reset behavior with detailed step tracking
    print("\n" + "="*60)
    print("DETAILED RESET BEHAVIOR TESTING")
    print("="*60)
    
    for env, wrappers, name in [(env1, wrappers1, "ENV1"), (env2, wrappers2, "ENV2")]:
        print(f"\n{'='*40}")
        print(f"{name} DETAILED TESTING")
        print(f"{'='*40}")
        
        # Test multiple resets (same pattern as Test 3 below)
        for reset_num in range(3):
            print(f"\n--- {name} Reset #{reset_num + 1} ---")
            
            print("BEFORE reset:")
            for j, wrapper in enumerate(wrappers):
                if hasattr(wrapper, '_elapsed_steps'):
                    print(f"  TimeLimit {j}: _elapsed_steps = {wrapper._elapsed_steps}")
            
            obs, info = env.reset()
            
            print("AFTER reset:")
            for j, wrapper in enumerate(wrappers):
                if hasattr(wrapper, '_elapsed_steps'):
                    print(f"  TimeLimit {j}: _elapsed_steps = {wrapper._elapsed_steps}")
            
            # Take 10 steps (same pattern as Test 3 below)
            for step in range(10):
                action = env.action_space.sample()
                obs, reward, terminated, truncated, info = env.step(action)
                
                if terminated or truncated:
                    print(f"  Episode ended at step {step+1}: term={terminated}, trunc={truncated}")
                    break
            
            print("AFTER 10 steps:")
            for j, wrapper in enumerate(wrappers):
                if hasattr(wrapper, '_elapsed_steps'):
                    print(f"  TimeLimit {j}: _elapsed_steps = {wrapper._elapsed_steps}")
        
        env.close()

def test_manual_double_timelimit():
    """Test what happens when we manually add multiple TimeLimit wrappers."""
    print("\n" + "="*60)
    print("TEST 3: MANUALLY ADDING DOUBLE TIMELIMIT WRAPPERS")
    print("="*60)
    
    # Create base environment
    base_env = gym.make("ale_py:ALE/Pong-v5", frameskip=1, full_action_space=False)
    
    # Add first TimeLimit wrapper
    env_with_one_limit = TimeLimit(base_env, max_episode_steps=1000)
    
    # Add second TimeLimit wrapper
    env_with_two_limits = TimeLimit(env_with_one_limit, max_episode_steps=300)
    
    wrappers = analyze_wrapper_chain(env_with_two_limits, "Double TimeLimit")
    
    print("\nTesting reset with double TimeLimit:")
    
    # Test multiple resets
    for reset_num in range(3):
        print(f"\n--- Reset #{reset_num + 1} ---")
        
        print("BEFORE reset:")
        for j, wrapper in enumerate(wrappers):
            if hasattr(wrapper, '_elapsed_steps'):
                print(f"  TimeLimit {j}: _elapsed_steps = {wrapper._elapsed_steps}")
        
        obs, info = env_with_two_limits.reset()
        
        print("AFTER reset:")
        for j, wrapper in enumerate(wrappers):
            if hasattr(wrapper, '_elapsed_steps'):
                print(f"  TimeLimit {j}: _elapsed_steps = {wrapper._elapsed_steps}")
        
        # Take 10 steps
        for step in range(10):
            action = env_with_two_limits.action_space.sample()
            obs, reward, terminated, truncated, info = env_with_two_limits.step(action)
            
            if terminated or truncated:
                print(f"  Episode ended at step {step+1}: term={terminated}, trunc={truncated}")
                break
        
        print("AFTER 10 steps:")
        for j, wrapper in enumerate(wrappers):
            if hasattr(wrapper, '_elapsed_steps'):
                print(f"  TimeLimit {j}: _elapsed_steps = {wrapper._elapsed_steps}")
    
    env_with_two_limits.close()

if __name__ == "__main__":
    test_reset_behavior()
    test_manual_double_timelimit()

The output looks something like this:

🔬 TESTING TIMELIMIT WRAPPER RESET BEHAVIOR

============================================================
TEST 1: Environment WITHOUT max_episode_steps
============================================================
A.L.E: Arcade Learning Environment (version 0.10.2+c9d4b19)
[Powered by Stella]

=== ENV1 (no max_episode_steps) Wrapper Chain Analysis ===
Depth 0: FrameStack
Depth 1: FireResetEnv
Depth 2: EpisodicLifeEnv
Depth 3: NoopResetEnv
Depth 4: MaxAndSkipEnv
Depth 5: NormalizedImageEnv
Depth 6: WarpFrame
Depth 7: TimeLimit
  → TimeLimit wrapper found!
    _max_episode_steps: 108000
    _elapsed_steps: None
Depth 8: OrderEnforcing
Depth 9: PassiveEnvChecker
Final env at depth 10: AtariEnv

============================================================
TEST 2: Environment WITH max_episode_steps
============================================================

=== ENV2 (with max_episode_steps=300) Wrapper Chain Analysis ===
Depth 0: FrameStack
Depth 1: FireResetEnv
Depth 2: EpisodicLifeEnv
Depth 3: NoopResetEnv
Depth 4: MaxAndSkipEnv
Depth 5: NormalizedImageEnv
Depth 6: WarpFrame
Depth 7: TimeLimit
  → TimeLimit wrapper found!
    _max_episode_steps: 108000
    _elapsed_steps: None
Depth 8: TimeLimit
  → TimeLimit wrapper found!
    _max_episode_steps: 300
    _elapsed_steps: None
Depth 9: OrderEnforcing
Depth 10: PassiveEnvChecker
Final env at depth 11: AtariEnv

============================================================
DETAILED RESET BEHAVIOR TESTING
============================================================

========================================
ENV1 DETAILED TESTING
========================================

--- ENV1 Reset #1 ---
BEFORE reset:
  TimeLimit 0: _elapsed_steps = None
AFTER reset:
  TimeLimit 0: _elapsed_steps = 32
AFTER 10 steps:
  TimeLimit 0: _elapsed_steps = 72

--- ENV1 Reset #2 ---
BEFORE reset:
  TimeLimit 0: _elapsed_steps = 72
AFTER reset:
  TimeLimit 0: _elapsed_steps = 84
AFTER 10 steps:
  TimeLimit 0: _elapsed_steps = 124

--- ENV1 Reset #3 ---
BEFORE reset:
  TimeLimit 0: _elapsed_steps = 124
AFTER reset:
  TimeLimit 0: _elapsed_steps = 136
AFTER 10 steps:
  TimeLimit 0: _elapsed_steps = 176

========================================
ENV2 DETAILED TESTING
========================================

--- ENV2 Reset #1 ---
BEFORE reset:
  TimeLimit 0: _elapsed_steps = None
  TimeLimit 1: _elapsed_steps = None
AFTER reset:
  TimeLimit 0: _elapsed_steps = 40
  TimeLimit 1: _elapsed_steps = 40
AFTER 10 steps:
  TimeLimit 0: _elapsed_steps = 80
  TimeLimit 1: _elapsed_steps = 80

--- ENV2 Reset #2 ---
BEFORE reset:
  TimeLimit 0: _elapsed_steps = 80
  TimeLimit 1: _elapsed_steps = 80
AFTER reset:
  TimeLimit 0: _elapsed_steps = 92
  TimeLimit 1: _elapsed_steps = 92
AFTER 10 steps:
  TimeLimit 0: _elapsed_steps = 132
  TimeLimit 1: _elapsed_steps = 132

--- ENV2 Reset #3 ---
BEFORE reset:
  TimeLimit 0: _elapsed_steps = 132
  TimeLimit 1: _elapsed_steps = 132
AFTER reset:
  TimeLimit 0: _elapsed_steps = 144
  TimeLimit 1: _elapsed_steps = 144
AFTER 10 steps:
  TimeLimit 0: _elapsed_steps = 184
  TimeLimit 1: _elapsed_steps = 184

============================================================
TEST 3: MANUALLY ADDING DOUBLE TIMELIMIT WRAPPERS
============================================================

=== Double TimeLimit Wrapper Chain Analysis ===
Depth 0: TimeLimit
  → TimeLimit wrapper found!
    _max_episode_steps: 300
    _elapsed_steps: None
Depth 1: TimeLimit
  → TimeLimit wrapper found!
    _max_episode_steps: 1000
    _elapsed_steps: None
Depth 2: OrderEnforcing
Depth 3: PassiveEnvChecker
Final env at depth 4: AtariEnv

Testing reset with double TimeLimit:

--- Reset #1 ---
BEFORE reset:
  TimeLimit 0: _elapsed_steps = None
  TimeLimit 1: _elapsed_steps = None
AFTER reset:
  TimeLimit 0: _elapsed_steps = 0
  TimeLimit 1: _elapsed_steps = 0
AFTER 10 steps:
  TimeLimit 0: _elapsed_steps = 10
  TimeLimit 1: _elapsed_steps = 10

--- Reset #2 ---
BEFORE reset:
  TimeLimit 0: _elapsed_steps = 10
  TimeLimit 1: _elapsed_steps = 10
AFTER reset:
  TimeLimit 0: _elapsed_steps = 0
  TimeLimit 1: _elapsed_steps = 0
AFTER 10 steps:
  TimeLimit 0: _elapsed_steps = 10
  TimeLimit 1: _elapsed_steps = 10

--- Reset #3 ---
BEFORE reset:
  TimeLimit 0: _elapsed_steps = 10
  TimeLimit 1: _elapsed_steps = 10
AFTER reset:
  TimeLimit 0: _elapsed_steps = 0
  TimeLimit 1: _elapsed_steps = 0
AFTER 10 steps:
  TimeLimit 0: _elapsed_steps = 10
  TimeLimit 1: _elapsed_steps = 10
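
In case someone else hits this while collecting offline data, this is the kind of stop-gap I have in mind (only a sketch, assuming it is acceptable to simply zero the counters after each reset(); it pokes at the private _elapsed_steps attribute via a small helper I wrote for this purpose and is not tested against RLlib's internals):

import gymnasium as gym
from gymnasium.wrappers import TimeLimit
from ray.rllib.env.wrappers.atari_wrappers import wrap_atari_for_new_api_stack


def zero_time_limits(env):
    """Walk the wrapper chain and zero _elapsed_steps on every TimeLimit wrapper."""
    current = env
    while hasattr(current, "env"):
        if isinstance(current, TimeLimit):
            current._elapsed_steps = 0
        current = current.env


env = wrap_atari_for_new_api_stack(
    gym.make("ale_py:ALE/Pong-v5", frameskip=1), frameskip=4, framestack=3
)

obs, info = env.reset()
zero_time_limits(env)  # call this after every env.reset() while collecting episodes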