Offline data tutorial underperforms

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.

2. Environment:

  • Ray version: 2.45.0
  • Python version: 3.12.10
  • OS: linux
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant): torch 2.8.0.dev20250414+cu128

3. What happened vs. what you expected:

  • Expected: The trained algo restored from the checkpoint should perform well
  • Actual: The restored algo performs poorly

Hi !

I am trying to follow along with the tutorial Working with offline data — Ray 2.45.0. I have successfully run the first step (training-an-expert-policy) to train an algo and verified that the algo is indeed trained.

(PPO(env=CartPole-v1; env-runners=2; learners=0; multi-agent=False) pid=3062681) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/xxxx/ray_results/docs_rllib_offline_pretrain_ppo/PPO_CartPole-v1_a12cf_00000_0_2025-05-07_13-30-08/checkpoint_000025)

Trial PPO_CartPole-v1_a12cf_00000 finished iteration 27 at 2025-05-07 13:30:39. Total running time: 31s
╭────────────────────────────────────────────────────────╮
│ Trial PPO_CartPole-v1_a12cf_00000 result               │
├────────────────────────────────────────────────────────┤
│ env_runners/episode_len_mean                    459.26 │
│ env_runners/episode_return_mean                 459.26 │
│ num_env_steps_sampled_lifetime                  108000 │
╰────────────────────────────────────────────────────────╯

But when I reload this checkpoint to perform step 2, Record expert data to local disk, I am getting suboptimal results:

'episode_return_max': 36.0
'agent_episode_returns_mean': {'default_agent': 16.0}

Those numbers stay more or less the same for all 10 iterations, so of course when I go to step 3 (behavioral cloning), the newly trained algo underperforms as well.
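
For reference, the recording loop that produces those numbers looks roughly like this (a sketch of what I run; algo is the PPO algorithm built from the recording config shown further down in this thread, and the exact layout of the results dict may differ between Ray versions):

for i in range(10):
    results = algo.evaluate()
    # The env-runner metrics may be nested under "env_runners" depending
    # on the Ray version; fall back to the top-level dict otherwise.
    metrics = results.get("env_runners", results)
    print(i, metrics.get("episode_return_mean"), metrics.get("episode_return_max"))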

Any idea what could be wrong at that step 2 (Record expert data to local disk)?

Hi Iamgroot,
Could you please post your PPOConfig here? There are a few different reasons I can think of for why this might be happening, but let me know whether you are using the same config as in the tutorial that you linked.

Christina,

I think the configs I used for steps 1 and 2 are the same as in the tutorial.

For step 1 - training:

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(
        lr=0.0003,
        # Run 6 SGD minibatch iterations on a batch.
        num_epochs=6,
        # Weigh the value function loss smaller than
        # the policy loss.
        vf_loss_coeff=0.01,
    )
    .rl_module(
        model_config=DefaultModelConfig(
            fcnet_hiddens=[32],
            fcnet_activation="linear",
            # Share encoder layers between value network
            # and policy.
            vf_share_layers=True,
        ),
    )
)
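
For completeness, step 1 was launched with Tune roughly like this (a paraphrase; the stop threshold and checkpoint settings here are my approximation of the tutorial, not an exact copy):

from ray import train, tune

# Run PPO until the mean episode return is high enough, checkpoint along
# the way, and keep the best checkpoint for the recording step.
tuner = tune.Tuner(
    "PPO",
    param_space=config,
    run_config=train.RunConfig(
        name="docs_rllib_offline_pretrain_ppo",
        stop={"env_runners/episode_return_mean": 450.0},
        checkpoint_config=train.CheckpointConfig(
            checkpoint_frequency=1,
            checkpoint_at_end=True,
        ),
    ),
)
results = tuner.fit()
best_checkpoint = results.get_best_result(
    metric="env_runners/episode_return_mean", mode="max"
).checkpoint.path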

For step 2:

config = (
    PPOConfig()
    # The environment needs to be specified.
    .environment(
        env="CartPole-v1",
    )
    # Make sure to sample complete episodes because
    # you want to record RLlib's episode objects.
    .env_runners(
        batch_mode="complete_episodes",
    )
    # Set up 5 evaluation `EnvRunners` for recording.
    # Sample 50 episodes in each evaluation rollout.
    .evaluation(
        evaluation_num_env_runners=5,
        evaluation_duration=50,
        evaluation_duration_unit="episodes",
    )
    # Use the checkpointed expert policy from the preceding PPO training.
    # Note, we have to use the same `model_config` as
    # the one with which the expert policy was trained, otherwise
    # the module state can't be loaded.
    .rl_module(
        model_config=DefaultModelConfig(
            fcnet_hiddens=[32],
            fcnet_activation="linear",
            # Share encoder layers between value network
            # and policy.
            vf_share_layers=True,
        ),
    )
    # Define the output path and format. In this example you
    # want to store data directly in RLlib's episode objects.
    # Each Parquet file should hold no more than 25 episodes.
    .offline_data(
        output=data_path,
        output_write_episodes=True,
        output_max_rows_per_file=25,
        # output_write_episodes=False,
        # output_max_rows_per_file=500,
    )
)
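
As an extra check on my side (not part of the tutorial), the recorded files can be read back with Ray Data to confirm that something sensible lands in data_path, e.g.:

import ray

# Read the recorded Parquet files back and inspect row count and schema.
ds = ray.data.read_parquet(data_path)
print("Recorded rows:", ds.count())
print(ds.schema())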

The checkpoint was reloaded with:

from ray.rllib.core import COMPONENT_RL_MODULE

# Build the algorithm.
algo = config.build_algo()
# Load now the PPO-trained `RLModule` to use in recording.
algo.restore_from_path(
    best_checkpoint,
    # Load only the `RLModule` component here.
    component=COMPONENT_RL_MODULE,
)

best_checkpoint is the path to the best checkpoint on my drive: '/home/xxxxx/ray_results/docs_rllib_offline_pretrain_ppo/PPO_CartPole-v1_a12cf_00000_0_2025-05-07_13-30-08/checkpoint_000025'
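
If the evaluation numbers still look wrong after restoring, one more sanity check I could run (my own sketch, not from the tutorial; it assumes the default PPO RLModule returns "action_dist_inputs" logits from forward_inference, which may vary between RLlib versions) is a single greedy rollout with the restored module, bypassing the evaluation EnvRunners entirely:

import gymnasium as gym
import numpy as np
import torch

# Run one greedy episode locally with the restored module to see whether
# it behaves like the expert, independent of the evaluation EnvRunners.
env = gym.make("CartPole-v1")
rl_module = algo.get_module()  # default RLModule on the local process
obs, _ = env.reset()
done, episode_return = False, 0.0
while not done:
    batch = {"obs": torch.from_numpy(np.asarray([obs], dtype=np.float32))}
    # Assumption: the default PPO module returns action-distribution logits
    # under the "action_dist_inputs" key.
    logits = rl_module.forward_inference(batch)["action_dist_inputs"][0]
    action = int(torch.argmax(logits).item())
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    episode_return += reward
print("Greedy rollout return:", episode_return)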