1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.
2. Environment:
- Ray version: 2.45.0
- Python version: 3.12.10
- OS: Linux
- Cloud/Infrastructure:
- Other libs/tools (if relevant): torch 2.8.0.dev20250414+cu128
3. What happened vs. what you expected:
- Expected: An algorithm restored from a trained checkpoint should perform as well as it did at training time (episode return ~459)
- Actual: The restored algorithm performs poorly (episode return ~16)
Hi!
I am trying to follow the tutorial Working with offline data — Ray 2.45.0. I have successfully run the first step (training-an-expert-policy) to train an algorithm and verified that it is indeed trained:
(PPO(env=CartPole-v1; env-runners=2; learners=0; multi-agent=False) pid=3062681) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/xxxx/ray_results/docs_rllib_offline_pretrain_ppo/PPO_CartPole-v1_a12cf_00000_0_2025-05-07_13-30-08/checkpoint_000025)
Trial PPO_CartPole-v1_a12cf_00000 finished iteration 27 at 2025-05-07 13:30:39. Total running time: 31s
╭─────────────────────────────────────────────────────╮
│ Trial PPO_CartPole-v1_a12cf_00000 result            │
├─────────────────────────────────────────────────────┤
│ env_runners/episode_len_mean                 459.26 │
│ env_runners/episode_return_mean              459.26 │
│ num_env_steps_sampled_lifetime               108000 │
╰─────────────────────────────────────────────────────╯
But when I reload this checkpoint to perform step 2 (Record expert data to local disk), I get suboptimal results:
'episode_return_max': 36.0
'agent_episode_returns_mean': {'default_agent': 16.0}
Those numbers stay more or less the same across all 10 recording iterations. So when I get to step 3 (behavioral cloning), the newly trained algorithm naturally underperforms as well.
Any idea what could be wrong at step 2 (Record expert data to local disk)?
Hi Iamgroot,
Can you please post your PPOConfig here? There are a few different reasons I can think of for why this might be happening, but let me know if you are using the same config as in the tutorial you linked.
Christina,
I think the configs I used for steps 1 and 2 are the same as in the tutorial.
For step 1 (training):
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(
        lr=0.0003,
        # Run 6 SGD minibatch iterations on a batch.
        num_epochs=6,
        # Weigh the value function loss smaller than
        # the policy loss.
        vf_loss_coeff=0.01,
    )
    .rl_module(
        model_config=DefaultModelConfig(
            fcnet_hiddens=[32],
            fcnet_activation="linear",
            # Share encoder layers between value network
            # and policy.
            vf_share_layers=True,
        ),
    )
)
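This config was then run with Tune as in the tutorial. A rough sketch from memory (the stop criterion and checkpoint settings here are assumptions, not copied from my actual script):
from ray import tune

# Sketch of the step-1 training launch (stop threshold assumed).
tuner = tune.Tuner(
    "PPO",
    param_space=config,
    run_config=tune.RunConfig(
        stop={"env_runners/episode_return_mean": 450.0},
        checkpoint_config=tune.CheckpointConfig(
            checkpoint_frequency=1,
            checkpoint_at_end=True,
        ),
    ),
)
results = tuner.fit()
# Pick the best checkpoint for the recording step below.
best_checkpoint = results.get_best_result(
    metric="env_runners/episode_return_mean", mode="max"
).checkpoint.path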
For step 2:
config = (
    PPOConfig()
    # The environment needs to be specified.
    .environment(
        env="CartPole-v1",
    )
    # Make sure to sample complete episodes because
    # you want to record RLlib's episode objects.
    .env_runners(
        batch_mode="complete_episodes",
    )
    # Set up 5 evaluation `EnvRunners` for recording.
    # Sample 50 episodes in each evaluation rollout.
    .evaluation(
        evaluation_num_env_runners=5,
        evaluation_duration=50,
        evaluation_duration_unit="episodes",
    )
    # Use the checkpointed expert policy from the preceding PPO training.
    # Note, we have to use the same `model_config` as
    # the one with which the expert policy was trained, otherwise
    # the module state can't be loaded.
    .rl_module(
        model_config=DefaultModelConfig(
            fcnet_hiddens=[32],
            fcnet_activation="linear",
            # Share encoder layers between value network
            # and policy.
            vf_share_layers=True,
        ),
    )
    # Define the output path and format. In this example you
    # want to store data directly in RLlib's episode objects.
    # Each Parquet file should hold no more than 25 episodes.
    .offline_data(
        output=data_path,
        output_write_episodes=True,
        output_max_rows_per_file=25,
        # output_write_episodes=False,
        # output_max_rows_per_file=500,
    )
)
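With this config, the recording itself is just a loop of evaluation rollouts, run after the algorithm is built and the checkpoint is restored (shown right below). A sketch of how I run it, matching the 10 iterations mentioned above:
# Run the recording: each `evaluate()` call samples 50 complete episodes
# on the 5 evaluation EnvRunners and writes them to `data_path`.
for i in range(10):
    print(f"Iteration {i + 1}:")
    results = algo.evaluate()
    # The suboptimal return numbers quoted above come from these dicts.
    print(results)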
The checkpoint was reloaded with:
# Build the algorithm.
algo = config.build_algo()
# Now load the PPO-trained `RLModule` to use in recording.
algo.restore_from_path(
    best_checkpoint,
    # Load only the `RLModule` component here.
    component=COMPONENT_RL_MODULE,
)
best_checkpoint is the path to the best checkpoint on my drive: '/home/xxxxx/ray_results/docs_rllib_offline_pretrain_ppo/PPO_CartPole-v1_a12cf_00000_0_2025-05-07_13-30-08/checkpoint_000025'
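For what it's worth, here is a hypothetical sanity check (not part of the tutorial) I could run to confirm that restore_from_path actually changes the module weights. It assumes the default PPO RLModule's get_state() returns a flat dict of numpy arrays, which may differ across Ray versions:
import numpy as np

# Snapshot the freshly initialized module state, restore, then compare.
before = {k: np.copy(v) for k, v in algo.get_module().get_state().items()}
algo.restore_from_path(best_checkpoint, component=COMPONENT_RL_MODULE)
after = algo.get_module().get_state()
changed = [k for k in before if not np.array_equal(before[k], np.asarray(after[k]))]
print(f"{len(changed)}/{len(before)} weight tensors changed after restore")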