Replay buffer with episodes as storage unit not training

I am trying to train on the sparse-reward FetchPush environment using DDPG with a replay buffer. When I use the PER replay buffer with time steps as the storage unit, training appears to work fine. When I use episodes as the storage unit, num_env_steps_trained reported in the training results stays at zero and it looks as though no training is happening. Am I missing a setting or hyperparameter that would make training work?

Here is the training code that I am using:

from ray import air, tune
from ray.rllib.algorithms.ddpg import DDPGConfig
from ray.rllib.utils.replay_buffers.multi_agent_prioritized_replay_buffer import MultiAgentPrioritizedReplayBuffer

config = (
    DDPGConfig()
    .environment("FetchPush-v2")
    # Store whole episodes in the prioritized replay buffer instead of time steps.
    .training(replay_buffer_config={
        "type": "MultiAgentPrioritizedReplayBuffer",
        "storage_unit": "episodes",
    })
)

tune.Tuner(
    "DDPG",
    param_space=config.to_dict(),
    run_config=air.RunConfig(stop={"training_iteration": 20}),
).fit()

Here is a sample of the output when using episodes for the storage unit:

Training finished iteration 1 at 2023-08-16 20:16:18. Total running time: 3s
╭─────────────────────────────────────────────╮
│ Training result                             │
├─────────────────────────────────────────────┤
│ episodes_total                           20 │
│ num_env_steps_sampled                  1000 │
│ num_env_steps_trained                     0 │
│ sampler_results/episode_len_mean         50 │
│ sampler_results/episode_reward_mean   -47.5 │
╰─────────────────────────────────────────────╯

(DDPG pid=77204) 2023-08-16 20:16:19,247        WARNING deprecation.py:50 -- DeprecationWarning: `ray.rllib.execution.train_ops.multi_gpu_train_one_step` has been deprecated. This will raise an error in the future!
Training finished iteration 2 at 2023-08-16 20:16:20. Total running time: 5s
╭──────────────────────────────────────────────╮
│ Training result                              │
├──────────────────────────────────────────────┤
│ episodes_total                            40 │
│ num_env_steps_sampled                   2000 │
│ num_env_steps_trained                      0 │
│ sampler_results/episode_len_mean          50 │
│ sampler_results/episode_reward_mean   -48.75 │
╰──────────────────────────────────────────────╯

Training finished iteration 3 at 2023-08-16 20:16:22. Total running time: 7s
╭─────────────────────────────────────────────╮
│ Training result                             │
├─────────────────────────────────────────────┤
│ episodes_total                           60 │
│ num_env_steps_sampled                  3000 │
│ num_env_steps_trained                     0 │
│ sampler_results/episode_len_mean         50 │
│ sampler_results/episode_reward_mean   -47.5 │
╰─────────────────────────────────────────────╯

Training finished iteration 4 at 2023-08-16 20:16:24. Total running time: 10s
╭──────────────────────────────────────────────╮
│ Training result                              │
├──────────────────────────────────────────────┤
│ episodes_total                            80 │
│ num_env_steps_sampled                   4000 │
│ num_env_steps_trained                      0 │
│ sampler_results/episode_len_mean          50 │
│ sampler_results/episode_reward_mean   -46.25 │
╰──────────────────────────────────────────────╯

Training finished iteration 5 at 2023-08-16 20:16:26. Total running time: 12s
╭────────────────────────────────────────────╮
│ Training result                            │
├────────────────────────────────────────────┤
│ episodes_total                         100 │
│ num_env_steps_sampled                 5000 │
│ num_env_steps_trained                    0 │
│ sampler_results/episode_len_mean        50 │
│ sampler_results/episode_reward_mean    -46 │
╰────────────────────────────────────────────╯

Ultimately, I want to train with a HER buffer, where I will store full episodes and then generate HER experience at sampling time. Any suggestions for the HER buffer would be welcome too.
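
For reference, the HER scheme I have in mind is roughly: store a full episode, and at sampling time replace the desired goal with an achieved goal from a later step of the same episode, then recompute the sparse reward against that substituted goal. Below is a minimal sketch of that relabeling step, independent of RLlib's buffer API; the transition-dict layout and the her_relabel_episode name are just assumptions for illustration, while compute_reward(achieved_goal, desired_goal, info) is the standard goal-conditioned method exposed by the Fetch environments.

import random

def her_relabel_episode(episode, compute_reward, k=4):
    """Hindsight relabeling with the 'future' strategy (illustrative sketch).

    `episode` is assumed to be a list of transition dicts with keys
    "obs", "action", "reward", "next_obs", "done", where each obs is the
    goal-conditioned dict ({"observation", "achieved_goal", "desired_goal"})
    returned by the Fetch environments. `compute_reward` is the environment's
    compute_reward(achieved_goal, desired_goal, info) method.
    """
    relabeled = []
    for t, trans in enumerate(episode):
        # Always keep the original transition.
        relabeled.append(trans)
        # Add k hindsight transitions: pick an achieved goal from a later
        # step of this episode and pretend it was the desired goal.
        for _ in range(k):
            future = random.randint(t, len(episode) - 1)
            new_goal = episode[future]["next_obs"]["achieved_goal"]
            relabeled.append({
                "obs": {**trans["obs"], "desired_goal": new_goal},
                "action": trans["action"],
                "next_obs": {**trans["next_obs"], "desired_goal": new_goal},
                "done": trans["done"],
                # Recompute the sparse reward against the substituted goal.
                "reward": compute_reward(
                    trans["next_obs"]["achieved_goal"], new_goal, {}
                ),
            })
    return relabeled

The idea would be to call something like this from a custom buffer's sample() (or when adding episodes), so the stored unit stays a full episode and the hindsight transitions are generated on the fly.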

Task severity:

  • High: It blocks me from completing my task.

I found the solution to my issue. When I set the config to include .rollouts(batch_mode="complete_episodes"), num_env_steps_trained is no longer zero and training occurs. This appears to be because the default batch_mode="truncate_episodes" can hand the buffer sample batches containing partial episodes, which an episode-based buffer cannot store, whereas "complete_episodes" guarantees whole episodes.

The updated script is as follows:

from ray import air, tune
from ray.rllib.algorithms.ddpg import DDPGConfig
from ray.rllib.utils.replay_buffers.multi_agent_prioritized_replay_buffer import MultiAgentPrioritizedReplayBuffer

config = (
    DDPGConfig()
    .environment("FetchPush-v2")
    # Store whole episodes in the prioritized replay buffer instead of time steps.
    .training(replay_buffer_config={
        "type": "MultiAgentPrioritizedReplayBuffer",
        "storage_unit": "episodes",
    })
    # Sample complete episodes so the episode-based buffer receives whole episodes.
    .rollouts(batch_mode="complete_episodes")
)

tune.Tuner(
    "DDPG",
    param_space=config.to_dict(),
    run_config=air.RunConfig(stop={"training_iteration": 20}),
).fit()

A sample of the results is shown below:

Training finished iteration 1 at 2023-08-17 20:47:50. Total running time: 3s
╭─────────────────────────────────────────────╮
│ Training result                             │
├─────────────────────────────────────────────┤
│ episodes_total                           20 │
│ num_env_steps_sampled                  1000 │
│ num_env_steps_trained                     0 │
│ sampler_results/episode_len_mean         50 │
│ sampler_results/episode_reward_mean   -44.9 │
╰─────────────────────────────────────────────╯

(DDPG pid=1264) 2023-08-17 20:47:50,903 WARNING deprecation.py:50 -- DeprecationWarning: `ray.rllib.execution.train_ops.multi_gpu_train_one_step` has been deprecated. This will raise an error in the future!
Training finished iteration 2 at 2023-08-17 20:47:55. Total running time: 8s
╭──────────────────────────────────────────────╮
│ Training result                              │
├──────────────────────────────────────────────┤
│ episodes_total                            40 │
│ num_env_steps_sampled                   2000 │
│ num_env_steps_trained                 128000 │
│ sampler_results/episode_len_mean          50 │
│ sampler_results/episode_reward_mean   -44.95 │
╰──────────────────────────────────────────────╯

Training finished iteration 3 at 2023-08-17 20:48:03. Total running time: 17s
╭────────────────────────────────────────────────╮
│ Training result                                │
├────────────────────────────────────────────────┤
│ episodes_total                              60 │
│ num_env_steps_sampled                     3000 │
│ num_env_steps_trained                   384000 │
│ sampler_results/episode_len_mean            50 │
│ sampler_results/episode_reward_mean   -44.8667 │
╰────────────────────────────────────────────────╯

Training finished iteration 4 at 2023-08-17 20:48:11. Total running time: 25s
╭────────────────────────────────────────────────╮
│ Training result                                │
├────────────────────────────────────────────────┤
│ episodes_total                              80 │
│ num_env_steps_sampled                     4000 │
│ num_env_steps_trained                   640000 │
│ sampler_results/episode_len_mean            50 │
│ sampler_results/episode_reward_mean   -46.0875 │
╰────────────────────────────────────────────────╯

Trial status: 1 RUNNING
Current time: 2023-08-17 20:48:16. Total running time: 30s
Logical resource usage: 1.0/8 CPUs, 0/0 GPUs
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                      status       iter     total time (s)     ts     reward     episode_reward_max     episode_reward_min     episode_len_mean     episodes_this_iter │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ DDPG_FetchPush-v2_3ad79_00000   RUNNING         4            23.1385   4000   -46.0875                      0                    -50                   50                     20 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Training finished iteration 5 at 2023-08-17 20:48:20. Total running time: 33s
╭──────────────────────────────────────────────╮
│ Training result                              │
├──────────────────────────────────────────────┤
│ episodes_total                           100 │
│ num_env_steps_sampled                   5000 │
│ num_env_steps_trained                 896000 │
│ sampler_results/episode_len_mean          50 │
│ sampler_results/episode_reward_mean   -46.37 │
╰──────────────────────────────────────────────╯