I am trying to train DDPG with a replay buffer on the sparse-reward FetchPush environment. When I use the PER replay buffer with time steps as the storage unit, it appears to train fine. When I use episodes as the storage unit, num_env_steps_trained in the training results stays at zero and it appears that no training is happening. Am I missing a setting or hyperparameter that would make training work?
Here is the training code that I am using:
from ray import air, tune
from ray.rllib.algorithms.ddpg import DDPGConfig
from ray.rllib.utils.replay_buffers.multi_agent_prioritized_replay_buffer import MultiAgentPrioritizedReplayBuffer

config = (
    DDPGConfig()
    .environment("FetchPush-v2")
    .training(
        replay_buffer_config={
            "type": "MultiAgentPrioritizedReplayBuffer",
            "storage_unit": "episodes",
        },
    )
)

tune.Tuner(
    "DDPG",
    param_space=config.to_dict(),
    run_config=air.RunConfig(stop={"training_iteration": 20}),
).fit()
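For comparison, this is a minimal sketch of the time-step variant that does train for me (same imports as above; the only difference is the storage unit):

config_timesteps = (
    DDPGConfig()
    .environment("FetchPush-v2")
    .training(
        replay_buffer_config={
            "type": "MultiAgentPrioritizedReplayBuffer",
            "storage_unit": "timesteps",  # default storage unit; num_env_steps_trained grows as expected
        },
    )
)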
Here is a sample of the output when using episodes for the storage unit:
Training finished iteration 1 at 2023-08-16 20:16:18. Total running time: 3s
╭─────────────────────────────────────────────╮
│ Training result │
├─────────────────────────────────────────────┤
│ episodes_total 20 │
│ num_env_steps_sampled 1000 │
│ num_env_steps_trained 0 │
│ sampler_results/episode_len_mean 50 │
│ sampler_results/episode_reward_mean -47.5 │
╰─────────────────────────────────────────────╯
(DDPG pid=77204) 2023-08-16 20:16:19,247 WARNING deprecation.py:50 -- DeprecationWarning: `ray.rllib.execution.train_ops.multi_gpu_train_one_step` has been deprecated. This will raise an error in the future!
Training finished iteration 2 at 2023-08-16 20:16:20. Total running time: 5s
╭──────────────────────────────────────────────╮
│ Training result │
├──────────────────────────────────────────────┤
│ episodes_total 40 │
│ num_env_steps_sampled 2000 │
│ num_env_steps_trained 0 │
│ sampler_results/episode_len_mean 50 │
│ sampler_results/episode_reward_mean -48.75 │
╰──────────────────────────────────────────────╯
Training finished iteration 3 at 2023-08-16 20:16:22. Total running time: 7s
╭─────────────────────────────────────────────╮
│ Training result │
├─────────────────────────────────────────────┤
│ episodes_total 60 │
│ num_env_steps_sampled 3000 │
│ num_env_steps_trained 0 │
│ sampler_results/episode_len_mean 50 │
│ sampler_results/episode_reward_mean -47.5 │
╰─────────────────────────────────────────────╯
Training finished iteration 4 at 2023-08-16 20:16:24. Total running time: 10s
╭──────────────────────────────────────────────╮
│ Training result │
├──────────────────────────────────────────────┤
│ episodes_total 80 │
│ num_env_steps_sampled 4000 │
│ num_env_steps_trained 0 │
│ sampler_results/episode_len_mean 50 │
│ sampler_results/episode_reward_mean -46.25 │
╰──────────────────────────────────────────────╯
Training finished iteration 5 at 2023-08-16 20:16:26. Total running time: 12s
╭────────────────────────────────────────────╮
│ Training result │
├────────────────────────────────────────────┤
│ episodes_total 100 │
│ num_env_steps_sampled 5000 │
│ num_env_steps_trained 0 │
│ sampler_results/episode_len_mean 50 │
│ sampler_results/episode_reward_mean -46 │
╰────────────────────────────────────────────╯
Ultimately, I want to train with a HER (hindsight experience replay) buffer, where I would store full episodes and then generate HER experience at sampling time. Any suggestions for the HER buffer would be welcome too.
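For context, this is roughly the shape I have in mind for the HER buffer. It is only a minimal sketch: it assumes RLlib's ReplayBuffer base class with episode storage, and relabel_episode_with_her is a hypothetical placeholder, since the actual goal relabelling depends on how FetchPush's observation / achieved_goal / desired_goal dict ends up encoded in the sample batches:

import random

from ray.rllib.policy.sample_batch import SampleBatch, concat_samples
from ray.rllib.utils.replay_buffers.replay_buffer import ReplayBuffer, StorageUnit


def relabel_episode_with_her(episode: SampleBatch) -> SampleBatch:
    # Hypothetical placeholder: substitute goals achieved later in the episode
    # for the original desired goals and recompute the sparse rewards.
    # The real implementation depends on how the goal-conditioned observations
    # are flattened by RLlib's preprocessors.
    return episode


class HindsightReplayBuffer(ReplayBuffer):
    """Stores full episodes; generates HER transitions at sampling time."""

    def __init__(self, capacity: int = 10000, **kwargs):
        super().__init__(capacity=capacity, storage_unit=StorageUnit.EPISODES, **kwargs)

    def sample(self, num_items: int, **kwargs) -> SampleBatch:
        # Pick stored episodes at random, relabel them with hindsight goals,
        # and return the relabelled transitions as a single SampleBatch.
        episodes = [random.choice(self._storage) for _ in range(num_items)]
        relabeled = [relabel_episode_with_her(ep) for ep in episodes]
        return concat_samples(relabeled)

The idea would be to point replay_buffer_config["type"] at a class like this so episodes are stored whole and HER transitions are only generated when the learner samples.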
Task severity:
- High: It blocks me from completing my task.