DreamerV3 hangs when using a loop for multiple training sessions

kapibarek · March 5, 2024, 9:31pm

Hi,
I’ve been testing the DreamerV3 algorithm with a custom Gymnasium environment inside a Jupyter Notebook. Training a single instance runs smoothly, but as soon as I run multiple training sessions in a loop, the algorithm hangs. For example:

from ray.rllib.algorithms.dreamerv3 import DreamerV3, DreamerV3Config

num_sessions = 10

for session in range(num_sessions):
    print(f"Starting training session {session+1}")

    config = (
        DreamerV3Config()
        .environment("test_env_v0")
        .training(
            model_size="XS",
            training_ratio=1,
            model={
                "batch_size_B": 1,
                "batch_length_T": 1,
                "horizon_H": 1,
                "gamma": 0.997,
                "model_size": "XS",
            },
        )
    )

    algo = config.build()

    for i in range(100):
        result = algo.train()
        print(f"Iteration: {i+1} Timesteps total: {result['agent_timesteps_total']} Steps trained: {result['num_env_steps_trained']} Episodes total: {result['episodes_total']}")

Starting training session 1
... 
Iteration: 98 Timesteps total: 99329 Steps trained: 100352 Episodes total: 4138
Iteration: 99 Timesteps total: 100353 Steps trained: 101376 Episodes total: 4181
Iteration: 100 Timesteps total: 101377 Steps trained: 102400 Episodes total: 4224

Starting training session 2
...
Iteration: 10 Timesteps total: 9217 Steps trained: 10240 Episodes total: 384
Iteration: 11 Timesteps total: 10241 Steps trained: 11264 Episodes total: 426
Iteration: 12 Timesteps total: 11265 Steps trained: 12288 Episodes total: 469
Iteration: 13 Timesteps total: 12289 Steps trained: 13312 Episodes total: 512
Iteration: 14 Timesteps total: 13313 Steps trained: 14336 Episodes total: 554
Iteration: 15 Timesteps total: 14337 Steps trained: 15360 Episodes total: 597
Iteration: 16 Timesteps total: 15361 Steps trained: 16384 Episodes total: 640

The second session is hanging after 16 iterations.

What is the default workflow to run multiple training sessions? Should one just avoid using Jupyter Notebooks?

kapibarek · March 12, 2024, 4:26pm

Is anyone else experiencing something similar?
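
One pattern that may be worth trying is to release each algorithm explicitly before building the next one. This is a minimal sketch, not a confirmed fix: nothing in this thread establishes the cause of the hang, and the idea that leftover workers or buffers from the previous session are involved is an assumption. `run_sessions` and `build_algo` are hypothetical names introduced here for illustration. The only RLlib calls relied on are `train()` and `stop()` on the built algorithm.

```python
from typing import Any, Callable, Dict, List


def run_sessions(
    build_algo: Callable[[], Any],
    num_sessions: int,
    iterations: int,
) -> List[Dict]:
    """Run independent training sessions, stopping each algorithm when done.

    build_algo is a hypothetical factory, e.g. `lambda: config.build()`.
    Returns the last result dict of each session.
    """
    final_results = []
    for session in range(num_sessions):
        print(f"Starting training session {session + 1}")
        algo = build_algo()
        result: Dict = {}
        try:
            for _ in range(iterations):
                result = algo.train()
        finally:
            # Assumption: resources held by the previous session contribute
            # to the hang. stop() asks the algorithm to release them, and
            # the finally block runs even if train() raises.
            algo.stop()
        final_results.append(result)
    return final_results
```

With the config from the post, this could be called as `run_sessions(lambda: config.build(), num_sessions=10, iterations=100)`. A heavier variant of the same idea is to call `ray.shutdown()` and then `ray.init()` between sessions.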
