Stopping condition in Tune confusion

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.


I would like to ask: I have the following very simple script that runs PPO as independent policies on PettingZoo environments:

import supersuit as ss
from pettingzoo.butterfly import cooperative_pong_v5
from ray import tune
from ray.rllib.env import PettingZooEnv
from ray.tune.registry import register_env

def env_creator_coop_pong(args):
    env = cooperative_pong_v5.env()
    env = ss.max_observation_v0(env, 2)
    env = ss.sticky_actions_v0(env, repeat_action_probability=0.25)
    env = ss.frame_skip_v0(env, 4)
    env = ss.resize_v0(env, 84, 84)
    env = ss.frame_stack_v1(env, 4)
    return PettingZooEnv(env)

env = env_creator_coop_pong({})
register_env("env_creator_coop_pong", env_creator_coop_pong)

analysis = tune.run(
    "PPO",
    stop={"episodes_total": 1},
    # stop={"timesteps_total": 5000},
    config={
        # Environment specific.
        "env": "env_creator_coop_pong",
        # General
        "num_gpus": 0,
        "num_workers": 4,
        "num_envs_per_worker": 8,
        # "learning_starts": 1000,
        # "buffer_size": int(1e5),
        "compress_observations": True,
        "rollout_fragment_length": 20,
        "train_batch_size": 512,
        "gamma": 0.99,
        # "n_step": 3,
        "lr": 0.0001,
        # "prioritized_replay_alpha": 0.5,
        # "final_prioritized_replay_beta": 1.0,
        # "target_network_update_freq": 50000,
        "timesteps_per_iteration": 25000,
        # Method specific
        "multiagent": {
            "policies": set(env.agents),
            "policy_mapping_fn": (lambda agent_id, episode, **kwargs: agent_id),
        },
    },
)

This is for simple benchmarking purposes, yet it runs for more than 15 minutes! I do not understand: doesn't my stop condition "episodes_total": 1 force the algorithm to quit after running a single episode? A single episode cannot possibly take 15 minutes, so I am afraid I am misunderstanding how the stop condition works.

I also ran the same experiment with a different MPE environment using "episodes_total": 10, but after inspecting hist_stats and "episodes_this_iter" in the output trials, I see 500 episodes. What am I missing here?

@Constantine_Bardis ,

this is probably related to the distributed sampling and the configuration you have set. What usually happens in RLlib is that the workers get spawned (in your case 4) and each collects rollout_fragment_length=20 timesteps in each of its environments (in your case 8). So per rollout your workers collect num_workers * num_envs_per_worker * rollout_fragment_length = 640 timesteps. If your environments are slow, this alone can already take some time. Furthermore, you set timesteps_per_iteration to 25000, which tells RLlib to repeat the rollouts until this number of timesteps is reached (approximately 40 rollouts). Only after that is a training step performed with train_batch_size samples and num_sgd_iter SGD iterations. Once this has finished, the metrics are returned to the caller and can be compared against the stopping criteria.
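The arithmetic above can be checked in a few lines (the variable names here just mirror the config keys for illustration; this is not RLlib code):

```python
import math

# Numbers taken from the config in the question.
num_workers = 4
num_envs_per_worker = 8
rollout_fragment_length = 20
timesteps_per_iteration = 25000

# Timesteps gathered in one sampling round across all workers/envs.
timesteps_per_rollout = num_workers * num_envs_per_worker * rollout_fragment_length
print(timesteps_per_rollout)  # 640

# Sampling rounds needed before one train iteration reports results.
rollouts_per_iteration = math.ceil(timesteps_per_iteration / timesteps_per_rollout)
print(rollouts_per_iteration)  # 40
```

So even with a fast environment, roughly 40 full sampling rounds happen before Tune ever sees a result to compare against the stop condition.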

The point is that the Trainable.step() function called by Tune only returns after a full training iteration has completed. Only then can the stopping criteria be compared against the metrics. So a very intensive rollout phase explains the long time it takes until Tune stops.
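A minimal sketch of that driver loop shows why the stop condition can overshoot (FakeTrainable and run_trial are hypothetical stand-ins, not the actual Tune source):

```python
class FakeTrainable:
    """Stands in for an RLlib Trainer: each step() simulates one full
    train iteration, during which many episodes run to completion."""
    def __init__(self):
        self.episodes_total = 0

    def step(self):
        self.episodes_total += 50  # many episodes happen inside one iteration
        return {"episodes_total": self.episodes_total}

def run_trial(trainable, stop):
    # Simplified sketch of Tune's loop: the stop dict is only checked
    # BETWEEN calls to step(), never in the middle of an iteration.
    while True:
        result = trainable.step()
        if any(result.get(key, 0) >= threshold for key, threshold in stop.items()):
            return result

result = run_trial(FakeTrainable(), stop={"episodes_total": 1})
print(result["episodes_total"])  # 50, far past the requested 1
```

This mirrors what you observed: with "episodes_total": 10 on the MPE environment, the first iteration had already run 500 episodes before the check fired.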

Hope this clarifies it a bit.