Understanding the Stopping Process for ray.rllib.agents.dqn.DQNTrainer.train()

Hi everyone.

I am using the default config settings for ray.rllib.agents.dqn.DQNTrainer(config=config, env=select_env).
When I run a single call to the DQNTrainer.train() method, it runs an arbitrary number of episodes and an arbitrary number of training iterations.

Can someone please help me understand, step by step, how the DQNTrainer.train() method decides when to stop?

cc @sven1977 can you help with this?

@sven1977 , I would really appreciate any help on this.

Hey @Saurabh_Arora , thanks for this question!

You always get exactly one training iteration per call to trainer.train(). You can check this via trainer.iteration afterwards.

The number of episodes per iteration depends on the environment (e.g. does it terminate early?, etc.).
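For example (a minimal sketch, assuming a simple registered env like "CartPole-v0"; the printed keys are the standard RLlib result metrics):

    from ray.rllib.agents.dqn import DQNTrainer

    # Minimal sketch: one train() call == one training iteration.
    trainer = DQNTrainer(config={}, env="CartPole-v0")

    result = trainer.train()                 # exactly one training iteration
    print(trainer.iteration)                 # -> 1
    print(result["episodes_this_iter"])      # varies with episode lengths
    print(result["timesteps_total"])         # total env steps sampled so far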

And the number of env steps per iteration depends on the algorithm’s execution plan. Some algos perform a fixed set of update steps per iteration, e.g. PPO. Others use a timing mechanism, like DQN and IMPALA.

In particular, DQN has this here in its plan (ray/rllib/agents/dqn/dqn.py):

    # The round-robin weights below determine the ratio between
    # a) sample-collection-and-storing ops and
    # b) sample-from-buffer-and-train-one-step ops.
    train_op = Concurrently(
        [store_op, replay_op],
        mode="round_robin",
        output_indexes=[1],
        round_robin_weights=calculate_rr_weights(config))

The round robin weighting causes each iteration to contain a slightly different number of steps.
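As a rough illustration, these are the DQN config keys (old ray.rllib.agents API) that feed into this timing/weighting; the exact defaults may differ between Ray versions:

    from ray.rllib.agents.dqn import DQNTrainer

    config = {
        # Minimum number of env steps to accumulate per train() call
        # (DQN's timing mechanism).
        "timesteps_per_iteration": 1000,
        # Env steps collected per sampling op on each rollout worker.
        "rollout_fragment_length": 4,
        # Batch size sampled from the replay buffer for one SGD update.
        "train_batch_size": 32,
        # Optional train-steps-to-env-steps ratio; if set, it is used by
        # calculate_rr_weights() to derive the round-robin weights above.
        "training_intensity": None,
    }
    trainer = DQNTrainer(config=config, env="CartPole-v0")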

@sven1977 , thanks for responding. Can I customize the stopping criterion of trainer.train() while running PPO (or DQN)?
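To make it concrete, I imagine something like the following (hypothetical thresholds), either looping over train() myself or letting Tune stop for me:

    from ray import tune

    # Option A: loop over train() and stop on my own criteria.
    while True:
        result = trainer.train()
        if (result["timesteps_total"] >= 100_000
                or result["episode_reward_mean"] >= 195.0):
            break

    # Option B: let Tune handle the stopping criteria.
    tune.run(
        "DQN",  # or "PPO"
        config=config,
        stop={"timesteps_total": 100_000, "episode_reward_mean": 195.0},
    )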