I am using the default config settings for ray.rllib.agents.dqn.DQNTrainer(config=config, env=select_env).
When I run one call of the DQNTrainer.train() method, it reports an arbitrary number of episodes and an arbitrary number of training iterations.
Can someone please help me understand, step by step, how the DQNTrainer.train() method decides when to stop?
You always get one training iteration upon calling trainer.train(). You can check this via trainer.iteration afterwards.
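For example (a minimal sketch; the exact config keys and result fields can vary across Ray versions, and "CartPole-v0" is just a placeholder env):

    import ray
    from ray.rllib.agents.dqn import DQNTrainer

    ray.init()
    trainer = DQNTrainer(config={"num_workers": 0}, env="CartPole-v0")

    result = trainer.train()              # exactly one training iteration
    print(trainer.iteration)              # -> 1
    print(result["episodes_this_iter"])   # varies with episode lengths
    print(result["timesteps_total"])      # env steps sampled so far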
The number of episodes depends on the environment (e.g. does it terminate early?).
The number of env steps, in turn, depends on the algorithm's execution plan. Some algos perform a fixed number of update steps per iteration, e.g. PPO. Others have a timing mechanism, like DQN and IMPALA.
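For DQN, the relevant timing knobs are roughly these (a sketch based on the Ray 1.x defaults; names and default values may differ in other versions):

    config = {
        # Minimum number of new env steps that must be sampled
        # before a train() call may return.
        "timesteps_per_iteration": 1000,
        # Minimum wall-clock seconds per iteration (common Trainer config).
        "min_iter_time_s": 0,
    }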
In particular, DQN has this in its execution plan (ray/rllib/agents/dqn/dqn.py):
    # These weights determine the ratio between
    # a) sample-collection-and-storing steps and
    # b) sample-from-buffer-and-train-one-step updates.
    train_op = Concurrently(
        [store_op, replay_op],
        mode="round_robin",
        output_indexes=[1],
        round_robin_weights=calculate_rr_weights(config))
This round-robin weighting causes each iteration to contain a slightly different number of steps.
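For reference, calculate_rr_weights looks roughly like this in Ray 1.x (reproduced from memory, so treat it as a sketch rather than the exact source):

    def calculate_rr_weights(config):
        """Ratio of replay/train ops to sample/store ops."""
        if not config["training_intensity"]:
            return [1, 1]
        # Native ratio of trained steps to sampled steps per pair of ops,
        # e.g. train_batch_size=32, rollout_fragment_length=4 -> 8.0.
        native_ratio = (
            config["train_batch_size"] / config["rollout_fragment_length"])
        # training_intensity is specified as (steps_trained / steps_sampled),
        # so rescale by the native ratio.
        return [1, config["training_intensity"] / native_ratio]

With the default training_intensity of None this returns [1, 1], i.e. one sample-and-store op alternating with one replay-and-train op.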