Have one iteration of algo.train() stop after one episode

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
[O] Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.43
  • Python version: 3.12.9
  • OS: Windows 11
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant): torch 2.5, np 2.2

3. What happened vs. what you expected:

  • Expected: the algo.train() call to return once an episode terminates.
  • Actual: after the episode terminates, the environment resets and sample collection starts again until the total step count reaches 1000.

This is my current code:

from ray.rllib.algorithms.dqn import DQNConfig

my_dqn_config = (
    DQNConfig()
    .environment(
        env="my_env",
        env_config=my_config,
    )
    .training(
        replay_buffer_config=replay_buffer_config,
        # train_batch_size_per_learner=32,
        # num_epochs=3,
        # shuffle_batch_per_epoch=True,
        model={
            "fcnet_hiddens": [256, 256]
            },
        num_steps_sampled_before_learning_starts=100
    )
    .learners(
        # TODO(@chungs4): This will fail unless the job is started with
        # ray.init(runtime_env={"env_vars": {"USE_LIBUV": "0"}}), due to changes in PyTorch >= 2.4.0.
        num_learners=1,
        # TODO(@chungs4): GPU Implementation with NCCL
        # num_gpus_per_learner=1
    )
    .env_runners(
        num_env_runners=1,
        batch_mode="complete_episodes"
    )
)

The training starts after collecting around 100 samples (due to num_steps_sampled_before_learning_starts=100), and each batch is cut off at an episode boundary (due to batch_mode="complete_episodes"). However, the training keeps running even though I want the train() call to stop after one episode.
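In case it helps, this is roughly how I am driving the training (a simplified sketch that reuses the my_dqn_config above; the exact result keys may differ between API stacks):

algo = my_dqn_config.build()

# Sketch of the driver loop: each algo.train() call is one "training
# iteration", and RLlib keeps sampling episodes inside it until that
# iteration's minimum-timestep criteria are met.
for i in range(5):
    result = algo.train()
    # "training_iteration" is a standard key in the result dict.
    print(result["training_iteration"])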

Hi there and welcome to the Ray community!

Can you please post the stop_config that you are using with your training? There is an example of it being used in RLlib here: Replay Buffers — Ray 2.43.0
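For reference, a stop config is usually just a dict of result metrics handed to Tune, along these lines (a minimal sketch; the RunConfig import location varies a bit across Ray versions):

from ray import tune

# Sketch: the stop dict's keys are result metrics, and the run ends once
# any of them is reached (here, after 100 calls to train()).
tuner = tune.Tuner(
    "DQN",
    param_space=my_dqn_config,  # your DQNConfig (or my_dqn_config.to_dict())
    run_config=tune.RunConfig(
        stop={"training_iteration": 100},
    ),
)
tuner.fit()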

Thank you!!

Hi @sunghyun.chung,

I think you are going to have a hard time getting it to stop after exactly one episode unless your episode has a fixed length.

Here is the logic for how RLlib decides when to stop one training iteration.

You could create a custom DQN algorithm and overload the should_stop method.

The reason you are seeing 1000 timesteps is DQN's default min_sample_timesteps_per_iteration:
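A quick sketch of checking and lowering that default through the reporting() block (1000 is DQN's own default, which matches what you are seeing):

from ray.rllib.algorithms.dqn import DQNConfig

cfg = DQNConfig()
# DQN raises the algorithm-wide default here, which is why one train()
# call keeps sampling until roughly 1000 env steps have been collected.
print(cfg.min_sample_timesteps_per_iteration)  # 1000 for DQN

# Lowering it lets each training iteration stop sampling much sooner:
cfg = cfg.reporting(min_sample_timesteps_per_iteration=1)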

Hey Christina. Thanks for the reply.

Rather than a training issue, it turned out to be more of a .reporting() configuration problem, since it directly affects how data is logged to TensorBoard. I got around it by setting:

    .reporting(
        # I know that the episode will last at least 30 timesteps no matter what.
        min_sample_timesteps_per_iteration=10
    )
  1. Can I also set min_sample_timesteps_per_iteration using stop_config?
  2. I hard-coded min_sample_timesteps_per_iteration=10 because the episode runs at least 30 timesteps, so every train() call corresponds to one episode. Is there a way (or a setting) to enforce a 1-to-1 match between one iteration and one episode without hardcoding the value as I did above?

Thanks for the reply, mannyv.

It would have been a lot easier if I had seen this comment earlier.
I think overriding AlgorithmConfig or one of its subclasses is a good idea. Thanks for the suggestion.

Meanwhile, what would happen if I set min_sample_timesteps_per_iteration to 0 or 1? Wouldn't that clip each train() iteration to exactly one episode (since I am batching by episode in the env_runner)?