1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
[O] Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
Ray version: 2.43
Python version: 3.12.9
OS: Windows 11
Cloud/Infrastructure:
Other libs/tools (if relevant): torch 2.5, numpy 2.2
3. What happened vs. what you expected:
Expected: the algo.train() call to return once an episode terminates.
Actual: after the episode terminates, the environment resets and sample collection starts again until the total step count hits 1000.
This is my current code:
from ray.rllib.algorithms.dqn import DQNConfig

my_dqn_config = (
    DQNConfig()
    .environment(
        env="my_env",
        env_config=my_config,
    )
    .training(
        replay_buffer_config=replay_buffer_config,
        # train_batch_size_per_learner=32,
        # num_epochs=3,
        # shuffle_batch_per_epoch=True,
        model={
            "fcnet_hiddens": [256, 256],
        },
        num_steps_sampled_before_learning_starts=100,
    )
    .learners(
        # TODO(@chungs4): This will fail without setting the environment variable via
        # ray.init(runtime_env={"env_vars": {"USE_LIBUV": "0"}}), due to changes in PyTorch >= 2.4.0.
        num_learners=1,
        # TODO(@chungs4): GPU implementation with NCCL.
        # num_gpus_per_learner=1,
    )
    .env_runners(
        num_env_runners=1,
        batch_mode="complete_episodes",
    )
)
The training starts after collecting around 100 samples (due to num_steps_sampled_before_learning_starts=100), and each batch is cut off at episode boundaries (due to batch_mode="complete_episodes"). However, training keeps running even though I want the train() call to stop after one episode.
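For context, this is roughly how I run it (a sketch; MyEnv stands in for my custom environment class, and my_config / replay_buffer_config are defined earlier):

from ray.tune.registry import register_env

register_env("my_env", lambda cfg: MyEnv(cfg))  # MyEnv is my custom env (placeholder name)

algo = my_dqn_config.build()
result = algo.train()  # I expected this to return once the first episode terminates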
Can you please post the stop_config that you are using with your training? There is an example of it being used in RLlib here: Replay Buffers — Ray 2.43.0
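For reference, a minimal sketch of what I mean, with an illustrative stop criterion (the key and value here are placeholders, not a recommendation for your setup):

from ray import train, tune

tuner = tune.Tuner(
    "DQN",
    param_space=my_dqn_config,
    run_config=train.RunConfig(
        # Tune checks these criteria against the result dict returned by each
        # train() iteration; "training_iteration" is a built-in stop key.
        stop={"training_iteration": 5},
    ),
)
results = tuner.fit()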
It turned out to be a .reporting() problem rather than a training one, since it directly affects how data is reported to TensorBoard. I got around it by setting:
.reporting(
    # Every episode lasts at least 30 timesteps, so a minimum of 10 is always
    # satisfied by a single episode.
    min_sample_timesteps_per_iteration=10,
)
Can I also set min_sample_timesteps_per_iteration using stop_config?
I hard-coded the value min_sample_timesteps_per_iteration=10 because every episode runs at least 30 timesteps, so each train() call corresponds to one episode. Is there a way (or a variable) to enforce a 1-1 match between one iteration and one episode without hardcoding as I did above?
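For example, something like this is what I would prefer over a magic number (a sketch; "min_episode_len" is a key I am assuming I could add to my_config, not an existing one):

# Assumed: my_config carries the known lower bound on episode length (>= 30 in
# my case), so the one-episode-per-iteration assumption lives in one place.
min_episode_len = my_config["min_episode_len"]  # hypothetical key
my_dqn_config = my_dqn_config.reporting(
    min_sample_timesteps_per_iteration=min_episode_len,
)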
It would have been a lot easier if I had seen this comment earlier.
I think overriding AlgorithmConfig or its subclass is a good idea. Thanks for the suggestion.
Meanwhile, what would happen if I set min_sample_timesteps_per_iteration to 0 or 1? Wouldn't that clip each train() iteration to exactly one episode (since I am batching by episode in the env_runner)?
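Something like this is what I have in mind (untested sketch):

# Untested: with batch_mode="complete_episodes", the env runner returns whole
# episodes only, so a minimum of 1 sampled timestep should already be satisfied
# by the first complete episode. The open question is whether the runner starts
# a second episode before the iteration is considered done.
my_dqn_config = my_dqn_config.reporting(min_sample_timesteps_per_iteration=1)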