Handling of Incomplete Episodes in RLlib

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have a long traffic simulation with rewards at every step. Let's say one day, i.e. 86,400 steps.
I want a lower train_batch_size, e.g. 2,000, to update my model with the collected samples.
I use 20 rollout workers, each with one env, so it should collect 100 samples from each environment.

I saw in other issues that Ray only reports rewards once my simulations have terminated; before that they are just NaN. So in my case that would be the first 86,400 × 20 steps.
I have set rollout_fragment_length = config.ppo.train_batch_size // num_rollout_workers and batch_mode="truncate_episodes".
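
In code, the relevant part of my config looks roughly like this (the env name is a placeholder and I've inlined my config values):

```python
from ray.rllib.algorithms.ppo import PPOConfig

num_rollout_workers = 20
train_batch_size = 2_000  # my lowered batch size

config = (
    PPOConfig()
    .environment("my_traffic_env")  # placeholder for my custom traffic env
    .rollouts(
        num_rollout_workers=num_rollout_workers,
        # 2_000 // 20 = 100 steps collected per worker per iteration
        rollout_fragment_length=train_batch_size // num_rollout_workers,
        batch_mode="truncate_episodes",
    )
    .training(train_batch_size=train_batch_size)
)
```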

When no simulation terminates, does the algorithm still learn from the intermediate rewards, or does Ray only train once simulations are completed?
Can I do anything so that intermediate rewards are reported more frequently? My metrics logged to wandb are also not updated frequently.

Versions:

  • python 3.10.14
  • ray 2.10.0

To answer your first question, truncated episodes are indeed used during training. I actually went into the code to poke at that just yesterday for an experiment I was running.

As far as reporting intermediate rewards goes, yes, it's no trouble at all. Just write a callback that logs your custom metric (that'd be your average reward per timestep, I'd think) on sample end. The examples folder has some code for callbacks and custom logging that'll help you out there.
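
Something along these lines should get you started. This is only a minimal sketch against the old API stack's DefaultCallbacks in Ray 2.10; the class name, metric key, and the print are my own placeholders, I'm assuming a single-agent setup so samples["rewards"] is a flat array, and the on_learn_on_batch part is just one way I'd surface the number in the reported results:

```python
import numpy as np
from ray.rllib.algorithms.callbacks import DefaultCallbacks


class IntermediateRewardCallbacks(DefaultCallbacks):
    def on_sample_end(self, *, worker, samples, **kwargs):
        # Runs on every rollout worker each time a fragment of
        # rollout_fragment_length steps has been collected -- long before
        # any episode terminates.
        mean_r = float(np.mean(samples["rewards"]))
        print(f"collected {samples.count} steps, mean reward/step: {mean_r:.4f}")

    def on_learn_on_batch(self, *, policy, train_batch, result, **kwargs):
        # Whatever you put into `result` here is reported with the learner
        # stats every training iteration, so it also reaches your wandb
        # logger without waiting for an episode to finish.
        result["mean_reward_per_step"] = float(np.mean(train_batch["rewards"]))
```

Hook it up with config.callbacks(IntermediateRewardCallbacks) and the metric should show up in your training results every iteration.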

Edit: just noticed this was posted in September of last year. Ah well, hopefully it's a useful reference to someone with the same question later.