Constant episode_reward_mean over training, even when setting the horizon parameter

Hi everyone,

I’m encountering some issues while training my agent on a custom environment. Specifically, the value of result["episode_reward_mean"] remains constant for several iterations during training. This behavior is significantly affecting the agent’s performance.

After searching through the forums, I found that this might be related to how Ray RLlib handles episode termination. I tried introducing the horizon parameter in my configuration dictionary, but the situation hasn’t improved.

I have some questions about how the horizon parameter works:
For instance, if its value is set to 1000 and my episode ends at step 900, I assume everything should work fine. But what happens if my environment doesn’t terminate even at step 1000? My episodes can have variable lengths, and I can’t predict their exact duration in advance.

Why, in your opinion, am I still facing this issue despite using the horizon parameter?

Here’s my configuration dictionary:

config = {
    "env": Custom_Env
    "env_config": env_config,
    "exploration_config": {
        "type": "StochasticSampling"
    },
    "model": {
        "custom_model": CustomNet,
    },
    "lr": 0.95,
    "gamma": 0.001,
    "num_workers": 12,
    "num_envs_per_worker": 8,
    "num_gpus": 1,
    "horizon": 1000
}

I’m using the PG algorithm.

Below, I’ve attached a plot of result["episode_reward_mean"] over the training period. As you can see, there are noticeable plateaus in the graph.

Any insights or suggestions would be greatly appreciated!

Regards,
L.E.O.

Hi @LeoLeoLeo,

One point of clarification: if this plot is from the standard reporting metrics from RLlib, then the x-axis is steps, not episodes.

My guess, based on what you have shared, is that the flat rewards are cases where your environment has not terminated during sampling. The env metrics are only updated when an episode is terminated or truncated.

One way you can check this is to plot results["num_episodes"], and results["num_episodes_lifetime"] if it exists.
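Something along these lines (just a sketch; I am assuming trainer is your built algorithm, and the exact result keys differ between Ray versions and API stacks):

# Sketch: check how many episodes actually complete each iteration.
# On the old API stack the keys are "episodes_this_iter" / "episodes_total";
# "num_episodes" / "num_episodes_lifetime" are the newer equivalents.
for i in range(20):
    result = trainer.train()
    print(
        i,
        result.get("episodes_this_iter"),
        result.get("episodes_total"),
        result.get("episode_reward_mean"),
    )

If the episode counts do not increase on the iterations where the reward is flat, that confirms no episodes finished during those sampling phases.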

Which version of Ray are you using? My understanding is that horizon has been deprecated in the latest versions.

Hi @mannyv,

Thank you for your response.

I am currently using Ray version 2.9.3, as I want to experiment with the PG algorithm, which has been deprecated in the more recent versions.

Through some trial and error, I was able to better understand the issue. Initially, I assumed that each call to the trainer.train() method corresponded to a single training step. However, as I understand it now, each iteration actually covers a cycle of rollout collection across the various workers and the environments per worker, followed by a training update. I resolved the issue by properly configuring the rollout_fragment_length and train_batch_size parameters, roughly as in the sketch below.
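What I converged on looks roughly like this (the numbers are illustrative, not my exact values):

# Illustrative sketch: each sampling round collects
# num_workers * num_envs_per_worker * rollout_fragment_length timesteps,
# so I sized the fragments so that whole (truncated) episodes fit
# inside a single iteration.
config.update({
    "rollout_fragment_length": 500,     # one full 500-step episode per env
    "train_batch_size": 12 * 8 * 500,   # num_workers * num_envs_per_worker * fragment
})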

I also observed that the result["hist_stats"]["episode_reward"] vector is not updated during iterations where the reward plateaus. Therefore, I assume this vector should be used to monitor training performance. However, I still have a couple of questions:

  1. How can I increase the maximum size of this vector? It currently holds at most 100 elements.
  2. Alternatively, is there another vector or metric I should use to plot the rewards obtained during the training process?

Currently, I have truncated my episodes to a maximum length of 500 timesteps (a simplified sketch of how I do this is below), but I am still uncertain about how to handle episodes when their lengths are unknown in advance.
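For reference, the truncation is just a step counter in my env's step() method, along these lines (simplified; _do_step and _t are placeholders for my actual logic):

# Simplified sketch of the truncation: flag the episode as truncated
# after 500 timesteps, following the gymnasium (terminated, truncated) API.
def step(self, action):
    obs, reward, terminated, info = self._do_step(action)  # placeholder for my env logic
    self._t += 1
    truncated = self._t >= 500   # hard cap on episode length
    return obs, reward, terminated, truncated, info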

Could you please help clarify these points?

Thank you in advance for your guidance.

Best regards,
L.E.O.

Hi @LeoLeoLeo ,

You were already looking at the correct value to track reward during training, episode_reward_mean. Each time it is logged, it is the mean of the most recent 100 completed episodes, which are the ones in hist_stats. If no episodes complete during the sample phase, the value stays the same since hist_stats does not change either.

You can control the number of episodes in hist_stats with the reporting argument metrics_num_episodes_for_smoothing.

https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.reporting.html
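With the AlgorithmConfig API it would look roughly like this (a sketch reusing your Custom_Env and env_config names; in Ray 2.9 the PG config class is still available):

from ray.rllib.algorithms.pg import PGConfig

# Sketch: keep the last 500 completed episodes for hist_stats /
# episode_reward_mean smoothing instead of the default 100.
config = (
    PGConfig()
    .environment(Custom_Env, env_config=env_config)
    .reporting(metrics_num_episodes_for_smoothing=500)
)
algo = config.build()

With the plain config dict you posted, I believe adding "metrics_num_episodes_for_smoothing": 500 should have the same effect.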

What are you trying to accomplish with the truncation? Unless you have a specific need, for example your environment never terminates, or you want to enforce a maximum length so that a stuck agent times out, there is no need to truncate an episode. In many cases, variable-length episodes and episodes of unknown length work just fine in RLlib and PPO without any special treatment.