Unexpected dramatic drop in reward

Hi all,
I trained a PPO agent on my custom env. Everything was working pretty well, but suddenly the reward dropped and never recovered. As you can see in the figure below, the agent had almost perfectly learned my env (10 is the maximum reward in my custom env).

I expected the reward curve to plateau somewhere after the blue line, but as you can see in the figure, it dropped dramatically!

Do you think this is a Ray/RLlib issue, or could it be related to my CUDA setup or something else?

Thanks!


Is it possible in your custom environment for the agents to receive such low rewards within an episode? If not, then you’re probably seeing a bug.
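
As a quick sanity check, you could wrap the env and flag any episode whose return is lower than what the env should ever produce. This is just a sketch assuming a gymnasium-style 5-tuple step() (drop truncated if you are still on old gym):

import gymnasium as gym

class EpisodeReturnCheck(gym.Wrapper):
    """Warn whenever an episode return falls below `min_return`."""

    def __init__(self, env, min_return=0.0):
        super().__init__(env)
        self.min_return = min_return
        self._episode_return = 0.0

    def reset(self, **kwargs):
        self._episode_return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._episode_return += reward
        if (terminated or truncated) and self._episode_return < self.min_return:
            print(f"Suspiciously low episode return: {self._episode_return}")
        return obs, reward, terminated, truncated, info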


@rusu24edward, thanks for your reply!

Hey @deepgravity , could you check your model’s weights? Maybe they have collapsed/exploded/NaN’d after some learning update?
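
One quick way to do that (just a sketch; `algo` here is assumed to be your trained PPO Algorithm instance) is to pull the policy weights and scan them for non-finite values:

import numpy as np

# `algo` is assumed to be the trained RLlib Algorithm object.
weights = algo.get_policy().get_weights()  # layer name -> numpy array
for name, w in weights.items():
    if not np.all(np.isfinite(w)):
        print(f"Non-finite values found in {name}")
    else:
        print(f"{name}: max |weight| = {np.abs(w).max():.3g}")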

Hi @sven1977, thanks for your reply. I actually no longer face this issue, but I don't know how I fixed it. I changed many things in my custom env and in my agent training pipeline, so I'm not sure what the main reason for the error was. Anyway, now everything works pretty well :slight_smile:

Hi @deepgravity, I am running into the same issue you described here. Would it be possible for you to share some of your configs here? The PPO config and the Ray Tune config, and maybe a small reproducer? Thanks so much!

Hi @max_ronda,

Here is my code. I hope it helps; otherwise, please feel free to ask more questions.

# -*- coding: utf-8 -*-
"""
Created on Sun Sep 12 09:35:41 2021

@author: Reza Kakooee
"""

# %%
import os

import ray
from ray import tune
from ray import air
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune.registry import get_trainable_cls

from ray.air.integrations.wandb import setup_wandb
from ray.air.integrations.wandb import WandbLoggerCallback

from ray.rllib.utils.replay_buffers.replay_buffer import StorageUnit

from learner_config import LearnerConfig

from callbacks import EnvInfoCallback

from gym_floorplan.envs.master_env import SpaceLayoutGym


# %%
class Tunner:
    def __init__(self, fenv_config, agent_config):
        self.fenv_config = fenv_config
        self.env_name = self.fenv_config['env_name']

        self.agent_config = agent_config

        self.algo_cls = get_trainable_cls(self.agent_config['agent_first_name'])
        
        self.param_space = (
            self.algo_cls
                .get_default_config()
                .environment(SpaceLayoutGym, env_config=self.fenv_config)
                .framework(self.agent_config['framework'])
                .rollouts(num_rollout_workers=self.agent_config['num_rollout_workers'])
                .resources(num_gpus=self.agent_config['RLLIB_NUM_GPUS'])
                .training(_enable_learner_api=False)
                .rl_module(_enable_rl_module_api=False)
        )
        
        if self.agent_config['save_env_data_flag']:
            # Write sampled experiences to disk via RLlib's offline-data output.
            self.param_space = self.param_space.offline_data(
                output=self.agent_config['env_data_dir'],
                output_max_file_size=5000000,
            )

        stop = {"training_iteration": self.agent_config['stop_tunner_iteration']}
        self.run_config = air.RunConfig(
            stop=stop,
            local_dir=self.agent_config['scenario_dir'],
            checkpoint_config=air.CheckpointConfig(checkpoint_at_end=True,
                                                   checkpoint_frequency=self.agent_config['checkpoint_frequency']),
            callbacks=[WandbLoggerCallback(project=self.agent_config['project_name'],
                                           group=self.agent_config['group_name'])],
            verbose=2, #get_air_verbosity(AirVerbosity.DEFAULT),
            )
        
        
    def tune(self, save_outputs=True):
        if self.agent_config['load_agent_flag']:
            # Tuner.restore() is a classmethod that returns a new Tuner
            # restored from the old experiment directory.
            tuner = tune.Tuner.restore(
                self.agent_config['old_model_path'],
                trainable=self.agent_config['old_agent_first_name'],
            )
        else:
            tuner = tune.Tuner(
                self.agent_config['agent_first_name'],
                run_config=self.run_config,
                param_space=self.param_space,
            )

        results = tuner.fit()
        return results
Hi @deepgravity, thanks for the response! Which parameters do you think affect this reward drop? I tried running with torch and disabling _enable_learner_api and _enable_rl_module_api, but no luck. Do you have an environment that can be used to reproduce this issue? Which version of Ray are you using? Thanks!

Hi @max_ronda,

Honestly, I don’t remember what happened at the time, but I don’t think it was related to _enable_learner_api or _enable_rl_module_api.

It might be something related to your custom env (if you have one), or to the way you configured your RLlib agents.
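
One way to narrow it down (just a sketch against the Ray 2.6-style API used above; CartPole-v1 and the exact settings are placeholders) is to run PPO on a built-in env with roughly your config and see whether the collapse still shows up:

from ray.rllib.algorithms.ppo import PPOConfig

# Same API style as the Tunner snippet above, but on a built-in env.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(num_rollout_workers=2)
)
algo = config.build()
for i in range(20):
    result = algo.train()
    print(i, result["episode_reward_mean"])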

I am currently using Ray 2.6 and even newer versions on another device.

I have already open-sourced an older version of my custom env, along with the way I trained my RLlib agents.

The code I sent you a few days ago does not work with this repo, because it was written for Ray 2.6; in this repo, I used Ray 1.x.

I hope this helps