Unexpected dramatic drop in reward

Hi all,
I trained a PPO agent on my custom env. Everything was working pretty well, but suddenly the reward dropped and never recovered. As you can see in the figure below, the agent had almost perfectly learned my env (10 is the maximum reward in my custom env).

I expected the reward curve to plateau somewhere after the blue line, but as you can see in the figure, it dropped dramatically!

Do you think this is a Ray/RLlib issue, or could it be related to my CUDA setup or something else?

Thanks!


Is it possible in your custom environment for the agents to receive such low rewards within an episode? If not, then you’re probably seeing a bug.
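
As a quick sanity check, you could wrap the env and flag any episode whose return is lower than what the env should ever produce. This is just a sketch assuming a gymnasium-style 5-tuple step() (drop truncated if you are still on old gym):

import gymnasium as gym

class EpisodeReturnCheck(gym.Wrapper):
    """Warn whenever an episode return falls below `min_return`."""

    def __init__(self, env, min_return=0.0):
        super().__init__(env)
        self.min_return = min_return
        self._episode_return = 0.0

    def reset(self, **kwargs):
        self._episode_return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._episode_return += reward
        if (terminated or truncated) and self._episode_return < self.min_return:
            print(f"Suspiciously low episode return: {self._episode_return}")
        return obs, reward, terminated, truncated, info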


@rusu24edward, thanks for your reply!

Hey @deepgravity , could you check your model’s weights? Maybe they have collapsed/exploded/NaN’d after some learning update?
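
One quick way to do that (just a sketch; `algo` here is assumed to be your trained PPO Algorithm instance) is to pull the policy weights and scan them for non-finite values:

import numpy as np

# `algo` is assumed to be the trained RLlib Algorithm object.
weights = algo.get_policy().get_weights()  # layer name -> numpy array
for name, w in weights.items():
    if not np.all(np.isfinite(w)):
        print(f"Non-finite values found in {name}")
    else:
        print(f"{name}: max |weight| = {np.abs(w).max():.3g}")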

Hi @sven1977, thanks for your reply. I actually no longer face this issue, but I don't know how I fixed it. I changed many things in my custom env and in my agent training pipeline, so I'm not sure what the main reason for the error was. Anyway, now everything works pretty well :slight_smile:

Hi @deepgravity, I am running into the same issue you described here. Would it be possible for you to share some of your configs here? The PPO config and the Ray Tune config, and maybe a small reproducer? Thanks so much!

Hi @max_ronda,

Here is my code. I hope it helps; otherwise, please feel free to ask more questions.

# -*- coding: utf-8 -*-
"""
Created on Sun Sep 12 09:35:41 2021

@author: Reza Kakooee
"""

# %%
import os

import ray
from ray import tune
from ray import air
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune.registry import get_trainable_cls

from ray.air.integrations.wandb import setup_wandb
from ray.air.integrations.wandb import WandbLoggerCallback

from ray.rllib.utils.replay_buffers.replay_buffer import StorageUnit

from learner_config import LearnerConfig

from callbacks import EnvInfoCallback

from gym_floorplan.envs.master_env import SpaceLayoutGym


# %%
class Tunner:
    def __init__(self, fenv_config, agent_config):
        self.fenv_config = fenv_config
        self.env_name = self.fenv_config['env_name']

        self.agent_config = agent_config

        self.algo_cls = get_trainable_cls(self.agent_config['agent_first_name'])
        
        self.param_space = (
            self.algo_cls
                .get_default_config()
                .environment(SpaceLayoutGym, env_config=self.fenv_config)
                .framework(self.agent_config['framework'])
                .rollouts(num_rollout_workers=self.agent_config['num_rollout_workers'])
                .resources(num_gpus=self.agent_config['RLLIB_NUM_GPUS'])
                .training(_enable_learner_api=False)
                .rl_module(_enable_rl_module_api=False)
        )
        
        if self.agent_config['save_env_data_flag']:
            # Write sampled experiences to disk via RLlib's offline-data output.
            self.param_space = self.param_space.offline_data(
                output=self.agent_config['env_data_dir'],
                output_max_file_size=5000000,
            )

        stop = {"training_iteration": self.agent_config['stop_tunner_iteration']}
        self.run_config = air.RunConfig(
            stop=stop,
            local_dir=self.agent_config['scenario_dir'],
            checkpoint_config=air.CheckpointConfig(checkpoint_at_end=True,
                                                   checkpoint_frequency=self.agent_config['checkpoint_frequency']),
            callbacks=[WandbLoggerCallback(project=self.agent_config['project_name'],
                                           group=self.agent_config['group_name'])],
            verbose=2, #get_air_verbosity(AirVerbosity.DEFAULT),
            )
        
        
    def tune(self, save_outputs=True):
        if self.agent_config['load_agent_flag']:
            # Tuner.restore() is a classmethod that returns a new Tuner
            # restored from the old experiment directory.
            tuner = tune.Tuner.restore(
                self.agent_config['old_model_path'],
                trainable=self.agent_config['old_agent_first_name'],
            )
        else:
            tuner = tune.Tuner(
                self.agent_config['agent_first_name'],
                run_config=self.run_config,
                param_space=self.param_space,
            )

        results = tuner.fit()
        return results
Hi @deepgravity, thanks for the response! Which parameters do you think affect this reward drop? I tried running with torch and disabling _enable_learner_api and _enable_rl_module_api, but no luck. Do you have an environment that can be used to reproduce this issue? Which version of Ray are you using? Thanks!

Hi @max_ronda,

Honestly, I don’t remember what happened at the time, but I don’t think it was related to _enable_learner_api or _enable_rl_module_api.

It might be something related to your custom env (if you have one), or to the way you configured your RLlib agents.
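
One way to narrow it down (just a sketch against the Ray 2.6-style API used above; CartPole-v1 and the exact settings are placeholders) is to run PPO on a built-in env with roughly your config and see whether the collapse still shows up:

from ray.rllib.algorithms.ppo import PPOConfig

# Same API style as the Tunner snippet above, but on a built-in env.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(num_rollout_workers=2)
)
algo = config.build()
for i in range(20):
    result = algo.train()
    print(i, result["episode_reward_mean"])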

I am currently using Ray 2.6 and even newer versions on another device.

I have already open-sourced an older version of my custom env, along with the way I trained my RLlib agents.

The code I sent you a few days ago does not work with this repo, because it was written for Ray 2.6; in this repo, I used Ray 1.x.

I hope this helps