Agent consistently stops improving at the same point, despite not appearing to be in a local maximum

Overview:

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 3.0.0
  • Python version: 3.9.22
  • OS: Ubuntu 22.04
  • Cloud/Infrastructure: LambdaLabs 1x GH200 (64 virtual CPUs)

3. What happened vs. what you expected:

I was training an agent with PPO on a challenging but tractable task, and it stopped improving about 200 iterations into a 1,000-iteration training run. This behavior is consistent across attempts.

  • Expected: Since the policy at the point where improvement ceases is not far from a successful one, I expected the agent to keep improving toward the reward of 50 that I can consistently achieve when playing the environment myself.
  • Actual: Mean reward stays fixed around an average of 30 for the rest of the run, even though the policy does not appear to be at a local maximum.

Additional information which may be relevant is included below, including metrics tracked over the course of the run.

Detailed Information (including code):

Background:

My target environment consists of a ‘spaceship’, a ‘star’ whose gravitational pull the ship must account for (and which it must avoid), and a set of five targets the ship must hit by launching a limited supply of projectiles. My agent is a default PPO agent, except for an attention-based encoder whose design matches the architecture used here. The training run lasts 1,000 iterations with a train batch size of 32,768 steps and a minibatch size of 4,096 steps.

My immediate goal is to train an agent to accomplish a non-trivial task in a custom environment through use of a custom architecture. Videos of the environment are below, and the full code for my experiment and my testing suite can be found here. The command I used to run training is:

python run_training.py --env-name SW_MultiShoot_Env --env-config '{"speed": 2.0, "ep_length": 256}' --stop-iters=1000 --num-env-runners 60 --checkpoint-freq 100 --checkpoint-at-end
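
In config terms, that command corresponds to roughly the following (a sketch only; exact parameter names vary between RLlib versions, and the wiring of the custom attention encoder is omitted here):

from ray.rllib.algorithms.ppo import PPOConfig

# Assumes SW_MultiShoot_Env has been registered beforehand, e.g. via
# ray.tune.registry.register_env("SW_MultiShoot_Env", ...).
config = (
    PPOConfig()
    .environment(env="SW_MultiShoot_Env", env_config={"speed": 2.0, "ep_length": 256})
    .env_runners(num_env_runners=60)
    .training(
        train_batch_size=32_768,  # steps collected per training iteration
        minibatch_size=4_096,     # SGD minibatch size ("sgd_minibatch_size" on older versions)
    )
)
algo = config.build()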

As a brief aside, I’ve been working with RLlib on various projects for a while now, but it’s been hard to find case studies on the subtle challenges involved in solving difficult environments that require extended training runs and non-trivial architectural configurations. In particular, I get the sense that I should be leveraging Tune more effectively than I presently am. My goal, at the conclusion of this project, is to release an extended tutorial outlining the process by which a novel, challenging reinforcement learning task can be solved from start to finish using RLlib, including all of the small details that can seem impenetrable when first starting out. To that end, I would be very grateful to anyone who points out faulty assumptions I’m making or design flaws in my codebase, so that I can improve my workflow and share a better final product.

Problem:

My agent learns well up until around iteration 200, after which it stops learning in any meaningful way. Mean reward stalls, and the agent makes no further improvement to its performance along any axis.

I’ve tried this environment myself and had no issue getting the maximum reward. Qualitatively, the learned policy doesn’t seem to be in a local maximum. It is visibly making an effort to achieve the task, and its failures are due to imprecise control rather than a fundamental misunderstanding of the optimal policy. It makes use of all of the environment’s mechanics in pursuit of its goal, and appears to need only a small amount of refinement to solve the task. As far as I can tell, the point in policy-space that it inhabits is an ideal place for a reinforcement learning agent to be, aside from the fact that it gets stuck there and does not continue improving.

Analysis and Attempts to Diagnose:

Looking at trends in the metrics, I see that the value function loss declines precipitously after the point at which learning stops, with explained_var increasing commensurately. There’s no noticeable change in policy loss throughout, and while the KL divergence loss does fall off sharply, that happens later on. The global norm of the default optimizer’s gradients increases enormously after the point where learning stops, which seems significant, though I’m not sure what to conclude from it.
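
In case it matters for any suggestions: I haven’t changed the optimizer or gradient settings from their defaults. If explicit gradient clipping turns out to be relevant here, my understanding is that it would be configured roughly as below (a sketch; grad_clip and grad_clip_by are the parameter names I believe recent RLlib versions expose, and the threshold is an arbitrary example value):

# Sketch only: enable global-norm gradient clipping on top of the existing config.
# I have not verified these parameter names against my installed version,
# nor that clipping addresses the underlying issue.
config = config.training(
    grad_clip=40.0,              # arbitrary example threshold
    grad_clip_by="global_norm",  # clip by the global norm across all parameters
)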

Around the point at which learning stops, I see the following error in the console:

NaN or Inf found in input tensor.

It’s repeated twice each time, so I think it’s showing up somewhere on the learner node rather than in any of the env runners. I’ve triple-checked the environment to ensure that it never provides observations outside the [-1, 1] range it’s supposed to stay within. Training continues without anything outright breaking, and I’m not completely certain this is related to the problem at hand, but I’d still like to figure out what’s causing it.
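
For reference, the environment-side check I ran is roughly equivalent to wrapping the env as below (a sketch with a hypothetical wrapper name, not my actual test code; note it only covers observations and rewards, so a NaN produced inside the learner itself would slip past it):

import numpy as np
import gymnasium as gym

class SanityCheckWrapper(gym.Wrapper):
    # Hypothetical wrapper: asserts every observation is finite and within [-1, 1],
    # and that every reward is finite.

    def _check_obs(self, obs):
        obs = np.asarray(obs, dtype=np.float32)
        assert np.all(np.isfinite(obs)), f"Non-finite observation: {obs}"
        assert np.all(np.abs(obs) <= 1.0), f"Observation outside [-1, 1]: {obs}"
        return obs

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._check_obs(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        assert np.isfinite(reward), f"Non-finite reward: {reward}"
        return self._check_obs(obs), reward, terminated, truncated, info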

Images:

Relevant graphs

Mean episode return, capping out at 30 around iteration 200.

https://i.snipboard.io/QMjZo1.jpg

Value function loss falls sharply at this point. Not necessarily unexpected, given that the policy stops improving.

https://i.snipboard.io/v8su9N.jpg

Global norm of default optimizer’s gradients. I’ll admit that I’m not entirely certain of how relevant this is, but it increases tremendously around the time the agent stops improving.

https://i.snipboard.io/BIU9jO.jpg

Videos:

Successful manual completion of environment:

https://i.giphy.com/Hc5p0mDh7YDdl4Ys4w.webp

Agent's final policy:

https://i.giphy.com/5rvG1VfNomE61UJh2y.webp