Agent consistently stops improving at the same point, despite not appearing to be in a local maximum

Overview:

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 3.0.0
  • Python version: 3.9.22
  • OS: Ubuntu 22.04
  • Cloud/Infrastructure: LambdaLabs 1x GH200 (64 virtual CPUs)

3. What happened vs. what you expected:

While training a model on a challenging but tractable task using PPO, my agent stopped improving 200 iterations into a 1,000-iteration training run. This behavior is consistent across attempts.

  • Expected: Continued improvement past the ~200-iteration mark, since the policy at the point improvement ceases is not far from a successful one.
  • Actual: Mean reward stays fixed around 30 (a reward of 50 is consistently achievable when I tested the environment myself), and the policy does not appear to be at a local maximum.

Additional information which may be relevant is included below, including metrics tracked over the course of the run.

Detailed Information (including code):

Background:

My target environment consists of a ‘spaceship’, a ‘star’ whose gravitational force the ship must avoid and account for, and a set of five targets that it must hit by launching a limited set of projectiles. My agent is a default PPO agent, with the exception of an attention-based encoder whose design matches the architecture used here. The training run is carried out for 1,000 epochs with a batch size of 32,768 steps and a minibatch size of 4,096 steps.

My immediate goal is to train an agent to accomplish a non-trivial task in a custom environment through use of a custom architecture. Videos of the environment are below, and the full code for my experiment and my testing suite can be found here. The command I used to run training is:

python run_training.py --env-name SW_MultiShoot_Env --env-config '{"speed": 2.0, "ep_length": 256}' --stop-iters=1000 --num-env-runners 60 --checkpoint-freq 100 --checkpoint-at-end
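For context, the command above boils down to roughly the following configuration. This is a sketch only: the custom attention encoder and the env registration are omitted, and the exact .training() parameter names vary a bit between RLlib versions.

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("SW_MultiShoot_Env", env_config={"speed": 2.0, "ep_length": 256})
    .env_runners(num_env_runners=60)
    .training(
        train_batch_size=32_768,  # steps collected per iteration
        minibatch_size=4_096,     # SGD minibatch (sgd_minibatch_size on older versions)
    )
    # The custom attention-based encoder is plugged in via the RLModule spec (omitted here).
)

algo = config.build()
for _ in range(1000):
    algo.train()
```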

As a brief aside, I’ve been working with RLlib on various projects for a while now, but it’s been tricky to find case studies on the subtle challenges associated with solving difficult environments that require extended training runs and tricky architectural configurations. In particular, I get the sense that I should be leveraging Tune more effectively than I presently am. My goal, at the conclusion of this project, is to release an extended tutorial outlining the process by which a novel, challenging reinforcement learning task can be solved from start to finish using RLlib, including all of the small details that can seem impenetrable when first starting out. To that end, I would be very grateful to anyone who points out faulty assumptions I’m making or design flaws in my codebase, so that I can improve my workflow and share a better final product.

Problem:

My agent learns well up until 200 iterations, after which it seems to stop meaningfully learning. Mean reward stalls, and the agent makes no further improvements to its performance along any axis.

I’ve tried this environment myself and had no issue getting the maximum reward. Qualitatively, the learned policy doesn’t seem to be in a local maximum. It’s visibly making an effort to achieve the task, and its failures are due to imprecise control rather than a fundamental misunderstanding of the optimal policy. It makes use of all of the environment’s mechanics to try to achieve its goal, and appears to need only a little refinement to solve the task. As far as I can tell, the point in policy-space that it inhabits is an ideal place for a reinforcement learning agent to be, aside from the fact that it gets stuck there and does not continue improving.

Analysis and Attempts to Diagnose:

Looking at trends in the metrics, I see that the value function loss declines precipitously after the point at which learning stops, with explained_var increasing commensurately. There’s no noticeable change in policy loss throughout, and while the KL divergence loss does fall off sharply, that happens later on. The global norm of the default optimizer’s gradients increases enormously after learning stops, which seems significant, though I’m not sure what to make of it.
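For anyone who wants to rule out a gradient blow-up while reproducing this, the clipping knobs I’m aware of are below. This is a sketch, not something I have confirmed changes the outcome, and grad_clip_by only exists on the newer API stack.

```python
# Bound the update so a single bad minibatch can't produce an enormous step.
config = config.training(
    grad_clip=40.0,              # cap on gradient magnitude before the optimizer step
    grad_clip_by="global_norm",  # clip by the global norm shown in the graphs below
)
```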

In the console, I observe an error around the point at which learning stops that says:

NaN or Inf found in input tensor.

It’s repeated twice, so I think it’s showing up somewhere on the learner node rather than in any of the env runners. I’ve triple-checked the environment to ensure that it never provides observations outside of the [-1, 1] range that it should. Training continues without anything outright breaking, and I’m not completely certain this is related to the problem at hand, but I’d still like to figure out what’s causing it.
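Something like the following wrapper is what I mean by triple-checking the observation range (a sketch, assuming a standard Gymnasium env; the wrapper and helper names are mine):

```python
import numpy as np
import gymnasium as gym

def _assert_in_range(x):
    # Recursively walk dict / Repeated (list) / array observations.
    if isinstance(x, dict):
        for v in x.values():
            _assert_in_range(v)
    elif isinstance(x, (list, tuple)):
        for v in x:
            _assert_in_range(v)
    else:
        arr = np.asarray(x, dtype=np.float64)
        if arr.size:
            assert np.all(np.isfinite(arr)), "non-finite value in observation"
            assert arr.min() >= -1.0 and arr.max() <= 1.0, "observation outside [-1, 1]"

class ObsRangeCheck(gym.ObservationWrapper):
    def observation(self, obs):
        _assert_in_range(obs)
        return obs
```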

Images:

Relevant graphs

Mean episode return, capping out at 30 around epoch 200.

https://i.snipboard.io/QMjZo1.jpg

Value function loss falls sharply at this point. Not necessarily unexpected, given that the policy stops improving.

https://i.snipboard.io/v8su9N.jpg

Global norm of default optimizer’s gradients. I’ll admit that I’m not entirely certain of how relevant this is, but it increases tremendously around the time the agent stops improving.

https://i.snipboard.io/BIU9jO.jpg

Videos:

Successful manual completion of environment:

https://i.giphy.com/Hc5p0mDh7YDdl4Ys4w.webp

Agent's final policy:

https://i.giphy.com/5rvG1VfNomE61UJh2y.webp

An update:

I’ve been very curious as to what’s going wrong here, and I’ve been working on and off for the past few weeks, rerunning my experiment with a number of architectural and hyperparameter tweaks to see if I could get a better angle on the problem. I’ve figured out answers to a few of the above questions that might be helpful to future users, but unfortunately I haven’t been able to get it working. Any advice would be extremely appreciated.

I found explanations for two of the things mentioned in the original post:

I (belatedly) noticed that the original run reported instances of an infinite loss. They all occur after learning stalled (the earliest at epoch 326) and come from the KL divergence loss term, with no similar issues in the policy term. That explains the error message and means it had nothing to do with the issue at hand: most likely the policy produced a zero probability for some action at a point after learning stalled. I thought I should include this in case anyone gets a similar error message in the future.

Running several experiments, I found that the sharp drop in VF loss generally precedes the point at which the reward levels off. My intuition is that this is broadly how things are supposed to work in the general case: the value function learns which situations are good or bad in the context of the current policy, and those predictions become easier to make as the policy changes less radically. That said, there was always a massive, nearly vertical drop at some point, which had me a little wary. I ended up smoothing this out by increasing the value function clip threshold, which produced a more gradual decrease but didn’t improve learning performance.
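Concretely, the change that smoothed out the cliff was just raising the clip threshold on the value loss (sketch; parameter name as in PPOConfig.training()):

```python
# Effectively disable value-loss clipping so the VF loss can fall gradually.
config = config.training(vf_clip_param=float("inf"))
```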

Further work:

Following on from this, I ran a number of experiments. Concluding that the architecture of the critic was impeding its ability to model the relationship between projectiles and targets, I generated a large number of rollouts with the best final policy I had on hand and tested several architectures’ ability to minimize the value loss under that fixed policy, as a proxy for how well each architecture can model the dynamics of the environment. The best results I achieved came from reengineering the encoder to apply the attention layer twice recursively and reducing the embedding size significantly to alleviate overfitting.
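The architecture comparison was essentially supervised regression of each candidate critic onto discounted returns from the frozen policy’s rollouts, scored by explained variance. A rough sketch of that procedure is below; the data loading and the candidate models themselves are illustrative rather than my exact code.

```python
import torch
import torch.nn as nn

def fit_value_head(model: nn.Module, obs: torch.Tensor, returns: torch.Tensor,
                   epochs: int = 50, lr: float = 3e-4) -> float:
    """Fit a candidate critic to discounted returns and score it by explained variance."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):  # full-batch updates, for brevity
        opt.zero_grad()
        loss = loss_fn(model(obs).squeeze(-1), returns)
        loss.backward()
        opt.step()
    with torch.no_grad():  # explained variance of the fitted critic
        pred = model(obs).squeeze(-1)
        ev = 1.0 - (returns - pred).var() / returns.var()
    return ev.item()
```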

Most recently, I ran an experiment with the learning rate turned up significantly and ended up with a relatively successful (87% explained variance) but unstable critic, and a policy that, while about as effective as any of the others I trained, consistently wasted its projectiles on shots that had no chance of hitting their targets.

I feel like I’m missing something fundamental here. It’s definitely possible to consistently achieve a perfect score of 50, yet every policy I train has a return curve that flattens into a horizontal asymptote in the neighborhood of 30 ± 5.

Question:

Below are the key details of my best run so far, using the architecture described above. The encoder consisted of one transformer encoder layer of dimension 16, applied twice recursively over embeddings of each object (the agent’s ship, each projectile, and each target). The heads each consisted of a single layer of 64 hidden units.
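In PyTorch terms, the encoder looks roughly like the following. The dimensions are the ones described above; the pooling choice, the per-object projection, and the head output sizes are my shorthand rather than exact code.

```python
import torch
import torch.nn as nn

class RecursiveAttentionEncoder(nn.Module):
    """One 16-wide TransformerEncoderLayer applied twice over per-object embeddings."""

    def __init__(self, obj_feat_dim: int = 7, embed_dim: int = 16):
        super().__init__()
        # In practice each object type (ship / projectile / target) gets its own
        # projection into the shared embedding width; a single one is shown here.
        self.embed = nn.Linear(obj_feat_dim, embed_dim)
        self.attn = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=2, dim_feedforward=64, batch_first=True
        )

    def forward(self, objects: torch.Tensor) -> torch.Tensor:
        # objects: [batch, num_objects, obj_feat_dim]
        x = self.embed(objects)
        x = self.attn(self.attn(x))  # same layer applied twice, recursively
        return x.mean(dim=1)         # pooled embedding consumed by the heads

# Heads: a single hidden layer of 64 units each (output sizes are illustrative).
policy_head = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 6))
value_head = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
```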

Hyperparameters were as follows:

lr = 1e-6
gamma = 0.999
vf_clip_param = float("inf")
batch_size = 32768
minibatch_size = 4096

The observation space, just in case, looks like this:

Agent: [X, Y] position, [X, Y] velocity, [X, Y] of angle's unit vector, [projectiles_left / max]
    
Targets: Repeated(5) x ([X, Y] position) 
    
Projectiles: Repeated(5) x ([X, Y] position, [X, Y] velocity, remaining_fuel / max)
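In Gymnasium/RLlib terms, that space reads roughly as follows. Repeated is RLlib’s variable-length space; the key names and dtype are mine, and the bounds reflect the [-1, 1] normalization mentioned above.

```python
import numpy as np
from gymnasium.spaces import Box, Dict
from ray.rllib.utils.spaces.repeated import Repeated

observation_space = Dict({
    # [x, y] position, [x, y] velocity, [x, y] heading unit vector, projectiles_left / max
    "agent": Box(-1.0, 1.0, shape=(7,), dtype=np.float32),
    # up to 5 targets: [x, y] position each
    "targets": Repeated(Box(-1.0, 1.0, shape=(2,), dtype=np.float32), max_len=5),
    # up to 5 projectiles: [x, y] position, [x, y] velocity, remaining_fuel / max
    "projectiles": Repeated(Box(-1.0, 1.0, shape=(5,), dtype=np.float32), max_len=5),
})
```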

Are there any fundamental mistakes I’m making here that would result in the asymptotic behavior observed? I’ve tried everything I can think of, and I consistently observe the same pattern.

As a (belated) conclusion, I was able to get the training to a reasonable success rate through the following:

  • First, I adjusted the learning rate to drop by an order of magnitude once the reward stabilized.
  • Second, I implemented some basic reward shaping, in the form of a +5 bonus when all targets have been hit. I hadn’t wanted to use any reward shaping initially, but this bonus doesn’t impose any assumptions on how the problem should be solved; it only underscores the importance of solving it in its entirety. (A rough sketch of both changes follows this list.)
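A sketch of both changes is below. The cutover timestep and the info key are placeholders, and older RLlib versions spell the schedule as a separate lr_schedule argument rather than a list passed to lr.

```python
import gymnasium as gym

# 1) Learning-rate drop by 10x once the reward had stabilized. Newer RLlib accepts
#    a [[timestep, value], ...] schedule for lr; the cutover step here is illustrative.
config = config.training(lr=[[0, 1e-6], [6_500_000, 1e-7]])

# 2) Terminal +5 bonus once every target has been hit.
class AllTargetsHitBonus(gym.Wrapper):
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated and info.get("targets_remaining", 1) == 0:  # info key is hypothetical
            reward += 5.0
        return obs, reward, terminated, truncated, info
```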

I hope this information helps anyone who might run into this post through a search engine after facing the same issues.