Hi everyone,
I’m currently trying to get a PPO agent to work with a custom environment, specifically based on:
I’ve managed to run the training process successfully, but I run into an error as soon as I try to evaluate my agent. I’ve already spent a few hours debugging, so I hope I can provide enough useful information to help pin down the cause.
My script crashes due to the following exception being raised:
ValueError: The two structures don't have the same nested structure.
First structure: type=list str=[3, 0, 2, 0, 0]
Second structure: type=MultiDiscrete str=MultiDiscrete([ 4 16 16 1 1])
Here is the complete stack trace:
I’m working with Ray 2.36.0 and the PPO algorithm with the new API stack enabled. I’m not using any custom model or connectors, so everything goes through the default stack. My environment exposes a MultiDiscrete([4, 16, 16, 1, 1]) action space; the observation space does not appear to be relevant to this issue.
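For reference, here is a minimal sketch of the kind of setup I’m using. The environment below (DummyMultiDiscreteEnv) is just a placeholder I put together for this post, not my actual environment; the only thing that matters here is the action space, and the config mirrors mine (new API stack, default model and connectors, periodic evaluation):

```python
import gymnasium as gym
import numpy as np

from ray.rllib.algorithms.ppo import PPOConfig


class DummyMultiDiscreteEnv(gym.Env):
    """Placeholder env exposing the same action space as my real one."""

    def __init__(self, config=None):
        super().__init__()
        self.action_space = gym.spaces.MultiDiscrete([4, 16, 16, 1, 1])
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32)
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._t += 1
        # Dummy transition: random obs, zero reward, short episodes.
        return self.observation_space.sample(), 0.0, self._t >= 10, False, {}


config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment(DummyMultiDiscreteEnv)
    .evaluation(evaluation_interval=1)
)

algo = config.build()
algo.train()      # training works fine
algo.evaluate()   # with my real env, this is where the ValueError is raised
```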
Since the issue does not arise during training, I compared what happens during training vs. evaluation, looking for any discrepancy that might reveal the underlying cause.
This showed that the culprit lies in the action sampling performed by the default module-to-env connector pipeline. Specifically, the discrepancy arises when the GetActions connector is called.
During training, the connector samples the actions correctly, returning torch.Tensor instances, e.g. tensor([0, 3, 2, 0, 0]), which cause no issues further down the connector pipeline, since the TensorToNumpy connector correctly converts them to numpy.ndarray.
However, during evaluation, the sampled actions come back as a plain Python list of singleton tensors, e.g. [tensor(0), tensor(3), tensor(2), tensor(0), tensor(0)], which is then converted into a plain Python list of values; this is what triggers the error above once the NormalizeAndClipActions connector further down the pipeline processes it.
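To double-check why the list form is a problem, I reproduced the structural mismatch with dm-tree directly. This is only a rough analogue of the nested-structure check that fails inside the pipeline, not the exact RLlib call:

```python
import numpy as np
import tree  # dm-tree, which RLlib uses for nested-structure handling
from gymnasium.spaces import MultiDiscrete

space = MultiDiscrete([4, 16, 16, 1, 1])

# Training path: after TensorToNumpy I end up with one flat array, i.e. a single leaf.
good_action = np.array([0, 3, 2, 0, 0])
tree.map_structure(lambda a, s: a, good_action, space)  # fine: leaf vs. leaf

# Evaluation path: a plain Python list of scalars, i.e. a structure with 5 leaves.
bad_action = [3, 0, 2, 0, 0]
tree.map_structure(lambda a, s: a, bad_action, space)
# ValueError: The two structures don't have the same nested structure.
# First structure: type=list ...  Second structure: type=MultiDiscrete ...
```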
The reason for this discrepancy is the different value of the explore argument in the calls to the connector: during training we have explore=True, while during evaluation we have explore=False. This difference leads to a subtle change in how the connector behaves.
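As a sanity check (not as a real fix), I believe the stochastic path can be forced during evaluation as well by overriding explore in the evaluation config, e.g. with something like the snippet below; of course that only sidesteps the deterministic branch rather than explaining it:

```python
from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

config = config.evaluation(
    evaluation_interval=1,
    # Keep explore=True during evaluation, so the GetActions connector
    # should follow the same code path as during training.
    evaluation_config=AlgorithmConfig.overrides(explore=True),
)
```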
In both cases, the GetActions connector creates an ActionDistribution, specifically a TorchMultiCategorical, and uses it to sample the actions. However, when running evaluation, i.e. explore=False, the connector first converts that distribution into a deterministic one via the following code:
if not explore:
action_dist = action_dist.to_deterministic()
After this is executed, action_dist is an instance of TorchMultiDistribution, which returns its samples in the problematic format described above.
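If my reading of the code is right, the same difference should be visible outside of the connector by building the distribution by hand from dummy logits. I'm going off the TorchMultiCategorical.from_logits signature as I understand it from the Ray 2.36 sources, so the exact call may need adjusting:

```python
import torch
from ray.rllib.models.torch.torch_distributions import TorchMultiCategorical

# Dummy batched logits for MultiDiscrete([4, 16, 16, 1, 1]): 4+16+16+1+1 = 38 logits.
logits = torch.randn(1, 38)
dist = TorchMultiCategorical.from_logits(logits, input_lens=[4, 16, 16, 1, 1])

print(type(dist.sample()))    # a single stacked torch.Tensor, like during training

det_dist = dist.to_deterministic()
print(type(det_dist))         # TorchMultiDistribution
print(det_dist.sample())      # a list of per-component tensors, like during evaluation
```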
This is quite frustrating, because it doesn’t feel like something I have control over: it looks like the intended behaviour of the library code involved. On the other hand, if that is the case, I suppose there must be something I’m missing and/or doing wrong.
Thanks in advance for any suggestions and kind help!