Hi everyone,
I’m currently trying to get a PPO agent to work with a custom environment, specifically based on:
I’ve managed to run the training process successfully, but I run into an error as soon as I try to evaluate my agent. I’ve already spent a few hours debugging, so I hope I can provide enough useful information to help pin down the cause.
My script crashes due to the following exception being raised:
ValueError: The two structures don't have the same nested structure.
First structure: type=list str=[3, 0, 2, 0, 0]
Second structure: type=MultiDiscrete str=MultiDiscrete([ 4 16 16 1 1])
Here is the complete stack trace:
I’m working with Ray 2.36.0 and the PPO algorithm with the new API stack enabled. I’m not using any custom model or connectors, so everything goes through the default stack. My environment exposes a MultiDiscrete([4, 16, 16, 1, 1]) action space; the observation space does not appear to be relevant to this issue.
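For reference, here is a minimal sketch of the kind of setup I’m using. The environment below (DummyMultiDiscreteEnv) is just a placeholder I put together for this post, not my actual environment; the only thing that matters here is the action space, and the config mirrors mine (new API stack, default model and connectors, periodic evaluation):

```python
import gymnasium as gym
import numpy as np

from ray.rllib.algorithms.ppo import PPOConfig


class DummyMultiDiscreteEnv(gym.Env):
    """Placeholder env exposing the same action space as my real one."""

    def __init__(self, config=None):
        super().__init__()
        self.action_space = gym.spaces.MultiDiscrete([4, 16, 16, 1, 1])
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32)
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._t += 1
        # Dummy transition: random obs, zero reward, short episodes.
        return self.observation_space.sample(), 0.0, self._t >= 10, False, {}


config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment(DummyMultiDiscreteEnv)
    .evaluation(evaluation_interval=1)
)

algo = config.build()
algo.train()      # training works fine
algo.evaluate()   # with my real env, this is where the ValueError is raised
```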
Since the issue does not arise during training, I compared what happens during training vs. evaluation, looking for any discrepancy that might reveal the underlying cause.
This showed that the culprit lies in the action sampling performed by the default module-to-env connector pipeline. Specifically, the discrepancy arises when the GetActions connector is called.
During training, the connector samples the actions correctly, returning torch.Tensor instances, e.g. tensor([0, 3, 2, 0, 0]), which cause no issues further down the connector pipeline, since the TensorToNumpy connector correctly converts them to numpy.ndarray.
However, during evaluation, the sampled actions come back as a plain Python list of singleton tensors, e.g. [tensor(0), tensor(3), tensor(2), tensor(0), tensor(0)], which is then converted into a plain Python list of values; this is what triggers the error above once the NormalizeAndClipActions connector further down the pipeline processes it.
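To double-check why the list form is a problem, I reproduced the structural mismatch with dm-tree directly. This is only a rough analogue of the nested-structure check that fails inside the pipeline, not the exact RLlib call:

```python
import numpy as np
import tree  # dm-tree, which RLlib uses for nested-structure handling
from gymnasium.spaces import MultiDiscrete

space = MultiDiscrete([4, 16, 16, 1, 1])

# Training path: after TensorToNumpy I end up with one flat array, i.e. a single leaf.
good_action = np.array([0, 3, 2, 0, 0])
tree.map_structure(lambda a, s: a, good_action, space)  # fine: leaf vs. leaf

# Evaluation path: a plain Python list of scalars, i.e. a structure with 5 leaves.
bad_action = [3, 0, 2, 0, 0]
tree.map_structure(lambda a, s: a, bad_action, space)
# ValueError: The two structures don't have the same nested structure.
# First structure: type=list ...  Second structure: type=MultiDiscrete ...
```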
The reason for this discrepancy is the different value of the explore argument in the calls to the connector: during training we have explore=True, while during evaluation we have explore=False. This difference leads to a subtle change in how the connector behaves.
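As a sanity check (not as a real fix), I believe the stochastic path can be forced during evaluation as well by overriding explore in the evaluation config, e.g. with something like the snippet below; of course that only sidesteps the deterministic branch rather than explaining it:

```python
from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

config = config.evaluation(
    evaluation_interval=1,
    # Keep explore=True during evaluation, so the GetActions connector
    # should follow the same code path as during training.
    evaluation_config=AlgorithmConfig.overrides(explore=True),
)
```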
In both cases, the GetActions connector creates an ActionDistribution, specifically a TorchMultiCategorical, and uses it to sample the actions. However, when running evaluation, i.e. explore=False, the connector first converts that distribution into a deterministic one via the following code:
if not explore:
action_dist = action_dist.to_deterministic()
After this is executed, action_dist is an instance of TorchMultiDistribution, which returns its samples in the problematic format described above.
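If my reading of the code is right, the same difference should be visible outside of the connector by building the distribution by hand from dummy logits. I'm going off the TorchMultiCategorical.from_logits signature as I understand it from the Ray 2.36 sources, so the exact call may need adjusting:

```python
import torch
from ray.rllib.models.torch.torch_distributions import TorchMultiCategorical

# Dummy batched logits for MultiDiscrete([4, 16, 16, 1, 1]): 4+16+16+1+1 = 38 logits.
logits = torch.randn(1, 38)
dist = TorchMultiCategorical.from_logits(logits, input_lens=[4, 16, 16, 1, 1])

print(type(dist.sample()))    # a single stacked torch.Tensor, like during training

det_dist = dist.to_deterministic()
print(type(det_dist))         # TorchMultiDistribution
print(det_dist.sample())      # a list of per-component tensors, like during evaluation
```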
This is quite frustrating, because it doesn’t feel like something I have control over: it looks like the intended behaviour of the library code involved. On the other hand, if that is the case, I suppose there must be something I’m missing and/or doing wrong.
Thanks in advance for any suggestions and kind help!