Your approach of using a Dict action space with both a discrete (“target”) and continuous (“commit”) component is correct in principle, and RLlib supports this setup. However, your model and action distribution handling may be causing the learning issues you observe (e.g., explained_var dropping to -1, high entropy, agent stalling).
Key points from your code and RLlib requirements:
- For a Dict action space, your model’s forward output must concatenate logits for the discrete action and parameters (mean, log_std) for the continuous action, in the correct order and shape.
- For the continuous “commit” action, clamp the log_std to a reasonable range (e.g., [-5, 2]). A very negative log_std gives a near-zero std, which makes log-probabilities blow up and can produce NaNs, while a very large log_std inflates the entropy. Unbounded log_std is a common cause of the symptoms you describe (PPO nan in actor logits RLlib, GitHub gist example).
- The action mask should only be applied to the discrete logits, not to the continuous parameters.
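- Make sure your model output matches the layout the action distribution expects. The code below concatenates the continuous parameters first and the masked discrete logits second, which matches a Dict space flattened in sorted key order (“commit” before “target”); confirm this against your own action space:
[commit_mean, commit_log_std, target_logits...].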
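To make the log_std point above concrete, here is a minimal sketch using only the closed-form expressions for a 1-D Gaussian. The helper names (`clamp`, `gaussian_entropy`, `log_prob_at_mean`) and the sample values are illustrative assumptions, not part of RLlib or of your code.

```python
import math

# Range suggested above; treat the bounds as a starting point, not a rule.
LOG_STD_MIN, LOG_STD_MAX = -5.0, 2.0


def clamp(x, lo, hi):
    """Scalar equivalent of torch.clamp for this illustration."""
    return max(lo, min(hi, x))


def gaussian_entropy(log_std):
    """Differential entropy of a 1-D Gaussian: 0.5 * log(2*pi*e) + log_std."""
    return 0.5 * math.log(2.0 * math.pi * math.e) + log_std


def log_prob_at_mean(log_std):
    """Log-density evaluated at the mean: -log_std - 0.5 * log(2*pi)."""
    return -log_std - 0.5 * math.log(2.0 * math.pi)


# Hypothetical raw head outputs: far too small, neutral, far too large.
for raw in (-20.0, 0.0, 10.0):
    bounded = clamp(raw, LOG_STD_MIN, LOG_STD_MAX)
    print(
        f"raw log_std={raw:6.1f}  entropy={gaussian_entropy(raw):7.2f}  "
        f"logp@mean={log_prob_at_mean(raw):7.2f}  clamped={bounded:5.1f}"
    )
```

Entropy grows linearly with log_std and the log-density at the mean grows linearly as log_std falls, so an unbounded head output can push either term arbitrarily far. Clamping keeps both within a fixed range.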
A correct approach is shown in your gist and in RLlib examples, but you must clamp the log_std and ensure the action mask is only applied to the discrete part:
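    # Inside the model's forward(), after computing `features`
    logits = self.policy_head(features)
    # Split the head output: discrete logits first, then mean and log_std
    target_logits = logits[..., :self.target_dim]
    commit_mu = logits[..., self.target_dim]
    commit_log_std = torch.clamp(logits[..., self.target_dim + 1], -5, 2)  # clamp log_std
    commit_params = torch.stack([commit_mu, commit_log_std], dim=-1)
    # Mask only the discrete logits: log(0) = -inf, clamped to a large finite negative
    inf_mask = torch.clamp(torch.log(action_mask), min=-1e20)
    masked_target_logits = target_logits + inf_mask
    # Continuous parameters first, then the masked discrete logits
    masked_logits = torch.cat([commit_params, masked_target_logits], dim=-1)
    self._value_out = self.value_head(features).squeeze(-1)
    return masked_logits, state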
See gist.github.com example for reference.
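If you want to sanity-check the masking and clamping outside of RLlib, the following is a rough standalone sketch in plain PyTorch. `HybridHead`, `feature_dim`, and the toy mask are hypothetical names chosen for illustration; the class only mirrors the fragment above and is not an RLlib model or your actual network.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal


class HybridHead(nn.Module):
    """Hypothetical stand-in for the policy/value heads (not an RLlib class)."""

    def __init__(self, feature_dim: int, target_dim: int):
        super().__init__()
        self.target_dim = target_dim
        # target_dim discrete logits + commit mean + commit log_std
        self.policy_head = nn.Linear(feature_dim, target_dim + 2)
        self.value_head = nn.Linear(feature_dim, 1)

    def forward(self, features, action_mask):
        logits = self.policy_head(features)
        target_logits = logits[..., : self.target_dim]
        commit_mu = logits[..., self.target_dim]
        commit_log_std = torch.clamp(logits[..., self.target_dim + 1], -5.0, 2.0)
        # log(0) = -inf for invalid actions; clamp to a large finite negative
        inf_mask = torch.clamp(torch.log(action_mask), min=-1e20)
        masked_target_logits = target_logits + inf_mask
        value = self.value_head(features).squeeze(-1)
        return commit_mu, commit_log_std, masked_target_logits, value


torch.manual_seed(0)
head = HybridHead(feature_dim=8, target_dim=4)
features = torch.randn(3, 8)
# Toy mask: 1.0 marks a valid target, 0.0 an invalid one
action_mask = torch.tensor(
    [[1.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
)
mu, log_std, target_logits, value = head(features, action_mask)

target_dist = Categorical(logits=target_logits)
commit_dist = Normal(mu, log_std.exp())
target = target_dist.sample()
commit = commit_dist.sample()
# The two components are sampled independently, so their log-probs add
logp = target_dist.log_prob(target) + commit_dist.log_prob(commit)

print("masked probs:", target_dist.probs[action_mask == 0])
print("log_std range:", log_std.min().item(), log_std.max().item())
```

Masked targets should come out with zero probability, and the joint log-probability should stay finite for every sample. If either check fails in your real model, the mask or the clamp is likely being applied to the wrong slice of the output.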
Would you like a step-by-step breakdown of how to structure your model and action distributions for this setup?