Continuous + discrete action space and custom model

Hello, I am trying to build a model that predicts a target (discrete action space) and how many units to commit (continuous). Earlier, for simplicity, I used a discrete action space to choose how many units to commit (0% of units, 10%, 20% … 100%), but that could harm performance. I then tried to predict the continuous action using a mean and std distribution (though I haven’t found a clear example); however, my agent stopped learning altogether. I use PPO: explained_var started dropping steadily to -1, the agent stopped acting at all (stalling the whole game), and entropy was sky-high at around 3.4. I can’t really diagnose this…

This is my attempt, can anybody say whether it is correct or not?

My action space is

self.action_spaces = {
    agent: spaces.Dict(
        {
            "target": spaces.Discrete(self.game.n_players + 1),
            "commit": spaces.Box(
                low=0.0, high=1.0, shape=(1,), dtype=np.float32
            ),
        }
    )
    for agent in self.possible_agents
}

Could the model be the issue? Thanks a lot in advance!

Your approach of using a Dict action space with both a discrete (“target”) and continuous (“commit”) component is correct in principle, and RLlib supports this setup. However, your model and action distribution handling may be causing the learning issues you observe (e.g., explained_var dropping to -1, high entropy, agent stalling).

Key points from your code and RLlib requirements:

  • For a Dict action space, your model’s forward output must concatenate logits for the discrete action and parameters (mean, log_std) for the continuous action, in the correct order and shape.
  • For the continuous “commit” action, you must ensure the log_std is clamped to a reasonable range (e.g., [-5, 2]). A very negative log_std gives a near-zero std, which causes numerical instability and NaNs, while a very large log_std inflates the entropy. Both are common causes of the symptoms you describe (PPO nan in actor logits RLlib, github gist example).
  • The action mask should only be applied to the discrete logits, not to the continuous parameters.
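  • Make sure your model output matches the layout RLlib expects. The snippet below emits [commit_mean, commit_log_std, target_logits...], on the assumption that the Dict space is flattened in sorted key order (“commit” before “target”); verify this against your RLlib version.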
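To illustrate why the log_std clamp matters for the entropy symptom, here is a minimal sketch using only the closed-form entropy of a one-dimensional Gaussian, 0.5*ln(2*pi*e) + log_std, plus the entropy of a uniform categorical over the targets. The helper names and the choice of 5 targets are assumptions for illustration, not RLlib code, and the sketch assumes the two components are independent so that their entropies add.

```python
import math

def gaussian_entropy(log_std: float) -> float:
    """Differential entropy (in nats) of a 1-D Gaussian: 0.5*ln(2*pi*e) + log_std."""
    return 0.5 * math.log(2.0 * math.pi * math.e) + log_std

def clamp(x: float, lo: float, hi: float) -> float:
    """Scalar stand-in for torch.clamp."""
    return max(lo, min(hi, x))

def uniform_categorical_entropy(n: int) -> float:
    """Entropy (in nats) of a categorical that is uniform over n choices."""
    return math.log(n)

N_TARGETS = 5  # hypothetical value of n_players + 1

for raw_log_std in (-20.0, 0.0, 2.0, 10.0):
    clamped_log_std = clamp(raw_log_std, -5.0, 2.0)
    # Entropies of independent components add up.
    total_raw = uniform_categorical_entropy(N_TARGETS) + gaussian_entropy(raw_log_std)
    total_clamped = uniform_categorical_entropy(N_TARGETS) + gaussian_entropy(clamped_log_std)
    print(
        f"raw log_std={raw_log_std:6.1f}  "
        f"entropy(raw)={total_raw:7.3f}  entropy(clamped)={total_clamped:7.3f}"
    )
```

Because the Gaussian term is linear in log_std, an unclamped head can push the entropy arbitrarily far in either direction. With the range [-5, 2], the Gaussian term stays between roughly -3.58 and 3.42 nats.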

A correct approach is shown in your gist and in RLlib examples, but you must clamp the log_std and ensure the action mask is only applied to the discrete part:

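logits = self.policy_head(features)
# Split the head output: target logits first, then the two commit parameters.
target_logits = logits[..., :self.target_dim]
commit_mu = logits[..., self.target_dim]
commit_log_std = torch.clamp(logits[..., self.target_dim + 1], -5, 2)  # Clamp log_std
commit_params = torch.stack([commit_mu, commit_log_std], dim=-1)
# log(1) = 0 leaves valid targets unchanged; log(0) = -inf is clamped to a large negative number.
inf_mask = torch.clamp(torch.log(action_mask), min=-1e20)
masked_target_logits = target_logits + inf_mask  # Mask only the discrete logits
# Final layout: [commit_mu, commit_log_std, target_logits...]
masked_logits = torch.cat([commit_params, masked_target_logits], dim=-1)
self._value_out = self.value_head(features).squeeze(-1)
return masked_logits, state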
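
The fragment above can be tried outside a model class. The following standalone sketch assumes PyTorch and uses a hypothetical helper, build_dist_inputs, with random tensors standing in for the policy head output; it is not RLlib's API, only a way to check the shapes, the clamp, and the masking in isolation. It uses the smallest finite float of the tensor dtype as the mask floor, which is a swapped-in choice rather than the -1e20 used above.

```python
import torch

def build_dist_inputs(head_out, action_mask, target_dim,
                      log_std_min=-5.0, log_std_max=2.0):
    """Hypothetical helper: returns [commit_mu, commit_log_std, masked target logits]."""
    target_logits = head_out[..., :target_dim]
    commit_mu = head_out[..., target_dim]
    commit_log_std = torch.clamp(head_out[..., target_dim + 1], log_std_min, log_std_max)
    commit_params = torch.stack([commit_mu, commit_log_std], dim=-1)
    # log(1) = 0 keeps valid logits; log(0) = -inf is raised to the smallest finite float.
    floor = torch.finfo(head_out.dtype).min
    inf_mask = torch.clamp(torch.log(action_mask), min=floor)
    masked_target_logits = target_logits + inf_mask  # mask touches discrete logits only
    return torch.cat([commit_params, masked_target_logits], dim=-1)

torch.manual_seed(0)
target_dim = 4                               # assumed n_players + 1
head_out = torch.randn(2, target_dim + 2)    # stand-in for policy_head(features)
head_out[0, target_dim + 1] = 50.0           # deliberately extreme raw log_std
action_mask = torch.tensor([[1.0, 0.0, 1.0, 1.0],
                            [0.0, 1.0, 1.0, 0.0]])

out = build_dist_inputs(head_out, action_mask, target_dim)
probs = torch.softmax(out[..., 2:], dim=-1)  # probabilities over targets
print(out.shape)
print(probs)
```

Masked targets end up with zero probability after the softmax, while the commit mean passes through untouched and the extreme log_std is pulled back to the upper bound.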

See gist.github.com example for reference.

Would you like a step-by-step breakdown of how to structure your model and action distributions for this setup?


Forgot to say that I predict a value between 0 and 1: what percentage of units to commit. Doesn’t clamping that mean to [0, 1] inside RLlib break the gradients? Or should I clamp explicitly inside the model?