Continuous + discrete action space and custom model

unleex · June 22, 2026, 7:31pm

Hello, I am trying to build a model that predicts a target (discrete action space) and how many units to commit (continuous). Earlier, for simplicity, I used a discrete action space to choose how many units to commit (0% of units, 10%, 20% … 100%), but that could harm performance. I tried to predict continuous actions using a mean and std distribution (though I haven’t found a clear example); however, my agent stopped learning entirely. I use PPO: explained_var started to drop steadily toward -1, the agent stopped acting at all (stalling the whole game), and entropy was sky-high at 3.4. I can’t really diagnose this…

This is my attempt, can anybody say whether it is correct or not?

gist.github.com

https://gist.github.com/unleex/fd5c908038d72e14563e5da42ddcf000

gistfile1.txt


class ActionMaskModel(TorchModelV2, nn.Module, BasePlayer):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name
        )
        nn.Module.__init__(self)

        original_space = getattr(obs_space, "original_space", obs_space)
        self.target_dim = int(original_space["action_mask"].shape[0])

This file has been truncated.

My action space is

self.action_spaces = {
    agent: spaces.Dict(
        {
            "target": spaces.Discrete(self.game.n_players + 1),
            "commit": spaces.Box(
                low=0.0, high=1.0, shape=(1,), dtype=np.float32
            ),
        }
    )
    for agent in self.possible_agents
}

Could the model be the issue? Thanks a lot in advance!

RunLLM · June 22, 2026, 7:32pm

Your approach of using a Dict action space with both a discrete (“target”) and continuous (“commit”) component is correct in principle, and RLlib supports this setup. However, your model and action distribution handling may be causing the learning issues you observe (e.g., explained_var dropping to -1, high entropy, agent stalling).

Key points from your code and RLlib requirements:

For a Dict action space, your model’s forward output must concatenate logits for the discrete action and parameters (mean, log_std) for the continuous action, in the correct order and shape.
For the continuous “commit” action, clamp log_std to a reasonable range (e.g., [-5, 2]). A very negative log_std gives a near-zero std and numerical instability (NaNs), while a very large log_std inflates the entropy. An unclamped log_std is a common cause of the symptoms you describe (PPO nan in actor logits RLlib, github gist example).
The action mask should only be applied to the discrete logits, not to the continuous parameters.
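Make sure your model output matches the order in which RLlib splits it. As far as I know, Dict action spaces are flattened in sorted key order, so “commit” comes before “target”: [commit_mean, commit_log_std, discrete_logits...], which is the order the code below uses.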
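To make the log_std point concrete, here is a minimal sketch in plain Python (no RLlib or torch). It assumes the “commit” component is modeled as a one-dimensional Gaussian, whose differential entropy is 0.5 * ln(2πe) + log_std. The helper names are made up for illustration.

```python
import math

LOG_STD_MIN, LOG_STD_MAX = -5.0, 2.0  # clamp range suggested above


def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def gaussian_entropy(log_std):
    """Differential entropy (in nats) of a 1-D Gaussian with this log std."""
    return 0.5 * math.log(2.0 * math.pi * math.e) + log_std


# Raw head outputs outside the range are pulled back to the bounds,
# so the entropy of the continuous part stays within a fixed band.
for raw in (-20.0, 0.0, 2.0, 10.0):
    clamped = clamp(raw, LOG_STD_MIN, LOG_STD_MAX)
    print(f"raw={raw:6.1f}  clamped={clamped:5.1f}  "
          f"entropy={gaussian_entropy(clamped):7.3f}")
```

With the upper bound at 2, the Gaussian part contributes at most about 3.42 nats. An entropy reading near that value is consistent with log_std sitting close to its upper bound, although the discrete component adds its own entropy on top, so this is only a hint and not a diagnosis.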

A correct approach is shown in your gist and in RLlib examples, but you must clamp the log_std and ensure the action mask is only applied to the discrete part:

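# Inside forward(); `features` (encoder output) and `action_mask` (from the observation) are assumed to exist.
logits = self.policy_head(features)
target_logits = logits[..., :self.target_dim]
commit_mu = logits[..., self.target_dim]
# Clamp log_std so the std can neither collapse nor explode.
commit_log_std = torch.clamp(logits[..., self.target_dim + 1], -5, 2)
commit_params = torch.stack([commit_mu, commit_log_std], dim=-1)
# log(1) = 0 keeps valid targets; log(0) = -inf is floored to a large negative number.
inf_mask = torch.clamp(torch.log(action_mask), min=-1e20)
masked_target_logits = target_logits + inf_mask  # mask only the discrete part
# Output layout: [commit_mean, commit_log_std, target logits...]
masked_logits = torch.cat([commit_params, masked_target_logits], dim=-1)
self._value_out = self.value_head(features).squeeze(-1)
return masked_logits, state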

See gist.github.com example for reference.
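The masking step in the snippet above can be checked in isolation. This is a minimal sketch in plain Python, not the gist's code: it applies the same log(mask) trick with a -1e20 floor to a made-up set of target logits and confirms that a masked target ends up with zero probability after the softmax. The function names and example values are hypothetical.

```python
import math

FLOAT_MIN = -1e20  # same floor the snippet uses for masked entries


def mask_logits(logits, action_mask):
    """Add log(mask) to each logit, flooring log(0) at FLOAT_MIN."""
    return [
        logit + (math.log(m) if m > 0 else FLOAT_MIN)
        for logit, m in zip(logits, action_mask)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


target_logits = [0.3, -1.2, 2.0, 0.5]  # illustrative values only
action_mask = [1, 0, 1, 1]             # target 1 is not allowed

probs = softmax(mask_logits(target_logits, action_mask))
print(probs)
```

Only the discrete logits pass through `mask_logits`; the mean and log_std of the continuous part are left alone, which matches the advice that the mask applies to the discrete component only.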

Would you like a step-by-step breakdown of how to structure your model and action distributions for this setup?

Sources:

github gist example


unleex · June 24, 2026, 5:08pm

Forgot to say that I predict a value between 0 and 1: what percentage of units to commit. Doesn’t clamping that to [0, 1] inside RLlib break the gradients? Or should I clamp explicitly inside the model?
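The gradient concern can be illustrated without RLlib. Below is a minimal sketch in plain Python, assuming the clamp were applied inside the differentiable path (where RLlib actually applies clipping is not established in this thread). It compares the numerical slope of a hard clip to [0, 1] with that of a sigmoid squash. All function names are hypothetical.

```python
import math


def clip01(x):
    """Hard clamp to [0, 1]."""
    return max(0.0, min(1.0, x))


def sigmoid(x):
    """Smooth squash to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def numeric_grad(fn, x, eps=1e-6):
    """Central finite-difference estimate of d fn / d x."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


# Outside [0, 1] the clip is flat (slope 0); the sigmoid keeps a
# small but nonzero slope everywhere.
for x in (-2.0, 0.5, 3.0):
    print(f"x={x:5.1f}  d(clip)={numeric_grad(clip01, x):.4f}  "
          f"d(sigmoid)={numeric_grad(sigmoid, x):.4f}")
```

Whether this matters depends on where the clip sits. If it lies between the model output and the loss, samples outside the range pass no gradient back through it. If clipping happens only when the action is handed to the environment, the loss never differentiates through it.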
