Continuous + discrete action space and custom model

RunLLM · June 22, 2026, 7:32pm

Your approach of using a Dict action space with both a discrete (“target”) and continuous (“commit”) component is correct in principle, and RLlib supports this setup. However, your model and action distribution handling may be causing the learning issues you observe (e.g., explained_var dropping to -1, high entropy, agent stalling).

Key points from your code and RLlib requirements:

For a Dict action space, your model’s forward output must concatenate logits for the discrete action and parameters (mean, log_std) for the continuous action, in the correct order and shape.
For the continuous “commit” action, clamp log_std to a reasonable range (e.g., [-5, 2]). A very negative log_std gives a near-zero std, which causes numerical instability and NaNs, while a very large log_std inflates the entropy. This is a common cause of the symptoms you describe (PPO nan in actor logits RLlib, github gist example).
The action mask should only be applied to the discrete logits, not to the continuous parameters.
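Make sure your model output matches the order RLlib expects for the flattened Dict action space. The snippet below emits [commit_mean, commit_log_std, target_logits...], which follows the alphabetical key order ("commit" before "target"); confirm this ordering against your RLlib version.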
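To see why the clamp range matters: the differential entropy of a 1-D Gaussian is 0.5 * ln(2πe) + log_std, so it grows linearly with log_std, while the std itself shrinks exponentially as log_std goes negative. The following is a minimal standard-library sketch of that relationship. The helper names (gaussian_entropy, clamp) and the sample values are illustrative assumptions, not part of your code or of RLlib.

```python
import math

LOG_STD_MIN, LOG_STD_MAX = -5.0, 2.0  # range suggested in this post


def gaussian_entropy(log_std: float) -> float:
    """Differential entropy (in nats) of a 1-D Gaussian with the given log std."""
    return 0.5 * math.log(2.0 * math.pi * math.e) + log_std


def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


# Hypothetical raw network outputs for log_std, before and after clamping.
for raw in (-20.0, -5.0, 0.0, 2.0, 10.0):
    clamped = clamp(raw, LOG_STD_MIN, LOG_STD_MAX)
    print(
        f"raw={raw:6.1f}  raw_std={math.exp(raw):.3e}  "
        f"clamped_std={math.exp(clamped):.3e}  "
        f"entropy={gaussian_entropy(clamped):.3f}"
    )
```

Without the clamp, a raw value of -20 would give a std of about 2e-9, at which point log-probabilities of sampled actions become extremely large in magnitude; a raw value of 10 would add 10 nats of entropy from a single action dimension.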

A correct approach is shown in your gist and in RLlib examples, but you must clamp the log_std and ensure the action mask is only applied to the discrete part:

# Inside forward(); assumes action_mask is a float tensor of 0s and 1s
# read from the observation, e.g. input_dict["obs"]["action_mask"].
logits = self.policy_head(features)
target_logits = logits[..., :self.target_dim]
commit_mu = logits[..., self.target_dim]
commit_log_std = torch.clamp(logits[..., self.target_dim + 1], -5, 2)  # Clamp log_std
commit_params = torch.stack([commit_mu, commit_log_std], dim=-1)
# log(1) = 0 leaves valid logits unchanged; log(0) = -inf is clamped to a large negative value
inf_mask = torch.clamp(torch.log(action_mask), min=-1e20)
masked_target_logits = target_logits + inf_mask  # mask only the discrete part
masked_logits = torch.cat([commit_params, masked_target_logits], dim=-1)
self._value_out = self.value_head(features).squeeze(-1)
return masked_logits, state
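
The same steps can be checked outside of RLlib with dummy tensors. This is only a sketch under stated assumptions: build_dist_inputs is a hypothetical standalone helper (not an RLlib API), the batch size, target_dim, and mask values are made up, and the output order simply mirrors the snippet above.

```python
import torch


def build_dist_inputs(head_out, action_mask, target_dim):
    """Split the head output, mask the discrete logits, clamp log_std, re-concatenate."""
    target_logits = head_out[..., :target_dim]
    commit_mu = head_out[..., target_dim]
    commit_log_std = torch.clamp(head_out[..., target_dim + 1], -5.0, 2.0)
    # log(1) = 0 keeps a logit; log(0) = -inf is clamped to a large finite negative.
    inf_mask = torch.clamp(torch.log(action_mask), min=-1e20)
    masked_target_logits = target_logits + inf_mask
    commit_params = torch.stack([commit_mu, commit_log_std], dim=-1)
    return torch.cat([commit_params, masked_target_logits], dim=-1)


torch.manual_seed(0)
target_dim = 4  # hypothetical number of discrete targets
# Scale by 10 so that some raw log_std values fall outside [-5, 2].
head_out = torch.randn(3, target_dim + 2) * 10.0
action_mask = torch.tensor(
    [[1.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
)

out = build_dist_inputs(head_out, action_mask, target_dim)
# The first two columns are (mean, log_std); the rest are masked discrete logits.
probs = torch.softmax(out[..., 2:], dim=-1)
print(out.shape)
print(probs)
```

Masked targets end up with zero probability after the softmax, while the continuous parameters are untouched by the mask, which is the separation described above.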

See gist.github.com example for reference.

Would you like a step-by-step breakdown of how to structure your model and action distributions for this setup?

Sources:

github gist example

