How to use PPO with Dict observation space (pixels + features) in Ray 2.48.0?

Context

I’m training a PPO agent in RLlib (Ray 2.48.0) on a custom Gymnasium environment whose observation space is a Dict containing both pixels and vector features:

import gymnasium as gym
import numpy as np

obs_space = gym.spaces.Dict({
    "pixels": gym.spaces.Box(0.0, 1.0, (84, 84, 4), dtype=np.float32),
    "features": gym.spaces.Box(-1.0, 1.0, (9,), dtype=np.float32),
})

The step() method returns observations of the form:

{
    "pixels": np.zeros((84, 84, 4), np.float32),
    "features": np.zeros(9, np.float32),
}

Problem
When running PPO with this env:

from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.registry import register_env

# Minimal repro env: constant observations, zero reward, never terminates.
class DummyEnv(gym.Env):
    def __init__(self, cfg=None):
        self.observation_space = obs_space
        self.action_space = gym.spaces.Discrete(4)
    def reset(self, *, seed=None, options=None):
        return { "pixels": np.zeros((84,84,4), np.float32),
                 "features": np.zeros(9, np.float32) }, {}
    def step(self, action):
        return self.reset()[0], 0.0, False, False, {}

register_env("dummy", lambda cfg: DummyEnv())

cfg = (PPOConfig()
       .environment("dummy")
       .framework("torch"))

algo = cfg.build()

I get:

ValueError: No default encoder config for obs space=Dict('features': Box(-1.0, 1.0, (9,), float32),
'pixels': Box(0.0, 1.0, (84, 84, 4), float32)), lstm=False found.

Question

  • What is the recommended way in Ray 2.48.0 to handle such Dict spaces (CNN for "pixels" and MLP for "features", then concatenate)?

  • Do I need to manually define a custom RLModuleSpec / Catalog for this, or is there a built-in default?

  • If a manual config is required, could you provide a minimal example (Torch backend, PPO)?

System Info

  • Ray 2.48.0

  • Python 3.10

Workaround tested
Flattening the Dict works, but then "pixels" is treated as one long flat vector and the CNN processing is lost. Ideally, I’d like RLlib to auto-create a CNN branch for "pixels" and an MLP branch for "features".
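
In case it helps, this is roughly how I did the flattening, using Gymnasium’s built-in wrapper (the "dummy_flat" name is just what I registered it under):

from gymnasium.wrappers import FlattenObservation
from ray.tune.registry import register_env

# Flattens the Dict obs into a single Box of length 84*84*4 + 9 = 28233.
# PPO's default MLP encoder accepts that, but the 2D image structure of
# "pixels" is gone, so no CNN is built for it.
register_env("dummy_flat", lambda cfg: FlattenObservation(DummyEnv()))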

Short answer: as the error message says, the new API stack has no default encoder config for Dict observation spaces, so PPO will not auto-build a CNN-plus-MLP split for you; you need to supply your own RLModule (plugged in via an RLModuleSpec) that unpacks the Dict itself. The action masking example in the RLlib repo is a good reference for handling dictionary observation spaces the way the new stack expects: its custom RLModule pulls the individual keys out of the Dict observation before running its networks.

You might also want to look at the examples featuring custom encoder/RLModule architectures; together with the action masking one, they should cover the bases.
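
To make that concrete, here is a minimal sketch of such a module for the Torch backend on the new API stack. The class name, layer sizes, and conv filter settings are placeholders I picked, not anything RLlib prescribes, and the sketch assumes the DummyEnv and register_env("dummy", ...) from your snippet are already defined. It runs a small CNN over obs["pixels"], an MLP over obs["features"], concatenates the two embeddings into a shared trunk, and puts separate policy and value heads on top:

import torch
from torch import nn

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.apis.value_function_api import ValueFunctionAPI
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule
from ray.rllib.models.torch.torch_distributions import TorchCategorical
from ray.rllib.utils.annotations import override

class PixelsAndFeaturesPPOModule(TorchRLModule, ValueFunctionAPI):
    """CNN branch for obs["pixels"], MLP branch for obs["features"], then concat."""

    @override(TorchRLModule)
    def setup(self):
        pixels_space = self.observation_space["pixels"]      # Box(84, 84, 4)
        features_space = self.observation_space["features"]  # Box(9,)

        # CNN trunk for the image branch (env delivers NHWC, torch wants NCHW).
        self._cnn = nn.Sequential(
            nn.Conv2d(pixels_space.shape[-1], 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            dummy = torch.zeros(1, *pixels_space.shape).permute(0, 3, 1, 2)
            cnn_out = self._cnn(dummy).shape[1]

        # MLP trunk for the flat feature branch.
        self._mlp = nn.Sequential(nn.Linear(features_space.shape[0], 64), nn.ReLU())

        # Shared layer after concatenation, plus policy- and value-heads.
        self._trunk = nn.Sequential(nn.Linear(cnn_out + 64, 256), nn.ReLU())
        self._pi_head = nn.Linear(256, self.action_space.n)
        self._vf_head = nn.Linear(256, 1)

        # Discrete(4) actions -> categorical action distribution.
        self.action_dist_cls = TorchCategorical

    def _embed(self, batch):
        # RLlib batches Dict observations into a dict of tensors with the same keys.
        obs = batch[Columns.OBS]
        pixels = obs["pixels"].permute(0, 3, 1, 2).float()   # NHWC -> NCHW
        feats = obs["features"].float()
        return self._trunk(torch.cat([self._cnn(pixels), self._mlp(feats)], dim=-1))

    @override(TorchRLModule)
    def _forward(self, batch, **kwargs):
        # Shared forward pass for inference, exploration, and training.
        embeddings = self._embed(batch)
        return {
            Columns.EMBEDDINGS: embeddings,
            Columns.ACTION_DIST_INPUTS: self._pi_head(embeddings),
        }

    @override(ValueFunctionAPI)
    def compute_values(self, batch, embeddings=None):
        # Called by PPO's learner for value estimates / GAE; reuses the
        # embeddings from _forward when they are available.
        if embeddings is None:
            embeddings = self._embed(batch)
        return self._vf_head(embeddings).squeeze(-1)

# Plug the custom module into PPO via an RLModuleSpec ("dummy" is the env
# registered in your snippet above).
cfg = (
    PPOConfig()
    .environment("dummy")
    .framework("torch")
    .rl_module(rl_module_spec=RLModuleSpec(module_class=PixelsAndFeaturesPPOModule))
)
algo = cfg.build()

A custom Catalog (your second bullet) is another way to get there, but for a single, fixed Dict layout a self-contained RLModule like this is usually less machinery.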