Questions and Confusion: Getting started with RLlib

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi, I’m doing my best to learn RLlib. I’ve read through the documentation and quite a few of the examples, but the high level of abstraction is throwing me for a loop. Could I request some help, please?

Project Overview:
The intention is to use the new API stack + PyTorch to train a modified PPO agent in a custom single-agent Gymnasium env, with action masking.

The action space is Discrete(4672) with an observation shape of (8, 8, 111). The masking is done within the env, and the observation space is fully defined as:

observation_space = spaces.Dict(
    {
        "observation": spaces.Box(low=0, high=1, shape=(8, 8, 111), dtype=bool),
        "action_mask": spaces.Box(low=0, high=1, shape=(4672,), dtype=np.int8),
    }
)
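
For context, my env’s reset()/step() already return observation dicts of exactly that shape. As a heavily simplified sketch (the real move-generation logic is omitted, and _legal_action_ids() is a made-up placeholder for it):

import numpy as np

# Simplified sketch of how my env assembles one observation (not the real logic).
def _get_obs(self):
    board_planes = np.zeros((8, 8, 111), dtype=bool)  # filled in from the game state
    action_mask = np.zeros(4672, dtype=np.int8)       # 1 = legal, 0 = illegal
    action_mask[self._legal_action_ids()] = 1         # placeholder helper
    return {"observation": board_planes, "action_mask": action_mask}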

Constructing the env itself wasn’t difficult, but pulling it all together has left me scratching my head.

What I would like advice on is:

  1. How to modify RLlib’s PPO to handle image data, e.g. with a ResBlock as an input layer.
  2. How to combine this with action masking.
  3. And, if there’s time (or patience left), how to get this up and training.

Here’s what I’ve managed to figure out so far:

According to the Environments page, I can register my custom environment using Ray Tune like so:

# Pulled from the docs:
from ray.tune.registry import register_env

def env_creator(config):
    return MyDummyEnv(config)  # Return a gymnasium.Env instance.

register_env("my_env", env_creator)

This is straightforward and makes sense. When I build the config I can call .environment() and specify my newly registered env; all good.
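
In other words, something like this (the env_config key is just a placeholder for whatever my env constructor actually expects):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("my_env", env_config={"render_mode": None})  # placeholder env_config
)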

My difficulty comes in with modifying PPO:

From the RL Modules docs, if I just wanted to use a default CNN stack, it looks like I should do:

# Pulled from the Docs
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

config = (
    PPOConfig()
    .environment("my_env")
    .rl_module(
        model_config=DefaultModelConfig(
            # Use a DreamerV3-style CNN stack.
            conv_filters=[
                [16, 4, 2],  # 1st CNN layer: num_filters, kernel, stride(, padding)?
                [32, 4, 2],  # 2nd CNN layer
                [64, 4, 2],  # etc..
                [128, 4, 2],
            ],
            conv_activation="silu",

            # After the last CNN, the default model flattens, then adds an optional MLP.
            head_fcnet_hiddens=[256],
        )
    )
)

But doesn’t this mean I’m unable to use action masking as well? The example in action_masking_rl_module.py suggests I do:

from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.examples.rl_modules.classes.action_masking_rlm import (
    ActionMaskingTorchRLModule,
)

config = (
    PPOConfig()
    .environment("my_env")
    .rl_module(
        # We need to explicitly specify here the RLModule to use and
        # the catalog needed to build it.
        rl_module_spec=RLModuleSpec(
            module_class=ActionMaskingTorchRLModule,
            model_config={
                "head_fcnet_hiddens": [64, 64],
                "head_fcnet_activation": "relu",
            },
        ),
    )
)

Now, I understand that ActionMaskingTorchRLModule is a custom RLModule, and this is the point where I begin to get lost.

From “Best ways to customize a PPO algorithm variant in Ray 2.8.0”, the answer seems to be that I need to create a new PPORLModule and PPOTorchRLModule?

Or is it as simple as doing

from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.examples.rl_modules.classes.action_masking_rlm import (
    ActionMaskingTorchRLModule,
)

config = (
    PPOConfig()
    .environment("my_env")
    .rl_module(
        # We need to explicitly specify here the RLModule to use and
        # the catalog needed to build it.
        rl_module_spec=RLModuleSpec(
            module_class=ActionMaskingTorchRLModule,
            model_config={
                "conv_filters": [
                    [16, 4, 2],   # 1st CNN layer: num_filters, kernel, stride(, padding)?
                    [32, 4, 2],   # 2nd CNN layer
                    [64, 4, 2],   # etc.
                    [128, 4, 2],
                ],
                "conv_activation": "silu",
                "head_fcnet_hiddens": [64, 64],
                "head_fcnet_activation": "relu",
            },
        ),
    )
)

or some logical variation thereof?
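
In case the answer really is “write your own module”, here is my rough sketch of what I think that would look like, combining a small CNN encoder with the masking trick from the example. I’m assuming a recent Ray where TorchRLModule exposes self.observation_space / self.action_space directly and supports a single _forward() override, plus the ValueFunctionAPI for the value head; the class name, layer sizes, and comments are all mine, so please treat it as a guess rather than working code:

import torch
from torch import nn

from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.apis import ValueFunctionAPI
from ray.rllib.core.rl_module.torch import TorchRLModule


class MaskedConvPPOModule(TorchRLModule, ValueFunctionAPI):
    """Rough sketch: CNN encoder + action masking (placeholder name and sizes)."""

    def setup(self):
        # Obs space is a Dict; the image lives under "observation" with shape (H, W, C) = (8, 8, 111).
        in_channels = self.observation_space["observation"].shape[-1]
        self._encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Flatten(),  # -> [B, 64 * 8 * 8]
        )
        self._pi_head = nn.Linear(64 * 8 * 8, self.action_space.n)  # logits for Discrete(4672)
        self._vf_head = nn.Linear(64 * 8 * 8, 1)

    def _embed(self, batch):
        obs = batch[Columns.OBS]
        # (B, H, W, C) bool planes -> (B, C, H, W) floats for Conv2d.
        img = obs["observation"].float().permute(0, 3, 1, 2)
        return self._encoder(img), obs["action_mask"]

    def _masked_logits(self, embeddings, mask):
        logits = self._pi_head(embeddings)
        # Push the logits of illegal actions (mask == 0) to the dtype minimum,
        # so they are effectively never sampled.
        return logits.masked_fill(mask == 0, torch.finfo(logits.dtype).min)

    def _forward(self, batch, **kwargs):
        embeddings, mask = self._embed(batch)
        return {Columns.ACTION_DIST_INPUTS: self._masked_logits(embeddings, mask)}

    def compute_values(self, batch, embeddings=None):
        if embeddings is None:
            embeddings, _ = self._embed(batch)
        return self._vf_head(embeddings).squeeze(-1)

The idea would then be to pass module_class=MaskedConvPPOModule into the RLModuleSpec above instead of ActionMaskingTorchRLModule, but I’m not sure whether that is the intended approach.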

If I’m on the wrong track, could you please point me in the right direction?

Thanks