Handling lots of policies and large obs spaces

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.55.1
  • Python version: 3.14.5
  • OS: macOS and Linux
  • Cloud/Infrastructure: -
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: Algorithm builds successfully
  • Actual: OOM

Hello everybody. I am running league play on a territory-based game (a territorial.io clone). I am now scaling up from 8 to 32 players and from an 80x80 to a 256x256 map, but I can no longer even build the algo for evaluation because of OOM: on my Mac it fills 40 GB of RAM right away.

  • My obs space: Dict('action_mask': Box(0.0, 1.0, (44,), float32), 'observations': Box(-1, 1, (132, 256, 256), int8), 'stats': Box(0.0, 1.0, (64,), float32)) → (8650860,). There are 132 channels because I stack 4 frames for the model and each frame is one-hot encoded for 32 players. The stats are fed to the FC part of my CNN.
  • My model is a mid-sized CNN with roughly 5M params.
    I see that memory starts to skyrocket instantly.
  • I am trying to set up 1 trainable policy and a pool of 32 bot algorithmic policies (no model at all) + 32 frozen copies of trainable one, which are being updated in league play. I sample those policies to form a new league in league play.
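
For scale, here is a rough sketch of the per-observation footprint implied by the space above. It assumes the Dict observation is flattened into a single float32 vector (the (8650860,) shape suggests flattening, but the float32 dtype is an assumption), and the variable names are made up for illustration.

```python
# Rough per-observation size for the obs space quoted in the post.
# Assumption: the Dict is flattened into one float32 vector.
from math import prod

# name -> (shape, bytes per element in the native dtype)
SPACES = {
    "action_mask": ((44,), 4),             # float32
    "observations": ((132, 256, 256), 1),  # int8
    "stats": ((64,), 4),                   # float32
}

flat_len = sum(prod(shape) for shape, _ in SPACES.values())
native_bytes = sum(prod(shape) * size for shape, size in SPACES.values())
flat_f32_bytes = flat_len * 4  # every element stored as float32

print(f"flattened length: {flat_len}")
print(f"native dtypes:    {native_bytes / 2**20:.1f} MiB per observation")
print(f"flat float32:     {flat_f32_bytes / 2**20:.1f} MiB per observation")
```

If the float32 assumption holds, the int8 image costs about four times its native size once flattened, so any batch of observations is far larger than the raw frames would suggest.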

Here is my config:

-----common.py-----

from strategy_config import N_PLAYERS
from ray.rllib.algorithms.ppo import PPOConfig
from prepare_env import ENV_NAME
from players.model import MODEL_NAME
from callback import MergedCallback
from league_play_callback import policies


config = (
    PPOConfig()
    .environment(
        env=ENV_NAME,
        clip_actions=True,
        disable_env_checking=False,
    )
    .multi_agent(
        policies=policies,
        policy_mapping_fn=(
            lambda aid, *args, **kwargs: "p0"
        ),  # will be changed in league callback
        count_steps_by="agent_steps",
        policies_to_train=["p0"],
        policy_map_capacity=N_PLAYERS,
    )
    .debugging(log_level="DEBUG")
    .framework(framework="torch")
    .api_stack(
        enable_rl_module_and_learner=False,
        enable_env_runner_and_connector_v2=False,
    )
    .callbacks(MergedCallback)
    # .checkpointing(checkpoint_trainable_policies_only=True)
)
training_params = dict(
    train_batch_size=None,
    minibatch_size=None,
    lr=2e-5,
    gamma=0.999,
    lambda_=0.9,
    use_gae=True,
    clip_param=0.2,
    grad_clip=0.5,
    entropy_coeff=0.01,
    vf_loss_coeff=0.25,
    num_epochs=10,
    model={"custom_model": MODEL_NAME},
)
env_runners_params = dict(batch_mode="truncate_episodes")

-----mac_config.py-----

from configs.common import config, training_params, env_runners_params

CONFIG_NAME = "MAC"
training_params["train_batch_size"] = 64
training_params["minibatch_size"] = 16
config = config.training(**training_params).env_runners(
    num_env_runners=2, **env_runners_params, num_cpus_per_env_runner=2
)
NUM_CPUS = 4
NUM_GPUS = 0

I understand that build_algo() sets up workers that run a sample batch to initialize losses and so on, but I don't understand why this adds up to an enormous 40 GB of memory when a single obs sample pass plus the policy weights take no more than 300 MB in total. Why does this happen?

Any help is appreciated!

The high memory usage is likely due to RLlib’s handling of large observation spaces and many policies in league-based setups. When you have a large observation space (e.g., (132, 256, 256)) and many policies (trainable, frozen, bots), RLlib’s PolicyMap may keep many policy objects in memory, and each policy’s initialization can trigger large sample batches and data structures, quickly consuming RAM. This is especially true during build_algo(), which initializes all policies and may create large dummy batches for each, leading to high memory usage even before training starts. This behavior is documented for league/self-play setups and large obs spaces, and is a known scaling challenge in RLlib (PolicyMap source, league-based example).

To mitigate, consider reducing the number of policies in memory (e.g., by lowering PolicyMap capacity or using swappable policy states), shrinking the observation space if possible, or customizing policy/model initialization to avoid unnecessary memory allocation. RLlib’s PolicyMap supports caching and swapping policies to disk to avoid OOM, but with very large obs spaces, even a single policy’s dummy batch can be huge. See PolicyMap docs and league-based training discussion for more details. Would you like a step-by-step breakdown or code-level suggestions?

Hint: Mention @RunLLM in the post for followups.

@RunLLM why do I get 40 GB of memory usage if a single policy and batch won't take more than 300 MB? Does the algorithm try to set up all the policies at once? Can I change this behaviour?

Yes, RLlib’s Algorithm builds all policies at once during initialization, and for each policy, it creates large dummy batches (matching your observation space) to initialize losses and model structures. With many policies and a large obs space, this multiplies memory usage rapidly, leading to OOM even if a single policy/batch is small (PolicyMap source, slow initialization discussion).
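
To make the "multiplies memory usage" point concrete, here is a back-of-the-envelope sketch. Everything in it is an assumption rather than RLlib's actual behaviour: the number of rows in a dummy batch, the float32 flattening, the presence of both obs and next-obs columns, and how many such batches are alive at once. The helper names are hypothetical.

```python
# Back-of-the-envelope estimate; every parameter below is an assumption.
FLAT_OBS_LEN = 132 * 256 * 256 + 44 + 64  # 8,650,860, as given in the post
BYTES_PER_ELEM = 4                        # assumes float32 after flattening


def dummy_batch_gib(rows: int, obs_columns: int = 2) -> float:
    """Size of the observation columns of one batch, in GiB.

    obs_columns=2 assumes the batch holds both obs and next-obs.
    """
    return rows * obs_columns * FLAT_OBS_LEN * BYTES_PER_ELEM / 2**30


def total_gib(rows: int, live_batches: int) -> float:
    """Total if `live_batches` such batches are alive at the same time."""
    return live_batches * dummy_batch_gib(rows)


for rows in (1, 16, 32):
    print(f"{rows:>2} rows: {dummy_batch_gib(rows):.2f} GiB per batch")
print(f"20 batches of 32 rows: {total_gib(32, 20):.1f} GiB")
```

Under these assumptions a single-row batch is tiny, but a few dozen rows per batch times a few dozen policies (and again per worker process) would land in the tens of GiB.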

You can partially control this by reducing the number of policies initialized at once, lowering the policy_map_capacity, or using policy_states_are_swappable if your policies share architecture. However, RLlib does not natively support fully lazy or sequential policy initialization out-of-the-box; all policies in the config are built up front (PolicyMap source, algorithm config). Would you like more detail or code examples?
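
A minimal sketch of how those two settings could be passed, assuming the multi_agent() keywords named in this thread (policy_map_capacity, policy_states_are_swappable) exist in your Ray version. The helper apply_overrides and the policy count are hypothetical; the count is read off the original post.

```python
# Hypothetical helper; the keyword names are the ones mentioned in this
# thread, so check them against the Ray version you are running.
N_POLICIES = 1 + 32 + 32  # trainable + bot policies + frozen copies

multi_agent_overrides = dict(
    # Keep every policy resident so none has to be stashed and reloaded.
    policy_map_capacity=N_POLICIES,
    # Only safe when all policies share class and spaces. The bot policies
    # in the post have no model, so this is left off here.
    policy_states_are_swappable=False,
)


def apply_overrides(config):
    """Apply the overrides to an existing config object and return it."""
    return config.multi_agent(**multi_agent_overrides)
```

In the poster's layout this would be called on the config from common.py before build_algo(), for example config = apply_overrides(config).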

@RunLLM

If I can't interfere with the setup, can I at least cap the total memory usage directly so that RLlib doesn't kill my Mac?

You cannot directly enforce a hard memory cap on RLlib policies during setup; RLlib’s memory resource specification is only for bookkeeping and does not limit actual memory usage (memory management discussion). To avoid OOM, you should reduce the number of policies in memory (lower policy_map_capacity), shrink the observation space, or run fewer workers. RLlib will otherwise allocate as much memory as needed for all policies and their dummy batches during initialization.

If you need to prevent your system from being killed, limit the number of concurrent policies or processes, and monitor memory usage externally. RLlib does not provide a built-in mechanism to enforce a strict memory ceiling during algorithm setup (memory management discussion). Would you like more detail or practical workarounds?
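
As one way to "monitor memory usage externally", here is a small standard-library sketch that reads the peak resident set size of the current process. It is Unix-only and only sees the process it runs in, so it would cover the driver but not separate Ray worker processes. The function names and the budget check are hypothetical.

```python
# Unix-only sketch: peak resident memory of the *current* process.
import resource
import sys


def peak_rss_mib() -> float:
    """Peak RSS in MiB; ru_maxrss is bytes on macOS, kilobytes on Linux."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    peak_bytes = peak if sys.platform == "darwin" else peak * 1024
    return peak_bytes / 2**20


def within_budget(limit_mib: float) -> bool:
    """True while the process has never exceeded `limit_mib` of RSS."""
    return peak_rss_mib() <= limit_mib


if __name__ == "__main__":
    print(f"peak RSS so far: {peak_rss_mib():.1f} MiB")
    print("within 8 GiB budget:", within_budget(8 * 1024))
```

Logging this before and after the build step at least shows how much of the total belongs to the driver versus the workers.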

Raising policy map capacity to 100 eased the memory pressure and sped up the setup drastically for some reason.

@RunLLM any explanations?

Raising policy_map_capacity to 100 allows RLlib’s PolicyMap to keep more policies in memory before swapping them to disk. When capacity is too low relative to the number of policies, RLlib frequently stashes and reloads policies from disk, causing high memory pressure and slow initialization due to constant serialization/deserialization (slow initialization discussion, PolicyMap source).

By increasing policy_map_capacity above your number of active policies, you avoid this overhead, resulting in lower memory thrashing and much faster setup. This is a known issue when using many policies in league/self-play setups (slow initialization discussion). Would you like a step-by-step explanation?
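
The thrashing argument can be illustrated with a toy LRU cache. This is not RLlib's PolicyMap, just a sketch assuming least-recently-used eviction and policies that are touched in a repeating order; each miss stands in for one stash/reload cycle.

```python
# Toy LRU model of a capacity-limited policy cache (not RLlib code).
from collections import OrderedDict


def count_misses(n_policies: int, capacity: int, rounds: int) -> int:
    """Cycle through all policies `rounds` times and count cache misses."""
    cache: OrderedDict[int, None] = OrderedDict()
    misses = 0
    for _ in range(rounds):
        for pid in range(n_policies):
            if pid in cache:
                cache.move_to_end(pid)  # mark as most recently used
                continue
            misses += 1                 # policy must be (re)loaded
            cache[pid] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses


print("capacity 32: ", count_misses(65, 32, rounds=3), "loads")
print("capacity 100:", count_misses(65, 100, rounds=3), "loads")
```

With a capacity below the number of policies, a cyclic access pattern evicts each policy just before it is needed again, so every access is a reload; once the capacity covers all policies, each one is loaded exactly once.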
