Reproducibility with seeds and Ray Tune / RLlib

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.51.1
  • Python version: 3.12.10
  • OS: Windows 11

3. What happened vs. what you expected:

  • Expected: Setting the appropriate seeds ensures reproducibility with Ray Tune.
  • Actual: I am setting the seeds according to the documentation, but I cannot ensure reproducibility between Ray Tune experiments.

Below is the code of my main.py, where I set seeds for everything I could think of, but I still cannot get reproducibility with Ray Tune. Am I missing something? The docs on this are not very comprehensive, and all the posts I could find here are over three years old.

import random
from pathlib import Path

import numpy as np
import ray
import torch
from ray import tune
from ray.air.integrations.wandb import WandbLoggerCallback
from ray.rllib.core.rl_module import MultiRLModuleSpec, RLModuleSpec
from ray.rllib.examples.algorithms.mappo.mappo import MAPPOConfig
from ray.rllib.examples.algorithms.mappo.torch.shared_critic_torch_rl_module import SharedCriticTorchRLModule
from ray.tune.registry import register_env

from callbacks import MetricsLoggerCallback
from config.config import GENERATE_RANDOM_ROUTES, DEBUG, LOG_TO_WANDB
from rl_environment.observation_classes import CameraObservation, NoisyCameraObservation
from rl_environment.sumo_traffic_env import SumoTrafficEnv

SHARED_CRITIC_ID = "shared_critic"
SEED = 100

def env_creator(env_config):
    current_file = Path(__file__)
    project_base = current_file.parent.parent

    # Define file paths and the maximum simulation time
    net: Path = project_base / "simulation_files" / "net.net.xml"
    route: Path = project_base / "simulation_files" / "random.rou.xml"
    trip: Path = project_base / "simulation_files" / "random.trips.xml"
    additional: Path = project_base / "simulation_files" / "mytypes.add.xml"

    return SumoTrafficEnv(
        sumo_net_file=net,
        sumo_route_file=route,
        sumo_trip_file=trip,
        sumo_additional_file=additional,
        reward_function="negative_accumulated_waiting_time_since_last_step",
        observation_class=NoisyCameraObservation,
        show_gui=False,
        simulation_time=600,
        generate_random_routes=GENERATE_RANDOM_ROUTES,
        sumo_simulation_seed=str(SEED)
    )


if __name__ == "__main__":
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)

    initial_data_env = env_creator({})

    agent_ids = initial_data_env.agents

    ray.init(local_mode=DEBUG)

    register_env("sumo_marl", env_creator)

    observation_space = initial_data_env.observation_spaces[agent_ids[0]]
    action_space = initial_data_env.action_spaces[agent_ids[0]]

    policies = [f"p_{agent_id}" for agent_id in agent_ids]

    specs = {p: RLModuleSpec() for p in policies}
    specs[SHARED_CRITIC_ID] = RLModuleSpec(
        module_class=SharedCriticTorchRLModule,
        observation_space=observation_space,
        action_space=action_space,
        learner_only=True,
        model_config={"observation_spaces": initial_data_env.observation_spaces},
    )

    config = (
        MAPPOConfig()
        .environment("sumo_marl")
        .env_runners(
            num_env_runners=1 if not DEBUG else 0,
            num_envs_per_env_runner=1,
            num_cpus_per_env_runner=3,
            sample_timeout_s=50000,
        )
        .multi_agent(
            policies=policies + [SHARED_CRITIC_ID],
            policy_mapping_fn=lambda aid, *a, **kw: f"p_{aid}",
        )
        .rl_module(
            rl_module_spec=MultiRLModuleSpec(
                rl_module_specs=specs,
            ),
        )
        .learners(
            num_learners=0,
            num_cpus_per_learner=3,
        )
        .training(
            train_batch_size=6000,
            minibatch_size=256,
        )
        .callbacks(MetricsLoggerCallback)
        .debugging(seed=SEED)
    )

    param_space = config.to_dict()
    param_space["seed"] = SEED

    # Training starten
    tuner = tune.Tuner(
        config.algo_class,
        param_space=param_space,
        run_config=tune.RunConfig(
            name="SharedCritic",
            stop={"training_iteration": 100},
            verbose=1,
            callbacks=(
                [
                    WandbLoggerCallback(
                        project="new_env",
                        name="MAPPO CTDE Individual Policies NoisyCameraObservation NegativeAccumulatedWaitingTimeSinceLastStep",
                    )
                ]
                if LOG_TO_WANDB
                else None
            ),
            checkpoint_config=tune.CheckpointConfig(
                checkpoint_frequency=10, checkpoint_at_end=True
            ),
        ),
    )

    tuner.fit()

    # algo = config.build_algo()
    # print(algo.train())

To ensure reproducibility in Ray Tune, you must set seeds for all random number generators (Python, NumPy, PyTorch, etc.) and also pass the seed to Ray Tune’s config and search algorithms. However, even with all seeds set, full reproducibility is not always guaranteed due to sources of non-determinism such as parallel execution, asynchronous sampling, or GPU operations. The Ray documentation emphasizes that you should set the seed at the start of your training function and for the search algorithm, but notes that some sources of non-determinism (e.g., trial scheduling, GPU parallelism) may still affect results. See Ray Tune FAQ and example code.
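
Beyond the three seed calls you already have, PyTorch also exposes explicit determinism switches that address the GPU-related non-determinism mentioned above. Below is a minimal sketch of a "seed everything" helper; the helper name is illustrative and not part of your script or of RLlib:

import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    # Seed the RNGs used in the driver process.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
    # Prefer deterministic cuDNN kernels and disable autotuning, which can
    # otherwise pick different kernels between runs.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warn (rather than error) when an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)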

Your code sets seeds for Python, NumPy, and PyTorch, and passes the seed to the config, which is correct. However, reproducibility can still be affected if, for example, your environment or any library used inside it introduces randomness that is not seeded, or if you use multiple workers or GPUs. Also, if you use schedulers like ASHA or PBT, or if your environment is not fully deterministic, results may still differ between runs. For strict reproducibility, ensure all randomness in your environment is controlled, avoid asynchronous algorithms, and use CPU-only training if possible. See Ray Tune FAQ and deterministic training example.

Would you like a step-by-step checklist to further debug your reproducibility issue?

@RunLLM I think I have all of your checklist items implemented. I still cannot ensure full reproducibility.

Even with all recommended seeding and configuration, Ray Tune and RLlib cannot guarantee full reproducibility in all cases. This is due to inherent non-determinism in distributed execution, parallel sampling, trial scheduling, and some third-party libraries (especially on GPU). The Ray documentation explicitly states that “it is not always possible to control all sources of non-determinism,” and that factors like trial completion order, asynchronous algorithms, and environment randomness can still cause results to differ between runs, even with the same seed set everywhere (Ray Tune FAQ).

If you require strict reproducibility, you may need to limit your setup to a single worker, avoid asynchronous schedulers, use CPU-only training, and ensure your environment is fully deterministic. However, even then, some non-determinism may persist due to the underlying system or libraries (Ray Tune FAQ, Deterministic training example).
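
If you want to test how far determinism can be pushed in practice, one option is to strip the setup down to a single, synchronous, CPU-only process. The sketch below reuses SEED and the registered "sumo_marl" env from your script and omits the multi-agent and RLModule setup for brevity; it illustrates the relevant knobs and is not a guarantee that every source of non-determinism disappears:

from ray import tune
from ray.rllib.examples.algorithms.mappo.mappo import MAPPOConfig

config = (
    MAPPOConfig()
    .environment("sumo_marl")
    # Sample in the local process only; no remote env-runner actors.
    .env_runners(num_env_runners=0, num_envs_per_env_runner=1)
    # Keep the learner local and on CPU to avoid GPU non-determinism.
    .learners(num_learners=0, num_gpus_per_learner=0)
    # Fixed seed, propagated by RLlib to env runners and environments.
    .debugging(seed=SEED)
)

tuner = tune.Tuner(
    config.algo_class,
    param_space=config.to_dict(),
    # No ASHA/PBT scheduler: a single trial with default behavior.
    run_config=tune.RunConfig(stop={"training_iteration": 100}),
)
tuner.fit()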

Would you like more detail on specific sources of non-determinism or further debugging steps?

@RunLLM I am setting a seed via the .debugging() method of the AlgorithmConfig, as described in the deterministic training example. My env is a MultiAgentEnv, so its reset method takes an optional seed. However, this seed is always None, so maybe this is the reason for the missing reproducibility? Why is this seed None, and where is it supposed to be passed from?

Yes, the seed being None in your MultiAgentEnv’s reset method can break reproducibility. In RLlib, the seed is passed to the environment’s reset method by the rollout worker or env runner, which computes it from the config’s seed, the worker index, and the vector index (rollout_worker.py). If your environment is not being seeded, it may be because it does not use the new gymnasium-style API, because it is not wrapped correctly, or because the rollout worker never calls reset with a seed.

To ensure the seed is passed, your environment must implement reset with the signature def reset(self, *, seed=None, options=None) and be compatible with the new API (custom env example). RLlib will then call reset with a computed seed. If you use the old API or a wrapper that does not forward the seed, it will remain None. See rollout_worker.py and custom env example.
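
For reference, here is a minimal sketch of a MultiAgentEnv reset that accepts and uses the seed RLlib passes in. It is illustrative only; MySeededEnv, _apply_sumo_seed, and _compute_obs stand in for whatever your environment actually does with its randomness:

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MySeededEnv(MultiAgentEnv):
    def reset(self, *, seed=None, options=None):
        # gymnasium seeds self.np_random when a seed is passed through here.
        super().reset(seed=seed)
        if seed is not None:
            # Forward the seed to every other source of randomness as well,
            # e.g. SUMO's simulation seed or a route generator (hypothetical helper).
            self._apply_sumo_seed(seed)
        obs = {aid: self._compute_obs(aid) for aid in self.agents}
        infos = {aid: {} for aid in self.agents}
        return obs, infos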

Would you like a step-by-step guide to ensure your environment is seeded correctly?

@RunLLM I have finally achieved reproducibility for my problem. There seems to be a bug somewhere in the RLlib code; at least I can’t find a reason why this happens. When I define my agent IDs as integers ranging from 0 to 15, reproducibility works. When my agent IDs are defined as strings, for example A0, A1, A2, A3, B0 and so on, reproducibility does not work. I don’t see any reason why one would work and the other wouldn’t.

It seems likely that RLlib’s reproducibility is affected by the type or ordering of agent IDs. There is no documented requirement that agent IDs must be integers—RLlib supports any hashable type, including strings (MultiAgentEnv docs). However, some internal RLlib logic or data structures may iterate over agent IDs in a non-deterministic order if they are strings, especially if sets or dicts are used, which can break reproducibility (discuss.ray.io).
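
One concrete mechanism that fits this explanation: Python randomizes string hashes per process (unless PYTHONHASHSEED is fixed), so a set built from string agent IDs can iterate in a different order on every run, while small integers hash to themselves and keep a stable order. A quick illustration you can run twice and compare:

# Run this twice: the string set's iteration order can change between
# interpreter runs (hash randomization), while the int set's order stays stable.
string_ids = {"A0", "A1", "A2", "A3", "B0", "B1"}
int_ids = {0, 1, 2, 3, 4, 5}

print(list(string_ids))  # order may differ from run to run
print(list(int_ids))     # [0, 1, 2, 3, 4, 5] every time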

My understanding is that using integer agent IDs may result in a consistent order, while string IDs could lead to unpredictable ordering, affecting random number generation and thus reproducibility. This could be considered a bug or at least an undocumented behavior in RLlib.
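
Until this is clarified upstream, a defensive workaround is to make all ordering explicit yourself: sort the agent IDs wherever you derive policies from them, and pin PYTHONHASHSEED for both the driver and Ray's worker processes. The sketch below reuses initial_data_env from your script; the runtime_env approach is one way to propagate the variable, and it is not verified to remove every ordering difference inside RLlib:

import ray

# PYTHONHASHSEED only affects a process if it is set *before* that process
# starts, so export it in your shell for the driver, e.g.:
#   PYTHONHASHSEED=0 python main.py
# The runtime_env below propagates the same setting to Ray worker processes.
ray.init(runtime_env={"env_vars": {"PYTHONHASHSEED": "0"}})

# Derive policies from a sorted copy of the agent IDs so any ordering you
# control is identical on every run, regardless of the ID type.
agent_ids = sorted(initial_data_env.agents)
policies = [f"p_{agent_id}" for agent_id in agent_ids]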

Would you like more detail on how to work around this or how to report it upstream?
