Reproducible training - setting seeds for all workers / environments

You could do this via the env maker, like so:

import gym

class MyEnvClass(gym.Env):
    def __init__(self, config):
        self.property1 = config.get("property1")  # <- use some keys from the config
        worker_idx = config.worker_index  # <- but you can also use the worker index
        num_workers = config.num_workers  # <- or the total number of workers
        vector_env_index = config.vector_index  # <- or the vector env index
        ...
        # to set the seed
        self.seed(worker_idx + num_workers + vector_env_index)

from ray.tune import register_env

register_env("my_seeded_env", lambda config: MyEnvClass(config))

Thanks for raising this @Lauritowal. I’ll add this to the custom_env example script so we have an example to point to in the future! :slight_smile:

Hi!
I tried it with my custom env; however, the config EnvContext is always an empty { } in my environment…

    ray.init()

    register_env("myenv", lambda config: MyEnv(
            phase=0,
            rllib_config=config
        ))

    default_config = td3.TD3_DEFAULT_CONFIG.copy()
    custom_config = {
        "env":"myenv",
        "lr": 0.0001, 
        "num_gpus": 0,
        "framework": "torch",
        "callbacks": CustomCallbacks,
        "log_level": "WARN",
        "evaluation_interval": 20,
        "evaluation_num_episodes": 10,
        "num_workers": 3,
        "num_envs_per_worker": 3,
        "seed": 3
    }
    config = {**default_config, **custom_config}

    resources = TD3Trainer.default_resource_request(config).to_json()

    # start training
    now = datetime.datetime.now().strftime("date_%d-%m-%Y_time_%H-%M-%S")
    tune.run(my_train_fn,
             name=f"test_{now}",
             resources_per_trial=resources,
             config=config)

And in my environment I do:

rllib_seed = seed + rllib_config.worker_index + rllib_config.num_workers + rllib_config.vector_index

As I said above, rllib_config is always an empty dict.

Any idea why?

Thank you
Walter

It’s empty because the “env_config” key in your custom_config is not set.
Can you try doing this?

custom_config = {
        "env":"myenv",
        "env_config": { [some config for your env] },
        ...

RLlib will take the “env_config” dict and create an EnvContext object from it, which is also just a dict plus the properties: worker_index, num_workers, remote, and vector_index.

Everything works now as expected. Thank you very much @sven1977 :slight_smile:

Hi,
Thanks for posting this!

I got a bit confused by the above answer, the custom_env example and the config parameters…
Does config[‘seed’] seed a different random engine than the self.seed in @sven1977’s answer?

Thanks

@2dm

  • self.seed does set the seed in the environment.
  • config[“seed”] is the seed property you pass to the environment.

Then you can do self.seed(config["seed"]), for example,

or self.seed(config["seed"] + worker_idx + num_workers + vector_env_index)
if you are using multiple workers and parallel environments, to set the seed in your environment.
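
As a minimal sketch of that inside an env class (the "seed" key inside env_config and the exact composition are just one possible choice, not the only way to do it):

import gym
from gym.utils import seeding

class MySeededEnv(gym.Env):
    def __init__(self, config):
        # `config` is an EnvContext: dict access for env_config keys,
        # plus worker_index / num_workers / vector_index attributes.
        base_seed = config.get("seed", 0)
        self.seed(base_seed + config.worker_index + config.num_workers + config.vector_index)

    def seed(self, seed=None):
        # Seed whatever RNG the env uses internally; np_random is one common choice.
        self.np_random, seed = seeding.np_random(seed)
        return [seed]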

@Lauritowal
Thanks, it is now clear that I need to set the seed in my environment.
However, config[‘seed’] is a trainer config parameter and not a part of env_config (according to this doc).
So if I use it as in the doc, for example:

custom_config = {
        "env":"myenv",
        "env_config": { [some config for your env] },
        "seed": 123,
        ...

then “seed” is not accessible from the environment (only what is inside env_config is).

So does it hold a different role or am I missing something?

@2dm
I may not have the correct answer to this, but I think the OP wanted to set seeds for a custom environment. In my own experience, using the config[“seed”] value for a DQN Trainer with a pettingzoo env resulted in identical results with num_workers = 24 and num_envs_per_worker = 8.

Not sure but this might help?
rllib/examples/deterministic_training.py

This is the place in rollout_worker.py where the seed is set. From what I understand, the seed is set there for random, numpy, and tf/pytorch, as well as for the env if it supports seeding.
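
As a rough, hand-written approximation of that kind of per-worker seeding (this is not the actual rollout_worker.py code; the function name and seed composition are made up purely for illustration):

import random
import numpy as np
import torch  # assuming the torch framework

def seed_worker(base_seed, worker_index, env=None):
    # Derive a worker-specific seed so the workers don't all mirror each other.
    computed_seed = base_seed * 1000 + worker_index  # illustrative composition
    random.seed(computed_seed)
    np.random.seed(computed_seed)
    torch.manual_seed(computed_seed)
    if env is not None and hasattr(env, "seed"):
        env.seed(computed_seed)
    return computed_seed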

Thanks @rfali !
That solved my confusion.

I thought I needed to pass the seed inside env_config as well, but rollout_worker.py really cleared up how it is set.

That’s great.

While we are on the topic of seeding, do you agree with this? This post is asking to seed the action_space too, which I am not entirely sure about. The post says that
env.action_space.sample()
gave different results between runs, and that
env.action_space.seed(RANDOM_SEED)
solved the reproducibility issue.

Isn’t this supposed to make even random actions (that is what sample() does, I think) deterministic?
The reason I am asking is that if setting env.action_space.seed(seed) is a good idea, then I didn’t see it in rollout_worker.py.

Hey @rfali and everyone here (@Lauritowal, @2dm), thanks for your help above! It never occurred to me that one would have to seed the action space as well. The RLlib “deterministic_training.py” example script you posted above does seem to work w/o this. Could you post a reproduction script that proves that seeding the action space is also necessary? If so, we can just add an extra seeding line to RLlib to fix it.

@sven1977 thanks for looking at this. I agree that this needs further investigation. The RLlib “deterministic_training.py” example script uses PPO as the policy, whereas the post about seeding action_space is about taking a random action (and wanting to sample that action deterministically).

As can be seen in the gym wiki, env.action_space.sample() selects a random action, uniformly sampled from the action_space. See the function definition for discrete spaces. If the action_space is not seeded, the sampled actions are bound to change between runs. Here is a relevant gym GitHub issue that talks about this as well. Reproducing the script from this issue, I tried the following:

import gym

a1 = []
a2 = []

env1 = gym.make('FrozenLake-v0')
env1.seed(0)

s1 = env1.reset()

for _ in range(5):
    a1.append(env1.action_space.sample())

    
env2 = gym.make('FrozenLake-v0')
env2.seed(0)

# Seeding the action_space: both lines below make sampling deterministic,
# but each yields a different (fixed) action sequence.
env2.action_space.seed(0)
# env2.action_space.np_random.seed(0)

s2 = env2.reset()

for _ in range(5):
    a2.append(env2.action_space.sample())


print('actions sampled: env1', a1)
print('actions sampled: env2', a2)

with the result

actions sampled: env1 [0, 1, 0, 1, 1] # this will change
actions sampled: env2 [0, 3, 1, 0, 3] # this will always remain the same

The function definition of space.np_random() is here.

I tried applying a RandomPolicy from here on the “deterministic_training.py” but ran into some errors that I didn’t understand.

I am not entirely sure whether one would want to get deterministic actions out of env.action_space.sample(), but as discussed in the aforementioned gym issue, setting a seed value sets an expectation of determinism from the environment. Therefore, it might come as a surprise to a user if an agent is sampling different ‘random’ actions between runs.

Finally, this discussion on Garage may shed light on how libraries should handle this. It appears to me that the Garage maintainers came to the conclusion that when setting a seed, the action space should be seeded as well, so that env.action_space.sample() is deterministic.
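
For reference, the pattern under discussion boils down to something like this (a minimal sketch; the helper name is mine, and seeding observation_space is an optional extra):

import gym

def make_seeded_env(env_id="FrozenLake-v0", seed=0):
    env = gym.make(env_id)
    env.seed(seed)                    # env dynamics / reset randomness
    env.action_space.seed(seed)       # makes action_space.sample() deterministic
    env.observation_space.seed(seed)  # optional, for observation_space.sample()
    return env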

@sven1977 Similar to @rfali, I also did what was described in Reproducibility issues using OpenAI Gym – Harald’s blog to get the same action sequence for a specific seed in my custom env during training:

env.action_space.seed(RANDOM_SEED)

(I was using TD3)

If I don’t do that, I’ve noticed that the actions taken by the agent are different on a second run with the same seed when training via RLlib. (Seeding the action space is not needed when using Stable Baselines 2 or 3, by the way, if I remember correctly. Maybe that is somehow helpful.)

Found more evidence of action_space seeding, this time in the original TD3 repo of S. Fujimoto, first author of the TD3 paper. Here is the commit.

Even after seeding the action space, I still cannot get the same results across multiple runs.

Could anyone reproduce the exact results for the following toy example with the latest code base? Thanks! :)

Below are the results for two runs; each run has the same seed value of zero.

import argparse
import os

from ray.rllib.examples.env.stateless_cartpole import StatelessCartPole
from ray.rllib.utils.test_utils import check_learning_achieved

parser = argparse.ArgumentParser()
parser.add_argument(
    "--run",
    type=str,
    default="PPO",
    help="The RLlib-registered algorithm to use.")
parser.add_argument("--num-cpus", type=int, default=0)
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "tfe", "torch"],
    default="torch",
    help="The DL framework specifier.")
parser.add_argument("--eager-tracing", action="store_false")
parser.add_argument("--use-prev-action", action="store_true")
parser.add_argument("--use-prev-reward", action="store_true")
parser.add_argument(
    "--as-test",
    action="store_true",
    help="Whether this script should be run as a test: --stop-reward must "
    "be achieved within --stop-timesteps AND --stop-iters.")
parser.add_argument(
    "--stop-iters",
    type=int,
    default=4,
    help="Number of iterations to train.")
parser.add_argument(
    "--stop-timesteps",
    type=int,
    default=1000000,
    help="Number of timesteps to train.")
parser.add_argument(
    "--stop-reward",
    type=float,
    default=15000.0,
    help="Reward at which we stop training.")

if __name__ == "__main__":
    import ray
    from ray import tune

    args = parser.parse_args()
    cwd_path = os.getcwd()
    print('cwd path', cwd_path)
    logdir = cwd_path + '/log'
    print(logdir)
    if not os.path.isdir(logdir):
        os.mkdir(logdir)
    ray.init()

    configs = {
        "PPO": {
            "num_sgd_iter": 5,
            "model": {
                "vf_share_layers": False,
            },
            "vf_loss_coeff": 0.0001,
        },
        "IMPALA": {
            "num_workers": 2,
            "num_gpus": 0,
            "vf_loss_coeff": 0.01,
        },
    }

    config = dict(
        configs[args.run],
        **{
            "env": StatelessCartPole,
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": 0,
            "num_workers": 4,
            "num_gpus_per_worker": 0,#(2 - 1) / (1+1),
 
            "grad_clip": 0.5, #tune.grid_search([0.5, 0.5, 0.5]),#tune.grid_search([1, 5]),

            "seed": tune.grid_search([0, 0, 0]),
            "model": {
                "use_lstm": True,
                "lstm_cell_size": 256,
                "lstm_use_prev_action": args.use_prev_action,
                "lstm_use_prev_reward": args.use_prev_reward,
            },
            "framework": args.framework,
            # # Run with tracing enabled for tfe/tf2?
            # "eager_tracing": args.eager_tracing,
        })

    stop = {
        "training_iteration": args.stop_iters,
        "timesteps_total": args.stop_timesteps,
        "episode_reward_mean": args.stop_reward,
    }

    # To run the Trainer without tune.run, using our LSTM model and
    # manual state-in handling, do the following:

    # Example (use `config` from the above code):
    # >> import numpy as np
    # >> from ray.rllib.agents.ppo import PPOTrainer
    # >>
    # >> trainer = PPOTrainer(config)
    # >> lstm_cell_size = config["model"]["lstm_cell_size"]
    # >> env = StatelessCartPole()
    # >> obs = env.reset()
    # >>
    # >> # range(2) b/c h- and c-states of the LSTM.
    # >> init_state = state = [
    # ..     np.zeros([lstm_cell_size], np.float32) for _ in range(2)
    # .. ]
    # >> prev_a = 0
    # >> prev_r = 0.0
    # >>
    # >> while True:
    # >>     a, state_out, _ = trainer.compute_single_action(
    # ..         obs, state, prev_a, prev_r)
    # >>     obs, reward, done, _ = env.step(a)
    # >>     if done:
    # >>         obs = env.reset()
    # >>         state = init_state
    # >>         prev_a = 0
    # >>         prev_r = 0.0
    # >>     else:
    # >>         state = state_out
    # >>         prev_a = a
    # >>         prev_r = reward

    results = tune.run(args.run, config=config, verbose=3, checkpoint_freq=50, local_dir=logdir, stop=stop)

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)
    ray.shutdown()

Wouldn’t that lead to {worker_index=0, vector_index=1} and {worker_index=1, vector_index=0} creating the same environment seed (i.e., sampling the same sequence of starting positions, random elements in the env, …)?
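
That collision is real if the seed is just the sum of the indices; here is a quick sketch of the issue and one collision-free alternative (the helper name and composition are my own, assuming num_envs_per_worker is known to the env):

# With seed = worker_index + num_workers + vector_index and num_workers = N:
#   worker_index=0, vector_index=1  ->  0 + N + 1
#   worker_index=1, vector_index=0  ->  1 + N + 0   (same seed!)

def env_seed(base_seed, worker_index, vector_index, num_envs_per_worker):
    # Gives every (worker_index, vector_index) pair a distinct seed.
    return base_seed + worker_index * num_envs_per_worker + vector_index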

Hi Sven, great info.

I think this is a very important detail about rollout workers; however, I failed to find any documentation covering this kind of “hidden” info regarding env_config, besides the code.

Maybe I missed something, or is there a documentation topic on this one? If not, is it worth adding? I could also submit a request in the GitHub issues if needed.

Best regards,
Ian