Policy weights overwritten in self-play

Hi all!
I am trying a self-play based scheme in the waterworld environment: two agents share a policy that is being trained (“shared_policy_1”), and the other 3 agents use a second policy (“shared_policy_2”) whose weights are sampled from a menagerie (a set) of previous versions of the first policy.
My problem is that the weights stored in the menagerie are overwritten at every iteration by the current weights, and I don’t understand why this is happening (the same happens when I use a dictionary instead of a list for the men variable).
So instead of holding M previous sets of policy weights, the menagerie contains M copies of the current weights of training iteration i.
You can check the .txt files the script creates, since I cannot upload them here.
Thanks in advance.

Please find the code attached:

from ray import tune
from ray.rllib.agents.callbacks import DefaultCallbacks
import argparse
import gym
import os
import random
import ray
import numpy as np
from ray.tune.registry import register_env
from ray.rllib.env.pettingzoo_env import PettingZooEnv
from pettingzoo.sisl import waterworld_v3

M = 5  # Menagerie size
men = []


class MyCallbacks(DefaultCallbacks):

    def on_train_result(self, *, trainer, result: dict, **kwargs):
        print("trainer.train() result: {} -> {} episodes".format(
            trainer, result["episodes_this_iter"]))
        i = result['training_iteration']    # starts from 1
        # the "shared_policy_1" is the only agent being trained
        print("training iteration:", i)
        global men

        if i <= M:
            # menagerie initialisation
            tmp = trainer.get_policy("shared_policy_1").get_weights()
            men.append(tmp)

            filename1 = 'file_init_' + str(i) + '.txt'
            textfile1 = open(filename1, "w")
            for element1 in men:
                textfile1.write("############# menagerie entries ##################" + "\n")
                textfile1.write(str(element1) + "\n")
            textfile1.close()

        else:
            # the first policy added is erased
            men.pop(0)
            # add current training policy in the last position of the menagerie
            w = trainer.get_policy("shared_policy_1").get_weights()
            men.append(w)
            # select one policy randomly
            sel = random.randint(0, M-1)

            trainer.set_weights({"shared_policy_2": men[sel]})

            weights = ray.put(trainer.workers.local_worker().save())
            trainer.workers.foreach_worker(
                lambda w: w.restore(ray.get(weights))
            )

        filename = 'file' + str(i) + '.txt'
        textfile = open(filename, "w")
        for element in men:
            textfile.write("############# menagerie entries ##################" + "\n")
            textfile.write(str(element) + "\n")
        textfile.close()

        # you can mutate the result dict to add new fields to return
        result["callback_ok"] = True


if __name__ == "__main__":

    ray.init()

    def env_creator(args):
        return PettingZooEnv(waterworld_v3.env(n_pursuers=5, n_evaders=5))

    env = env_creator({})
    register_env("waterworld", env_creator)
    obs_space = env.observation_space
    act_spc = env.action_space

    policies = {"shared_policy_1": (None, obs_space, act_spc, {}),
                "shared_policy_2": (None, obs_space, act_spc, {})
                }

    def policy_mapping_fn(agent_id):
        if agent_id in ("pursuer_0", "pursuer_1"):
            return "shared_policy_1"
        else:
            return "shared_policy_2"

    tune.run(
        "PPO",
        name="PPO self play n = 5, M=5 trial 1",
        stop={"episodes_total": 50000},
        checkpoint_freq=10,
        config={
            # Environment specific
            "env": "waterworld",
            # General
            "framework": "torch",
            "callbacks": MyCallbacks,
            "num_gpus": 0,
            "num_workers": 0,
            # Method specific
            "multiagent": {
                "policies": policies,
                "policies_to_train": ["shared_policy_1"],
                "policy_mapping_fn": policy_mapping_fn,
            },
        },
    )

My concern is whether this could be a multithreading/memory issue or an RLlib-related issue, since I can’t find anything wrong with the code.
Thanks again.

Hi @george_sk,

What version of Ray are you using? I ran your code as a quick test, and I did see the weights changing with ray 1.4.0, but I am getting an error once it updates the weights after iteration 6. It seems to be related to the restore operation changing the ids of the param groups for the Adam optimizer.

Manny


Following up: I had to change from restore to set_weights, like this:

trainer.set_weights({"shared_policy_2": men[sel]})
weights = ray.put(trainer.get_weights("shared_policy_2"))
trainer.workers.foreach_worker(
                lambda w: w.set_weights(ray.get(weights))
            )

Hi @mannyv and thanks for your reply.
I was running on ray version 1.2.0.
I tested on ray 1.4.0 and the weird thing is that I get no error (I tested it on Windows 10 with Python 3.8 and on Ubuntu 18 with Python 3.6), with my weight-syncing code, but even with the weight-syncing code you proposed I still get the same behaviour.

For example, if I add the following (below the men.append(tmp) command inside the if statement):

            tmp = trainer.get_policy("shared_policy_1").get_weights()
            men.append(tmp)

            filename2 = 'init_weights_' + str(i) + '.txt'
            textfile2 = open(filename2, "w")

            textfile2.write("############# menagerie entries ##################" + "\n")
            textfile2.write(str(tmp) + "\n")
            textfile2.close()

in order to see the current weights, I get the results shown in the attached images.
So I want to emphasize the fact that there is a problem even in the menagerie initialisation procedure, before any weight syncing.

The weights of the 3rd iteration (init_weights_3) are repeated in the menagerie (file_init_3_1st_entry, file_init_3_2nd_entry, file_init_3_3rd_entry,) and the weight of the second iteration (init_weights_2) are nowhere.





Hi @george_sk,

I am not sure why we are getting such different results. I have uploaded what I get for each file on Pastebin: file1.txt starts with {'_logits._model.0.weight': array([[ 0.00113152, -0.00399814, -0.0015 ... Take a look and let me know what you think.

Hi @mannyv ,

Yes, the behaviour you get is the correct one. I am running in PyCharm. I will reinstall some things in case something is corrupted, but it is weird, since we were even getting different errors.

Hi @george_sk,

I too am using PyCharm with an Anaconda environment. I do not think it should really matter, but my system details are Ubuntu 20.04, Python 3.7.9, ray 1.4.0.

Just to make sure we are on the same page and that I did not change something without realizing it, here is the code I used to generate the files mentioned above:

from ray import tune
from ray.rllib.agents.callbacks import DefaultCallbacks
import argparse
import gym
import os
import random
import ray
import numpy as np
from ray.tune.registry import register_env
from ray.rllib.env.pettingzoo_env import PettingZooEnv
from pettingzoo.sisl import waterworld_v3

M = 5  # Menagerie size
men = []


class MyCallbacks(DefaultCallbacks):

    def on_train_result(self, *, trainer, result: dict, **kwargs):
        print("trainer.train() result: {} -> {} episodes".format(
            trainer, result["episodes_this_iter"]))
        i = result['training_iteration']    # starts from 1
        # the "shared_policy_1" is the only agent being trained
        print("training iteration:", i)
        global men

        if i <= M:
            # menagerie initialisation
            tmp = trainer.get_policy("shared_policy_1").get_weights()
            men.append(tmp)

            filename1 = 'file_init_' + str(i) + '.txt'
            textfile1 = open(filename1, "w")
            for element1 in men:
                textfile1.write("############# menagerie entries ##################" + "\n")
                textfile1.write(str(element1) + "\n")
            textfile1.close()

        else:
            # the first policy added is erased
            men.pop(0)
            # add current training policy in the last position of the menagerie
            w = trainer.get_policy("shared_policy_1").get_weights()
            men.append(w)
            # select one policy randomly
            sel = random.randint(0, M-1)

            trainer.set_weights({"shared_policy_2": men[sel]})

            weights = ray.put(trainer.get_weights("shared_policy_2"))
            trainer.workers.foreach_worker(
                lambda w: w.set_weights(ray.get(weights))
            )

        filename = 'file' + str(i) + '.txt'
        textfile = open(filename, "w")
        for element in men:
            textfile.write("############# menagerie entries ##################" + "\n")
            textfile.write(str(element) + "\n")
        textfile.close()

        # you can mutate the result dict to add new fields to return
        result["callback_ok"] = True


if __name__ == "__main__":

    ray.init(local_mode=True)

    def env_creator(args):
        return PettingZooEnv(waterworld_v3.env(n_pursuers=5, n_evaders=5))

    env = env_creator({})
    register_env("waterworld", env_creator)
    obs_space = env.observation_space
    act_spc = env.action_space

    policies = {"shared_policy_1": (None, obs_space, act_spc, {}),
                "shared_policy_2": (None, obs_space, act_spc, {})
                }

    def policy_mapping_fn(agent_id):
        if agent_id in ("pursuer_0", "pursuer_1"):
            return "shared_policy_1"
        else:
            return "shared_policy_2"

    tune.run(
        "PPO",
        name="PPO self play n = 5, M=5 trial 1",
        stop={"training_iteration":10}, #{"episodes_total": 50000},
        checkpoint_freq=10,
        config={
            # Environment specific
            "env": "waterworld",
            # General
            "framework": "torch",
            "callbacks": MyCallbacks,
            "num_gpus": 0,
            "num_workers": 0,
            # Method specific
            "multiagent": {
                "policies": policies,
                "policies_to_train": ["shared_policy_1"],
                "policy_mapping_fn": policy_mapping_fn,
            },
        },
    )

This is the command I used to get the weights snippet for each file:

for i in {1..10}; do echo file$i.txt && grep -R -A3 "{'_logits._model.0.weight'" file$i.txt; done > weights.txt

Hi @mannyv ,

I ran the code you attached and it runs correctly for me too (on the machines I was using before, on both Windows and Linux).

The only difference between your code and mine is the command:

    ray.init(local_mode=True)

I was using plain ray.init().
If I use this command without local_mode=True, I get the previous wrong behaviour. But should I always run with local_mode=True in order to get correct results?
Because I read that it is meant for debugging and might have some limitations.

@george_sk,

I just verified that setting local_mode=False does induce the buggy behavior you described. In general, local_mode=True is meant for debugging, because it runs everything in one process.

I figured out what is happening. When you call get_weights, it returns a dictionary of numpy arrays with one entry per parameter in the model. It looks like, when local_mode=True, the numpy arrays in the weights dictionary are references to unique objects, but when local_mode=False they are references to the same underlying numpy arrays, whose values change during the learn_on_batch steps. This means that you end up with a list of dictionaries that all hold references to the same numpy objects, so when the weights are updated, every dictionary in the menagerie is updated as well.

To fix this I had to add a deepcopy of the weights dictionary. I know it is the numpy objects, and not the dictionary itself, that share references, because copy.copy did not work; it has to be copy.deepcopy.

Note that I also moved the men list from a global variable to a class member.
You could do the same for M.
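
To make the aliasing concrete, here is a minimal sketch in plain numpy, with no RLlib involved; the dictionary layout and the key are only stand-ins for what get_weights() returns:

import copy
import numpy as np

# Stand-in for the weights dict: its values may alias the live parameter buffers.
live = {"_logits._model.0.weight": np.zeros(3)}

men_shallow = [dict(live) for _ in range(2)]        # new dicts, same arrays
men_deep = [copy.deepcopy(live) for _ in range(2)]  # new dicts, new arrays

live["_logits._model.0.weight"] += 1.0  # "training" mutates the buffers in place

print(men_shallow[0]["_logits._model.0.weight"])  # [1. 1. 1.] -> tracks the live weights
print(men_deep[0]["_logits._model.0.weight"])     # [0. 0. 0.] -> snapshot stays frozen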

@rliaw, @sven1977 Were you aware of this behavior?

Here is a working version of the code:

from ray import tune
from ray.rllib.agents.callbacks import DefaultCallbacks
import argparse
import gym
import os
import random
import ray
import numpy as np
from ray.tune.registry import register_env
from ray.rllib.env.pettingzoo_env import PettingZooEnv
from pettingzoo.sisl import waterworld_v3
import copy

M = 5  # Menagerie size


class MyCallbacks(DefaultCallbacks):
    def __init__(self):
        super(MyCallbacks, self).__init__()
        self.men = []

    def on_train_result(self, *, trainer, result: dict, **kwargs):
        print("trainer.train() result: {} -> {} episodes".format(
            trainer, result["episodes_this_iter"]))
        i = result['training_iteration']    # starts from 1
        # the "shared_policy_1" is the only agent being trained
        print("training iteration:", i)

        if i <= M:
            # menagerie initialisation
            tmp = copy.deepcopy(trainer.get_policy("shared_policy_1").get_weights())
            self.men.append(tmp)

            filename1 = 'file_init_' + str(i) + '.txt'
            textfile1 = open(filename1, "w")
            for element1 in self.men:
                textfile1.write("############# menagerie entries ##################" + "\n")
                textfile1.write(str(element1) + "\n")
            textfile1.close()

        else:
            # the first policy added is erased
            self.men.pop(0)
            # add current training policy in the last position of the menagerie
            w = copy.deepcopy(trainer.get_policy("shared_policy_1").get_weights())
            self.men.append(w)
            # select one policy randomly
            sel = random.randint(0, M-1)

            trainer.set_weights({"shared_policy_2": self.men[sel]})

            weights = ray.put(trainer.get_weights("shared_policy_2"))
            trainer.workers.foreach_worker(
                lambda w: w.set_weights(ray.get(weights))
            )

        filename = 'file' + str(i) + '.txt'
        textfile = open(filename, "w")
        for element in self.men:
            textfile.write("############# menagerie entries ##################" + "\n")
            textfile.write(str(element) + "\n")
        textfile.close()

        # you can mutate the result dict to add new fields to return
        result["callback_ok"] = True


if __name__ == "__main__":

    ray.init()

    def env_creator(args):
        return PettingZooEnv(waterworld_v3.env(n_pursuers=5, n_evaders=5))

    env = env_creator({})
    register_env("waterworld", env_creator)
    obs_space = env.observation_space
    act_spc = env.action_space

    policies = {"shared_policy_1": (None, obs_space, act_spc, {}),
                "shared_policy_2": (None, obs_space, act_spc, {})
                }

    def policy_mapping_fn(agent_id):
        if agent_id in ("pursuer_0", "pursuer_1"):
            return "shared_policy_1"
        else:
            return "shared_policy_2"

    tune.run(
        "PPO",
        name="PPO self play n = 5, M=5 trial 1",
        stop={"training_iteration":10}, #{"episodes_total": 50000},
        checkpoint_freq=10,
        config={
            # Environment specific
            "env": "waterworld",
            # General
            "framework": "torch",
            "callbacks": MyCallbacks,
            "num_gpus": 0,
            "num_workers": 0,
            # Method specific
            "multiagent": {
                "policies": policies,
                "policies_to_train": ["shared_policy_1"],
                "policy_mapping_fn": policy_mapping_fn,
            },
        },
    )

Hmm, that’s very interesting…

This definitely deserves an issue on GitHub. We probably need to go through the entire codebase to see where else there might be sources of unexpected behavior when switching from local_mode=True to local_mode=False.

Also, ideally local_mode=False is the ground truth, so this looks like a bug at first glance.


Thank you very much @mannyv for the working version of the code.
Since it seems to be a bug, do you want me to file it on GitHub, or will you file it, since you found out what is going on?

@george_sk,

If you don’t mind, I would appreciate it if you would file it.


Hi @mannyv and @sven1977 ,

I would like your help clarifying something about weight syncing in this example.

As I understand it, when I set weights for a policy I have to do the syncing across workers manually. In this topic, https://discuss.ray.io/t/board-game-self-play-ppo/1425/13, Sven had proposed:

trainer.set_weights({"shared_policy_2": self.men[sel]})  #loading weights 

weights = ray.put(trainer.workers.local_worker().save())
trainer.workers.foreach_worker(
lambda w: w.restore(ray.get(weights))
            )

or

trainer.set_weights({"shared_policy_2": self.men[sel]})  #loading weights 

local_weights = trainer.workers.local_worker().get_weights()
trainer.workers.foreach_worker(lambda worker: worker.set_weights(local_weights))

and Manny here said he had an error with the first method and proposed:

trainer.set_weights({"shared_policy_2": self.men[sel]})

weights = ray.put(trainer.get_weights("shared_policy_2"))
trainer.workers.foreach_worker(
lambda w: w.set_weights(ray.get(weights))
            )

I want to ask whether all these methods are equivalent (meaning that they all sync the weights).
Especially for Manny’s method, since it does:

weights = ray.put(trainer.get_weights("shared_policy_2"))

does this sync the weights only for “shared_policy_2”, or does it overwrite the weights of “shared_policy_1” with those of “shared_policy_2”, since it is applied to all the workers? It seems like the latter to me, but I am not sure how it works.
Also, all three methods run on my PC without errors.
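
In other words, would an explicit per-policy dict like the one below be the way to sync only “shared_policy_2”? This is an untested sketch; it assumes that worker.set_weights() only updates the policy IDs present in the dict it receives.

pol_id = "shared_policy_2"
# build the {policy_id: weights} dict explicitly so that only this policy is broadcast
weights = ray.put({pol_id: trainer.get_policy(pol_id).get_weights()})
trainer.workers.foreach_worker(
    lambda w: w.set_weights(ray.get(weights))
)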

Thanks,
George

Hey @george_sk, without looking much further through your code, could you take a look at this example script (ray.rllib.examples.self_play_with_open_spiel_connect_4.py), which adds a self-play example to RLlib using the Connect Four game on DeepMind’s OpenSpiel?

It does learn a good policy (you can play against it at the end of the script 🙂), also in non-local mode!

Also, thanks @mannyv for digging into this bug. This is indeed very interesting and may have to be fixed.