**How severe does this issue affect your experience of using Ray?**

- High: It blocks me to complete my task.

I’m troubleshooting a problem with my MARL experiment. I’m using RLlib with the PPO algorithm.

Part of my observation is a `Box` with shape `(1,)`, i.e. just one neuron, and the agents learn a correlation (positive or negative) between this neuron and some output neuron. I designed the environment so that a positive correlation and a negative correlation are equally lucrative, and so that as soon as some agents commit to either positive or negative, all other agents are incentivized to choose the same correlation.

However, when I run the experiment, the agents always choose a positive correlation. This happens no matter how many times I run the experiment.

I’m wondering whether the initialization values of the model weights have anything to do with it. Maybe RLlib initializes some weights with positive values that give all agents a starting bias toward a positive correlation?
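As a quick sanity check on that hypothesis, here is a sketch (assuming RLlib's default PyTorch models use `nn.Linear`'s standard Kaiming-uniform initialization, which draws from `U(-bound, bound)` with `bound = 1/sqrt(fan_in)`) showing that the default distribution is symmetric around zero, simulated with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# PyTorch's default nn.Linear init (Kaiming uniform with a=sqrt(5))
# draws weights from U(-bound, bound), symmetric around zero.
fan_in = 1  # the single observation neuron feeding the first hidden layer
bound = np.sqrt(1.0 / fan_in)
weights = rng.uniform(-bound, bound, size=256)  # 256 hypothetical hidden units
print(weights.mean())  # near zero: no built-in positive bias from this init
```

If the defaults really are symmetric like this, the consistent positive correlation would have to come from somewhere else, which is why I'd like to control the starting weights directly and test it.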

**Is there any argument in RLlib that allows me to control the starting weights?**

Thanks for your help,

Ram Rachum.