How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I’m working with a turn-based game environment. I’ve successfully trained a self-play agent using a single-agent environment: on each step, the env alternates between returning the observation for player_0 and player_1.
Now I’m trying to replicate the same self-play results with a MultiAgentEnv, but it looks like I’m missing some critical difference in how the multi-agent case works. I’ve extended the env so that the step function now returns the observation nested under the “player_0” or “player_1” key, depending on whose turn it is:
{
    "player_0": <{obs dict from before}>
}
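Heavily simplified, the env now looks something like the sketch below (the observation/action spaces, the 10-turn cutoff, and the zero rewards are placeholders standing in for my real game logic; only the turn-alternating structure matters):

import gym
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class MySelfPlayEnv(MultiAgentEnv):
    """Simplified stand-in for my real turn-based game env."""

    def __init__(self, env_config=None):
        super().__init__()
        # Placeholder spaces -- the real ones come from the game.
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.current_player = "player_0"
        self.turn = 0

    def reset(self):
        self.current_player = "player_0"
        self.turn = 0
        # Only the player whose turn it is receives an observation.
        return {self.current_player: self.observation_space.sample()}

    def step(self, action_dict):
        # action_dict only ever contains the acting player's action.
        action = action_dict[self.current_player]  # consumed by the real game logic
        self.turn += 1
        done = self.turn >= 10                     # placeholder termination
        rewards = {self.current_player: 0.0}       # placeholder reward
        # Alternate turns: the *other* player gets the next observation.
        self.current_player = (
            "player_1" if self.current_player == "player_0" else "player_0"
        )
        obs = {self.current_player: self.observation_space.sample()}
        return obs, rewards, {"__all__": done}, {}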
I’ve also added the multiagent key to my training script:
import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.policy.policy import PolicySpec

ray.init(local_mode=False)

config = {
    "num_workers": 0,
    "num_gpus": 1,
    "num_envs_per_worker": 128,
    "framework": "torch",
    "disable_env_checking": True,
    "_disable_preprocessor_api": True,
    "_disable_action_flattening": True,
    "env": "my_selfplay_env",  # my custom MultiAgentEnv, registered under this name elsewhere
    "model": {
        "custom_model": MyModel,             # my custom model class
        "custom_action_dist": MyActionDist,  # my custom action distribution
    },
    "multiagent": {
        # A single shared policy; both players map to it.
        "policies": {
            "main": PolicySpec(),
        },
        "policy_mapping_fn": lambda agent_id, episode, worker, **kwargs: "main",
        "policies_to_train": ["main"],
    },
}

trainer = PPOTrainer(config)
...
I would expect this to do exactly the same thing as the self-play setup I had before, but with the new setup my results have higher variance and lower rewards, across multiple seeds. This makes me think I’ve misconfigured something so that the policy is being optimized differently in this new setup (e.g., is this actually training one set of parameters shared by both players, like I’d expect?). What am I missing?
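For reference, by “one set of parameters for both players” I mean something like the following hypothetical check (just to illustrate the expectation; mapping_fn and main_weights are names I’m using here, not from my actual script):

# Both agent ids should map to the same "main" policy ...
mapping_fn = config["multiagent"]["policy_mapping_fn"]
assert mapping_fn("player_0", None, None) == mapping_fn("player_1", None, None) == "main"

# ... so these should be the only parameters PPO is updating.
main_weights = trainer.get_policy("main").get_weights()
print({name: w.shape for name, w in main_weights.items()})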