Different step space for different agents

Hello, I am working on multi-agent RL with RLlib. As far as I know, all agents are expected to take an action at each step.

But I am wondering if it is possible to let different agents act at different step frequencies. For example, agent 1 takes an action at every step, while agent 2 takes an action only every 2 steps.

Thanks in advance for your reply!

You could write your own policy and override its compute_actions() method so that it either computes a regular action or not, depending on a timestep variable kept in the policy class. Something like:

import numpy as np

from ray.rllib.policy.policy import Policy
from ray.rllib.utils.spaces.space_utils import flatten_to_single_ndarray


class MyPolicy(Policy):
    def __init__(self, observation_space, action_space, config, *args, **kwargs):
        # The base class stores observation_space and action_space;
        # we sample from self.action_space below.
        super().__init__(observation_space, action_space, config)
        # Variable to keep track of the timestep. This is the variable
        # you check to decide whether to return a regular action or not.
        self.timestep = 0

    def compute_actions(self,
                        obs_batch=None,
                        state_batches=None,
                        prev_action_batch=None,
                        prev_reward_batch=None,
                        info_batch=None,
                        episodes=None,
                        **kwargs):
        self.timestep += 1
        batch_size = len(obs_batch)

        # Compute (here: randomly sample) the actions.
        actions = flatten_to_single_ndarray(
            [self.action_space.sample() for _ in range(batch_size)]
        ).reshape(batch_size, -1)

        if self.timestep % 2 == 0:
            # Take regular actions every second step.
            pass
        else:
            # Otherwise simply return NaNs; use a float dtype so NaN is
            # representable even for int action spaces.
            # ('None' sometimes has undesirable side effects.)
            actions = np.full_like(actions, np.nan, dtype=np.float32)

        return actions, [], {}
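
For completeness, here is a rough sketch of how such a custom policy could be wired into a multi-agent setup with the Ray 1.x config API. Note that env_multi, obs_space, and act_space are placeholders for your own env class and per-agent spaces, and the policy ids are made up:

from ray import tune

config = {
    "env": env_multi,
    "multiagent": {
        "policies": {
            # (policy_class_or_None, obs_space, act_space, extra_config)
            "agent1_policy": (None, obs_space, act_space, {}),      # default, trainable policy
            "agent2_policy": (MyPolicy, obs_space, act_space, {}),  # acts only every 2nd step
        },
        "policy_mapping_fn": lambda agent_id: (
            "agent1_policy" if agent_id == "agent1" else "agent2_policy"),
        # MyPolicy is a fixed heuristic policy, so only train the other one.
        "policies_to_train": ["agent1_policy"],
    },
}

tune.run("PPO", config=config)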

There might be a more elegant solution (@sven1977 or @mannyv might know about this), but with this approach you could check in your environment whether the returned actions are NaNs and handle them accordingly.

Hope this helps.

Hi @UserRLlib,

Welcome to the forum. You do not need to include every agent in the observation dictionary on every step. Actions are only computed for the agents that are present in the observation. If I am understanding your question correctly, putting agent2 in the observation dictionary only every other step would automatically give you what you are looking for.
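
For instance, a minimal sketch of such an env (class name, agent ids, spaces, and episode length are all just placeholders) could look like this:

import numpy as np
import gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TwoSpeedEnv(MultiAgentEnv):
    """agent1 acts on every step, agent2 only on every second step."""

    def __init__(self, config=None):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (2,))
        self.action_space = gym.spaces.Discrete(2)
        self.t = 0

    def reset(self):
        self.t = 0
        # Both agents get an initial observation and have to act first.
        return {"agent1": self._obs(), "agent2": self._obs()}

    def step(self, action_dict):
        # action_dict only contains actions for the agents that received
        # an observation in the previous step.
        self.t += 1
        obs = {"agent1": self._obs()}
        rewards = {"agent1": 0.0}
        # Only ask agent2 for an action every other step.
        if self.t % 2 == 0:
            obs["agent2"] = self._obs()
            rewards["agent2"] = 0.0
        dones = {"__all__": self.t >= 10}
        return obs, rewards, dones, {}

    def _obs(self):
        # Dummy observation.
        return np.zeros(2, dtype=np.float32)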

Manny

Hi @mannyv and @Lars_Simon_Zehnder, thanks for your reply!

If I understand you correctly, in my multi-agent env class env_multi(MultiAgentEnv), the easiest way is to change the step function def step(self, action_dict): so that it returns agent2’s observation only every other step.

I will try it ASAP and get back to you soon.

Thanks again for your kind help!

Hey @Lars_Simon_Zehnder, @mannyv, @UserRLlib, thanks for the question and all the good answers so far. :slight_smile: Just to add to this: there has been a recent improvement in RLlib that now allows completely flexible handling of these turn-based cases:

  • You can now publish observations for n agents at any step (and leave out the observations of the other m agents). The n agents will then have to send actions next (the other m agents shouldn’t!).
  • You can publish rewards for any agent at any(!) time. RLlib will sum up the rewards for an agent if the agent does not have an observation accompanying the reward (see the sketch after this list).
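
As a small, hedged illustration of what a single step() return could now look like in a turn-based, two-agent env (agent names and values are purely illustrative):

import numpy as np

# agent1 just acted; agent2 observes and must act next. agent1 already
# receives a reward without an accompanying observation -- RLlib will
# buffer and sum such rewards until agent1's next observation arrives.
obs = {"agent2": np.array([0.0, 1.0], dtype=np.float32)}
rewards = {"agent1": 1.0}
dones = {"__all__": False}
infos = {}
step_return = (obs, rewards, dones, infos)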

Before this fix, rewards had to be “remembered” by the env and could only be published together with the respective agent’s observation, which put a burden on the user. E.g. imagine a game where agent1 plays against agent2 and agent2’s reward depends on what agent1 just did. This is now much easier to implement.

Hi @sven1977 and @mannyv

Thanks for the reply, and thanks very much for the great work! After updating Ray to the latest version (1.5), RLlib no longer raises the error about mismatched obs and rewards, and I think it works well now. Cheers!

By the way, I don’t quite understand the second point you mentioned, @sven1977: “RLlib will sum up rewards for an agent if the agent does not have an observation accompanying the reward.”

For example, my env has two agents and I designed it like this:
step 0: act: agent1 → obs: agent2, reward: agent1
step 1: act: agent2 → obs: agent1, reward: agent2
step 2: act: agent1 → obs: agent2, reward: agent1

The obs and the reward never belong to the same agent. Does this mean the policy will sum up the rewards for each agent and wait until the end of the episode to learn? Thanks in advance for your reply!

Hey @UserRLlib, this looks great and RLlib should now be able to handle this setup (older versions would complain that agent1’s reward is published without agent1’s obs present).

What I meant by summing up was:

step 0: act: agent1 → obs: agent2, reward: agent1
step 1: act: agent2 → obs: agent1, reward: agent1 (← RLlib will add the previous reward to this one and then use that sum as the actual reward).

Hi @sven1977 , thanks very much for the clarification! :+1: