Different step space for different agents

Hello, I am working on multi-agent RL with RLlib. I know that all agents need to take an action at each step.

But I am wondering whether it is possible to let different agents have a different step space. For example, agent 1 takes an action at every step, while agent 2 takes an action only every 2 steps.

Thanks in advance for your reply!

You could write your own policy and override its compute_actions() method so that it either computes an action or not, depending on a timestep variable in your policy class. Something like:

import numpy as np

from ray.rllib.policy.policy import Policy
from ray.rllib.utils.spaces.space_utils import flatten_to_single_ndarray


class MyPolicy(Policy):
    def __init__(self, observation_space, action_space, config):
        super().__init__(observation_space, action_space, config)
        # Keep the action_space for later sampling.
        self.action_space = action_space
        # Variable to keep track of the timestep. This is
        # the variable you check for returning a regular action
        # or not.
        self.timestep = 0

    def compute_actions(self, obs_batch, state_batches=None, **kwargs):
        self.timestep += 1
        batch_size = len(obs_batch)

        # Sample one action per batch entry. Cast to float so that
        # NaN can be stored even for integer action spaces.
        actions = flatten_to_single_ndarray(
            [self.action_space.sample() for _ in range(batch_size)]
        ).reshape(batch_size, -1).astype(np.float32)

        if self.timestep % 2 == 0:
            # On every second call, return NaNs instead of actions
            # ('None' sometimes has undesirable side effects).
            actions = np.full_like(actions, np.nan)

        return actions, [], {}

There might be a more elegant solution (@sven1977 or @mannyv might know about this), but with this you could condition on the actions being NaNs in your environment.
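
For illustration, here is a minimal sketch of what the environment-side check could look like. The helper names `is_noop` and `apply_actions` and the `positions` dict are hypothetical and not part of RLlib; the sketch only assumes that a skipped action arrives as an array of NaNs.

```python
import math


def is_noop(action):
    """Return True if every component of `action` is NaN (the 'no action' marker)."""
    try:
        components = list(action)
    except TypeError:
        # Scalar action (e.g. a single float).
        components = [action]
    return all(math.isnan(float(c)) for c in components)


def apply_actions(action_dict, positions):
    """Hypothetical env update: move each agent by its action unless it is a no-op."""
    for agent_id, action in action_dict.items():
        if is_noop(action):
            continue  # this agent sits out the current step
        positions[agent_id] += float(action[0])
    return positions
```

Inside step(self, action_dict) of the env, apply_actions(action_dict, self.positions) would then leave an agent untouched whenever its policy returned NaNs.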

Hope this helps.

Hi @UserRLlib,

Welcome to the forum. You do not need to have every agent in the observation on every step. The actions are computed based on the agents in the observation. If I am understanding your question correctly, you only need to put agent2 in the observation dictionary every other step, and RLlib will automatically give you what you are looking for.
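
A rough sketch of that idea, written as a plain Python class so it runs without Ray installed. In a real setup the class would subclass RLlib's MultiAgentEnv and define observation and action spaces; the class name, the integer observations, and the zero rewards here are placeholders. The return signature follows the (obs, rewards, dones, infos) convention used in this thread.

```python
class EveryOtherStepEnv:
    """Toy two-agent env: agent1 is observed every step, agent2 every other step."""

    def __init__(self, episode_len=6):
        self.episode_len = episode_len
        self.t = 0

    def _obs(self):
        # agent1 always gets an observation; agent2 only on even timesteps.
        obs = {"agent1": self.t}
        if self.t % 2 == 0:
            obs["agent2"] = self.t
        return obs

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action_dict):
        # action_dict only contains agents that were in the previous obs dict.
        self.t += 1
        obs = self._obs()
        # Placeholder rewards, published only for agents that also get an obs.
        rewards = {agent_id: 0.0 for agent_id in obs}
        dones = {"__all__": self.t >= self.episode_len}
        return obs, rewards, dones, {}
```

With this layout, agent2 is asked for an action only after the steps in which it appeared in the observation dict.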


Hi @mannyv and @Lars_Simon_Zehnder, thanks for your reply!

If I understand you correctly, the easiest way is to change the step function, def step(self, action_dict):, in my multi-agent env class, class env_multi(MultiAgentEnv):, and let it return agent2’s observation only every other step.

I will try it ASAP and get back to you soon.

Thanks again for your kind help!

Hey @Lars_Simon_Zehnder, @mannyv, @UserRLlib, thanks for the question and all the good answers so far. Just to add to this: There has been a recent improvement in RLlib and it now allows for a completely flexible handling of these turn-based cases:

  • You can now publish observations for n agents at any step (and leave out the observations of the other m agents). The n agents will then have to send actions next (the other m agents shouldn’t!).
  • You can publish rewards for any agent at any(!) time. RLlib will sum up rewards for an agent iff the agent does not have an observation accompanying the reward.

Before this fix, rewards would have to be “remembered” by the env and they could only be published together with the respective agent’s observation, which posed a burden for the user. E.g. imagine a game, where agent1 plays against agent2 and agent2’s reward depends on what agent1 just did. This is now much more easily implementable.
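
As a hedged illustration of the turn-based case, the sketch below publishes an observation only for the agent that moves next, and a reward only for the agent that just moved. It is a plain Python class (no Ray import) with a hypothetical name and a made-up constant reward; a real env would subclass MultiAgentEnv and define its spaces.

```python
class TurnBasedEnv:
    """Toy alternating-turn env for two agents."""

    def __init__(self, max_turns=4):
        self.max_turns = max_turns
        self.turn = 0

    def reset(self):
        self.turn = 0
        return {"agent1": 0}  # agent1 moves first

    def step(self, action_dict):
        # Exactly one agent acts per step: the one observed last time.
        ((actor, action),) = action_dict.items()
        other = "agent2" if actor == "agent1" else "agent1"
        self.turn += 1
        obs = {other: action}   # the other agent sees the move and acts next
        rewards = {actor: 1.0}  # made-up reward, published without the actor's obs
        dones = {"__all__": self.turn >= self.max_turns}
        return obs, rewards, dones, {}
```

Here the reward for the acting agent is returned right away, so the env does not have to remember it until that agent's next observation.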

Hi @sven1977 and @mannyv

Thanks for the reply, and thanks very much for the great work! After updating Ray to the latest version (1.5), RLlib no longer raises the error about mismatched obs and rewards, and I think it works well now. Cheers!

By the way, I failed to understand the second point you mentioned @sven1977 , “RLlib will sum up rewards for an agent if the agent does not have an observation accompanying the reward.”

For example, in my env, I got two agents and I design the env like this
step 0: act: agent1 → obs: agent2, reward: agent1
step 1: act: agent2 → obs: agent1, reward: agent2
step 2: act: agent1 → obs: agent2, reward: agent1

The obs and reward never belong to the same agent. Does this mean the policy will sum up the rewards for each agent and wait until the end of the episode to learn? Thanks in advance for your reply!

Hey @UserRLlib , this looks great and RLlib should now be able to handle this setup (older versions would complain that agent1’s reward is published w/o agent1’s obs present).

What I meant by summing up, was:

step 0: act: agent1 → obs: agent2, reward: agent1
step 1: act: agent2 → obs: agent1, reward: agent1 (<- will add the previous reward to this one and then use that sum as the actual reward).
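
To make the arithmetic concrete, here is a toy model of the bookkeeping described above. It is not RLlib's actual implementation; the function name and the reward values 1.0 and 2.0 are invented for the example.

```python
def rewards_seen_by_policies(steps):
    """Toy model: hold each agent's rewards until that agent's next observation.

    steps: list of (obs_agents, reward_dict), one entry per env.step() call.
    Returns a list of (agent_id, summed_reward) in delivery order.
    """
    pending = {}    # rewards waiting for the agent's next observation
    delivered = []
    for obs_agents, reward_dict in steps:
        for agent_id, reward in reward_dict.items():
            pending[agent_id] = pending.get(agent_id, 0.0) + reward
        for agent_id in obs_agents:
            if agent_id in pending:
                delivered.append((agent_id, pending.pop(agent_id)))
    return delivered


steps = [
    (["agent2"], {"agent1": 1.0}),  # step 0: obs for agent2, reward for agent1
    (["agent1"], {"agent1": 2.0}),  # step 1: obs for agent1, reward for agent1
]
print(rewards_seen_by_policies(steps))  # [('agent1', 3.0)]
```

The reward from step 0 is held back and added to the one from step 1, so agent1 sees a single reward of 3.0 together with its observation.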

Hi @sven1977, thanks very much for the clarification!