Training for turn-based sequential games

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi, I have a game that doesn’t seem to fit the MultiAgentEnv class derived from Gym.

My game is a simple “divide the dollar” bargaining game with two agents. It is a sequential, turn-based game: in simple terms, the two agents take actions one after the other, and the reward is calculated only after everyone has acted.

More specifically, the sequence goes as follows:

Agent 1 proposes a division of the dollar (say 40%–60%). The proposal is queried from Agent 1’s policy. This proposal (i.e. Agent 1’s action) is passed as the observation to Agent 2, which queries its own policy (possibly a shared-parameter net with Agent 1) and replies {yes, no} to the proposal. Only then is the reward calculated: if the proposal was accepted, everyone gets the proposed amount and the game terminates. Otherwise, we go to another round of proposals.

In summary:
Agent 1’s policy → proposal → passed as observation to Agent 2 → Agent 2’s policy → reply {yes/no} → reward or next round.
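To pin the mechanics down, the core payoff logic can be sketched framework-free in plain Python (the function names, the 10-round cap, and the zero-payoff impasse are my own illustrative assumptions, not part of the game as stated):

```python
def play_round(proposal, accept):
    """One round of divide-the-dollar.

    proposal: fraction of the dollar the proposer keeps (0..1).
    accept:   responder's yes/no decision.
    Returns (done, (proposer_reward, responder_reward)).
    """
    if accept:
        # Deal struck: split the dollar as proposed.
        return True, (proposal, 1.0 - proposal)
    # Rejected: no payoff yet, play another round.
    return False, (0.0, 0.0)


def play_game(propose_fn, respond_fn, max_rounds=10):
    """Run rounds until a proposal is accepted or the rounds run out."""
    for _ in range(max_rounds):
        proposal = propose_fn()
        done, rewards = play_round(proposal, respond_fn(proposal))
        if done:
            return rewards
    return (0.0, 0.0)  # impasse: nobody gets anything
```

For example, `play_game(lambda: 0.4, lambda p: p <= 0.5)` ends in the first round with roughly a 40/60 split.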

To me, it seems like I need to call the policy twice: once for the proposal (without calculating any reward) and a second time for the responder; only then is a reward calculated.

This is tricky to code, given that the step() method is expected to return a reward every time the policy is called.

I have two options:

  1. A hacky way is to call the step() method twice to get the two actions needed to compute the reward. The first time, for the proposer, the reward the method returns would be a dummy value, say 0. The second time, for the responder, the returned reward is the true game reward.

Question: I am not sure this will train the policy correctly (remember, we have one policy with two heads, one that learns to propose and one that learns to respond, not two different policies trained independently).

  2. The right way to do this, IMHO, would be to train a policy as sketched below, but I am not sure whether RLlib supports this kind of training:

env = Env()
obs = env.reset()
episode_reward = 0
done = False
for _ in range(10):
    proposal = policy_to_train.compute_action(obs)       # Agent 1's action is used as Agent 2's observation
    response = policy_to_train.compute_action(proposal)  # respond to Agent 1's action
    obs, reward, done, info = env.step(proposal, response)  # compute reward, decide whether the game continues

What is the right way to build this Env in multi-agent RLlib?
Thanks so much!

Hi @Username1,

RLlib should support this as is; you just need to set up your environment the right way. The key thing to realize is that an agent accumulates the rewards from all the timesteps between its observations and attributes that summed reward to the most recent timestep on which it had an observation.
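A minimal sketch of that accumulation rule in plain Python (`credit_rewards` is a hypothetical helper for illustration, not an RLlib API):

```python
def credit_rewards(obs_steps, step_rewards):
    """Credit each reward to the agent's most recent prior observation,
    mimicking the accumulation rule described above.

    obs_steps:    sorted steps on which the agent had an observation
                  (and therefore took an action).
    step_rewards: dict mapping step -> reward the agent received then.
    Returns a dict mapping obs step -> total reward credited to it.
    """
    credited = {t: 0.0 for t in obs_steps}
    for step, r in sorted(step_rewards.items()):
        # Find the most recent observation strictly before this step.
        prior = [t for t in obs_steps if t < step]
        if prior:
            credited[prior[-1]] += r
    return credited
```

For instance, an agent that observed on steps 0 and 5 and received rewards 1.5 and 0.5 on steps 2 and 3 has both of those summed into the action it took at step 0.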

If your environment looks something like the one below, it will give both agents the same rewards.

I adapted this from the RLlib random env:

import copy

import gym
import numpy as np
from gym.spaces import Discrete
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TurnEnv(MultiAgentEnv):
    """A randomly acting turn-based environment."""

    def __init__(self, config=None):
        config = config or {}

        # Action space.
        self.action_space = config.get("action_space", Discrete(2))
        # Observation space from which to sample.
        self.observation_space = config.get("observation_space", Discrete(2))
        # Reward space from which to sample.
        self.reward_space = config.get(
            "reward_space",
            gym.spaces.Box(low=-1.0, high=1.0, shape=(), dtype=np.float32),
        )
        self.max_episode_len = config.get("max_episode_len", 9)
        # Steps taken so far (after last reset).
        self.steps = 0
        self.num_agents = 2

    def reset(self):
        self.steps = 0
        # Agent 0 (the proposer) receives the first observation.
        return {0: self.observation_space.sample()}

    def step(self, action_dict):
        self.steps += 1
        agent = self.steps % 2  # which agent observes (and acts) next
        obs = {agent: self.observation_space.sample()}
        rew = np.abs(self.reward_space.sample())
        info = {agent: {}}
        if agent == 1:
            # Agent 0 just acted; the turn is not finished yet, so it
            # gets a zero reward for now.
            rew = {0: rew * 0}
            done = {0: False, "__all__": False}
        else:
            # Agent 1 just acted; the turn is complete and both agents
            # receive the same reward.
            rew = {0: rew, 1: copy.deepcopy(rew)}
            done = {0: False, 1: False, "__all__": False}
        if self.steps > self.max_episode_len:
            done = {0: True, 1: True, "__all__": True}

        return obs, rew, done, info

Here is a screenshot showing that both agents get the same reward on each turn:
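The same pattern can be checked with a stripped-down, dependency-free sketch of the env above (plain Python in place of the gym spaces, and a constant 1.0 in place of the sampled reward): agent 0's mid-turn reward is zero, and both agents receive the same reward once the turn completes.

```python
import random


class ToyTurnEnv:
    """Simplified two-agent turn env: agents alternate, and reward is
    handed out only after the second agent (the responder) has acted."""

    def __init__(self, max_episode_len=9):
        self.max_episode_len = max_episode_len
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {0: random.randint(0, 1)}

    def step(self, action_dict):
        self.steps += 1
        agent = self.steps % 2          # who observes (and acts) next
        obs = {agent: random.randint(0, 1)}
        if agent == 1:
            # Agent 0 just acted; turn not finished, zero reward for it.
            rew = {0: 0.0}
            done = {0: False, "__all__": False}
        else:
            # Agent 1 just acted; both agents share the turn's reward.
            rew = {0: 1.0, 1: 1.0}
            done = {0: False, 1: False, "__all__": False}
        if self.steps > self.max_episode_len:
            done = {0: True, 1: True, "__all__": True}
        return obs, rew, done, {agent: {}}


env = ToyTurnEnv()
obs = env.reset()
rewards = []
done = {"__all__": False}
while not done["__all__"]:
    agent = next(iter(obs))
    obs, rew, done, info = env.step({agent: random.randint(0, 1)})
    rewards.append(rew)
# rewards alternates {0: 0.0} (mid-turn) and {0: 1.0, 1: 1.0} (turn done).
```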

One place that needs more thought is what happens if a training-batch boundary falls between the two steps of the same turn. Then the proposer's step could end up in one training batch and the responder's step in the next. I'll leave it to you to decide whether that matters.

Hi @mannyv, thank you very much for this reply. Let me confirm my understanding. From what I see in your code, you are using a “hack” to shoehorn the sequential problem into the gym env: you return a dummy reward of 0.

I am referring to this line: rew = {0 : rew*0}

The environment has two agents, but you pass a “sentinel” dummy reward to just one of them. This was my proposed solution (1), if I understood you correctly.

If I understood correctly, in a game like OpenAI’s Dota 2, the pseudocode would be as follows. In that game you have 5 agents; each takes an action in turn, observing the previous agent’s action as its observation, and only after all 5 have acted is a reward calculated. They all pull actions from a single policy net. The only way to code this in a gym env is something like:

agents = [1, 2, 3, 4, 5]

for agent in agents:
    if agent < 5:   # before all agents have acted, we don't have a reward yet
        reward = 0  # zero is a placeholder; the policy net should disregard it (fingers crossed)
        return reward, observation_for_next_agent
    else:           # only after everyone has acted do we calculate the reward
        reward = {agent: some_reward_nbr for agent in agents}  # assign the true reward
        return reward, observation_for_next_agent

In this way, we hope that the net will disregard the zero dummy values when training and learn only from the true values.
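That pseudocode can be made runnable as a plain-Python round loop (`act_fn` and `reward_fn` are stand-ins for the shared policy and the game's reward calculation; in Dota 2 the reward would of course come from the game itself):

```python
def run_round(num_agents, act_fn, reward_fn):
    """Each agent acts in turn, seeing the previous agent's action as its
    observation; the shared reward is computed only after the last agent."""
    observation = None
    actions = []
    per_step_rewards = []
    for agent in range(num_agents):
        action = act_fn(agent, observation)
        actions.append(action)
        if agent < num_agents - 1:
            # Not everyone has acted yet: emit a zero (not dummy) reward.
            per_step_rewards.append({agent: 0.0})
        else:
            # Round complete: every agent receives the true shared reward.
            r = reward_fn(actions)
            per_step_rewards.append({a: r for a in range(num_agents)})
        observation = action  # the next agent observes this action
    return per_step_rewards
```

For example, with `act_fn=lambda a, o: a` and `reward_fn=lambda acts: float(sum(acts))`, the first four steps each emit a zero reward for the acting agent, and the final step emits the same shared reward for all five agents.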

Please let me know if my understanding is correct. Thank you very much for your time.

Hi @Username1,

Yes, your understanding is correct, and your pseudocode looks good to me.

I would not call it a dummy value; it is a real reward that is actually used, it just happens to be zero. It is not discarded in any way.

The way RLlib works is like this: say an agent has an observation and takes an action on step t, and then does so again on step t+5. Any rewards it receives on steps t+1 through t+4 are all added together and attributed to the action taken on step t.

It is the presence of an observation that determines when reward is credited. In the learning phase there is no representation of timesteps without observations. If the episode is 10 steps long but the agent only had 5 observations, then its training data will only have 5 values for obs, action, reward, etc. It does include t as a value, so you can recover which step the data was generated on if needed, but most RL algorithms do not use that.
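That learning-phase view can be sketched as follows (a hypothetical data layout for illustration, not RLlib's actual SampleBatch format): a 10-step episode in which the agent observes only on even steps collapses to 5 training rows, each keeping its original t.

```python
def episode_to_training_rows(steps):
    """Turn a per-step episode log into the agent's training rows.

    steps: list of (t, obs_or_None, action_or_None, reward). Rewards on
    steps where the agent had no observation are folded into the reward
    of its most recent earlier row.
    """
    rows = []
    for t, obs, action, reward in steps:
        if obs is not None:
            rows.append({"t": t, "obs": obs, "action": action, "reward": reward})
        elif rows:
            # No observation this step: accumulate reward onto the last row.
            rows[-1]["reward"] += reward
    return rows


# A 10-step episode: the agent observes (and acts) only on even steps,
# and a reward of 1.0 arrives on every step.
log = [(t, (t,) if t % 2 == 0 else None, 0 if t % 2 == 0 else None, 1.0)
       for t in range(10)]
rows = episode_to_training_rows(log)
# rows has 5 entries (t = 0, 2, 4, 6, 8), each with an accumulated
# reward of 2.0, and no rows at all for the observation-less steps.
```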


Thank you very much for your detailed reply. This is an excellent design, and I am sure your explanation will be useful to many people coding sequential turn-based games.