How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I have a game that doesn’t seem to fit the MultiAgentEnv class derived from Gym.
My game is a simple “divide the dollar” bargaining game with two agents. It is a sequential, turn-based game: the two agents act one after the other, and the reward is only calculated after both have acted.
More specifically, the sequence goes as follows:
Agent1 proposes a division of the dollar (say 40%-60%). The proposal is queried from Agent1’s policy. This proposal (i.e. Agent1’s action) is passed as the observation to Agent2, which queries its own policy (possibly a network with parameters shared with Agent1) and replies {yes, no} to the proposal. Only then is the reward calculated: if the proposal was accepted, everyone gets the proposed amount and the game terminates. Otherwise, we go to another round of proposals.
In summary:
Agent1’s policy → proposal → passed as observation to Agent2 → Agent2’s policy → reply {yes/no} → reward or next round.
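To make the rules concrete, here is a minimal, framework-free sketch of the game logic I have in mind (the class name, the 10-round limit, and the zero reward on rejection are just my placeholders):

    class DivideTheDollarGame:
        """Plain-Python sketch of the bargaining rules, independent of any RL library."""

        def __init__(self, max_rounds=10):
            self.max_rounds = max_rounds
            self.round = 0
            self.proposal = None  # fraction of the dollar offered to Agent2, e.g. 0.6

        def propose(self, fraction_for_agent2):
            # Agent1 proposes a split; no reward can be computed yet.
            self.proposal = fraction_for_agent2

        def respond(self, accept):
            # Agent2 accepts or rejects; only now are the rewards known.
            self.round += 1
            if accept:
                return (1.0 - self.proposal, self.proposal), True  # (reward1, reward2), game over
            if self.round >= self.max_rounds:
                return (0.0, 0.0), True   # no deal within the round limit
            return (0.0, 0.0), False      # go to another proposal round

The point is that respond() is the only place where rewards exist; propose() cannot return one.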
To me, it seems like I need to call the policy twice, once for the proposal (without calculating any reward) and a second time for the responder, and only then is a reward calculated.
This is tricky to code, given that the step() method is expected to return a reward every time it is called, i.e. every time a policy computes an action.
I have two options:
- A hacky way is to call the step() method twice to get the two actions needed for the reward. The first time, for the proposer, the reward the method returns would be a dummy value, say 0. The second time, for the responder, the reward returned is the true game reward (see the MultiAgentEnv sketch at the end of this post).
Question: I am not sure this will train the policy correctly (remember, we have one policy with two heads, where one head learns to propose and the other learns to respond, not two different policies trained independently).
- The right way to do this, IMHO, would be to train a policy with a loop like the one below, but I don’t know whether RLlib supports this kind of training:
env = Env()
obs = env.reset()
episode_reward = 0
done = False
for _ in range(10):  # up to 10 bargaining rounds
    proposal = policy_to_train.compute_action(obs)       # Agent1's action is used as Agent2's observation
    response = policy_to_train.compute_action(proposal)  # Agent2 responds to Agent1's proposal
    obs, reward, done, info = env.step((proposal, response))  # only now is a reward computed; env decides if the game continues
    episode_reward += reward
    if done:
        break
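Going back to the hacky option 1: for reference, here is roughly how I imagine it as an RLlib MultiAgentEnv, assuming the older gym-style multi-agent API (reset() returns an obs dict, step() returns obs/reward/done/info dicts, and, as far as I understand, rewards returned for an agent that is not acting this step get credited to its previous action). The class name, observation encoding, and zero reward on rejection are just my placeholders:

    import numpy as np
    from ray.rllib.env.multi_agent_env import MultiAgentEnv

    class DivideTheDollarEnv(MultiAgentEnv):
        """Turn-based bargaining: only the agent whose turn it is appears in the obs dict."""

        def __init__(self, config=None):
            super().__init__()
            self.max_rounds = 10

        def reset(self):
            self.round = 0
            self.proposal = 0.0
            # The proposer acts first, so only it gets an observation.
            return {"proposer": np.zeros(1, dtype=np.float32)}

        def step(self, action_dict):
            if "proposer" in action_dict:
                # Turn 1: store the proposal and hand it to the responder as its observation.
                self.proposal = float(action_dict["proposer"])
                obs = {"responder": np.array([self.proposal], dtype=np.float32)}
                # Dummy 0 reward for the proposer; the true reward only comes once the responder acts.
                return obs, {"proposer": 0.0}, {"__all__": False}, {}

            # Turn 2: the responder accepts (1) or rejects (0); only now is the real reward known.
            accept = bool(action_dict["responder"])
            self.round += 1
            done = accept or self.round >= self.max_rounds
            if accept:
                rewards = {"proposer": 1.0 - self.proposal, "responder": self.proposal}
            else:
                rewards = {"proposer": 0.0, "responder": 0.0}  # assuming zero reward on rejection
            if done:
                # Final observations for both agents at termination.
                obs = {
                    "proposer": np.array([self.proposal], dtype=np.float32),
                    "responder": np.array([self.proposal], dtype=np.float32),
                }
            else:
                # Next round: back to the proposer, which sees the rejected proposal.
                obs = {"proposer": np.array([self.proposal], dtype=np.float32)}
            return obs, rewards, {"__all__": done}, {}

The proposer’s Box(0, 1) action space, the responder’s Discrete(2) action space, and the shared network would then go into the multiagent policies config, with a policy_mapping_fn sending both agent IDs to the same policy, if I understand the API correctly. But I am still not sure whether giving the proposer a dummy 0 reward at its own step and the true reward one step later trains the shared policy the way I want.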
What is the right way to build this Env with RLlib’s multi-agent API?
Thanks so much!