Different step space for different agents

Hi @sven1977 and @mannyv

Thanks for the reply, and thanks very much for the great work! After updating Ray to the latest version, 1.5, RLlib no longer throws the error about the mismatch between obs and reward, and I think it works well now. Cheers!

By the way, I don't fully understand the second point you mentioned, @sven1977: "RLlib will sum up rewards for an agent if the agent does not have an observation accompanying the reward."

For example, in my env I have two agents, and I designed the env so that the agent that acts, the agent that receives the next obs, and the agent that receives the reward alternate like this (a minimal code sketch follows the list):
step 0: act: agent1 → obs: agent2, reward: agent1
step 1: act: agent2 → obs: agent1, reward: agent2
step 2: act: agent1 → obs: agent2, reward: agent1
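
To make the pattern concrete, here is a minimal sketch of what I mean, using RLlib's `MultiAgentEnv` API from Ray 1.5 (this is not my actual env; the class name `TwoAgentTurnEnv`, the spaces, and the 10-step horizon are made up for illustration):

```python
import gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TwoAgentTurnEnv(MultiAgentEnv):
    """Turn-based two-agent env: the obs goes to the agent that acts next,
    while the reward goes to the agent that just acted."""

    def __init__(self, config=None):
        self.observation_space = gym.spaces.Discrete(5)
        self.action_space = gym.spaces.Discrete(2)
        self.turn = None  # which agent acts next
        self.t = 0

    def reset(self):
        self.t = 0
        self.turn = "agent1"
        # Only the agent that acts first receives an initial observation.
        return {"agent1": 0}

    def step(self, action_dict):
        # action_dict contains only the action of the agent whose turn it was.
        acting = self.turn
        other = "agent2" if acting == "agent1" else "agent1"
        self.t += 1
        done = self.t >= 10  # arbitrary episode length for illustration

        # Observation is returned for the *other* agent (it acts next),
        # but the reward is for the agent that just acted.
        obs = {other: self.t % 5}
        rewards = {acting: 1.0}
        dones = {"__all__": done}
        self.turn = other
        return obs, rewards, dones, {}
```
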

So in any single step, the obs and the reward never belong to the same agent. Does this mean the policy will sum up the rewards for each agent and wait until the end of the episode to learn? Thanks in advance for your reply!