How should you end a MultiAgentEnv episode?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi,

I am following the documentation (Environments — Ray 2.0.0) to build a custom MultiAgentEnv (which inherits from ray.rllib.env.multi_agent_env — Ray 2.0.0).

At each step t, I return obs, rew, done, info, structured as follows:

  • obs contains the observations of the agents that will need to take an action at t+1 (i.e. it contains no keys for agents which are done),
  • rew and done contain the reward and done values for step t (i.e. they can contain keys for agents which became done at step t),
  • info contains the same agent keys as obs.
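For concreteness, a rough sketch of how I currently build the return inside step() (agent IDs and helper names like live_agents, done_agents, self._get_obs, self._get_reward are placeholders for illustration, not real names from my env):

# Agents in live_agents act again at t+1; agents in done_agents just became done at step t.
obs = {i: self._get_obs(i) for i in live_agents}                 # only agents acting at t+1
rew = {i: self._get_reward(i) for i in live_agents + done_agents}
done = {i: (i in done_agents) for i in live_agents + done_agents}
done["__all__"] = len(live_agents) == 0
info = {i: {} for i in obs}                                       # same keys as obs
return obs, rew, done, info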

However, at the final terminal step in my episode, I am getting the following error from RLlib:

2022-09-22 21:28:15,425	ERROR worker.py:399 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RolloutWorker.sample() (pid=991951, ip=128.40.41.23, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f707ebe1b20>)
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 806, in sample
    batches = [self.input_reader.next()]
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/sampler.py", line 92, in next
    batches = [self.get_data()]
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/sampler.py", line 282, in get_data
    item = next(self._env_runner)
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/sampler.py", line 684, in _env_runner
    active_envs, to_eval, outputs = _process_observations(
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/sampler.py", line 1041, in _process_observations
    ma_sample_batch = sample_collector.postprocess_episode(
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/collectors/simple_list_collector.py", line 435, in postprocess_episode
    pre_batch = collector.build_for_training(policy.view_requirements)
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/collectors/agent_collector.py", line 395, in build_for_training
    shifted_data_np = np.stack(shifted_data, 0)
  File "<__array_function__ internals>", line 180, in stack
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/numpy/core/shape_base.py", line 426, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape

This is how the obs, rew, done, info returned by env.step() are structured at the final terminal step that causes the error:

(RolloutWorker pid=991951) RLlibMultiAgentTeamBasedEnv obs keys: 0 dict_keys([])
(RolloutWorker pid=991951) RLlibMultiAgentTeamBasedEnv rew: 1 {20: -1}
(RolloutWorker pid=991951) RLlibMultiAgentTeamBasedEnv done: 2 {20: True, '__all__': True}
(RolloutWorker pid=991951) RLlibMultiAgentTeamBasedEnv info keys: 0 dict_keys([])

My observation_space is:

Dict(Entity:Dict(Continuous:Box(-1048576.0, 1048576.0, (100, 24), float32), Discrete:Box(0, 4096, (100, 5), int32), N:Box(0, 100, (1,), int32)), Item:Dict(Continuous:Box(-1048576.0, 1048576.0, (170, 16), float32), Discrete:Box(0, 4096, (170, 3), int32), N:Box(0, 170, (1,), int32)), Market:Dict(Continuous:Box(-1048576.0, 1048576.0, (170, 16), float32), Discrete:Box(0, 4096, (170, 3), int32), N:Box(0, 170, (1,), int32)), Tile:Dict(Continuous:Box(-1048576.0, 1048576.0, (225, 4), float32), Discrete:Box(0, 4096, (225, 3), int32), N:Box(0, 15, (1,), int32)))

and my action_space is:

Dict(<class 'nmmo.io.action.Attack'>:Dict(<class 'nmmo.io.action.Style'>:Discrete(3), <class 'nmmo.io.action.Target'>:Discrete(100)), <class 'nmmo.io.action.Buy'>:Dict(<class 'nmmo.io.action.Item'>:Discrete(170)), <class 'nmmo.io.action.Comm'>:Dict(<class 'nmmo.io.action.Token'>:Discrete(170)), <class 'nmmo.io.action.Move'>:Dict(<class 'nmmo.io.action.Direction'>:Discrete(4)), <class 'nmmo.io.action.Sell'>:Dict(<class 'nmmo.io.action.Item'>:Discrete(170), <class 'nmmo.io.action.Price'>:Discrete(100)), <class 'nmmo.io.action.Use'>:Dict(<class 'nmmo.io.action.Item'>:Discrete(170)))

My question is: How should obs, rew, done, info be structured at the terminal step to avoid this error? From the documentation, which says the keys in obs can change, I thought what I had done would be fine, but RLlib seems to expect the observation data to be consistently shaped.

Hi @cwfparsonson,

You are seeing this error because each of these 4 dictionaries needs to have exactly the same keys. The keys can change between steps, but within the return of a single step call they must all be consistent. In the example you provided, you need to add something like the following:

obs = {20: self.observation_space.sample()}  # any observation with the correct shape for agent 20; sampling the space is just one way to get one
info = {20: {}}

Thanks for the reply @mannyv

If at the final terminal step I set obs[i] to be some dummy observation (for example, the initial observation seen by the agent at t=0), and set info[i] to an empty dict as you suggested, I still get the same error:

(RolloutWorker pid=1053424) RLlibMultiAgentTeamBasedEnv obs keys: 1 dict_keys([20])
(RolloutWorker pid=1053424) RLlibMultiAgentTeamBasedEnv rew: 1 {20: -1}
(RolloutWorker pid=1053424) RLlibMultiAgentTeamBasedEnv done: 2 {20: True, '__all__': True}
(RolloutWorker pid=1053424) RLlibMultiAgentTeamBasedEnv info keys: 1 dict_keys([20])
2022-09-22 22:31:56,826	ERROR worker.py:399 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RolloutWorker.sample() (pid=1053424, ip=128.40.41.23, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f19b5211b80>)
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 806, in sample
    batches = [self.input_reader.next()]
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/sampler.py", line 92, in next
    batches = [self.get_data()]
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/sampler.py", line 282, in get_data
    item = next(self._env_runner)
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/sampler.py", line 684, in _env_runner
    active_envs, to_eval, outputs = _process_observations(
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/sampler.py", line 1041, in _process_observations
    ma_sample_batch = sample_collector.postprocess_episode(
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/collectors/simple_list_collector.py", line 435, in postprocess_episode
    pre_batch = collector.build_for_training(policy.view_requirements)
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/collectors/agent_collector.py", line 395, in build_for_training
    shifted_data_np = np.stack(shifted_data, 0)
  File "<__array_function__ internals>", line 180, in stack
  File "/scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/numpy/core/shape_base.py", line 426, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape

@cwfparsonson,

When I get this error it usually means that I am returning inconsistently shaped observations. For example, the obs from reset and step differ, or, if I have conditional logic in step, one of the branches does not create obs in the same way.

One quick debug check for that is to sample from the action space on reset and step to see if you still get the same error.

One other thing to keep in mind is that agents can have different observation shapes, but all agents that map to the same policy must have identically shaped observations.
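To illustrate what I mean (policy names, spaces, and the mapping rule below are made up for this sketch):

from gym.spaces import Box, Discrete
from ray.rllib.policy.policy import PolicySpec

# Two policies with differently shaped observations. Every agent mapped to
# "small_obs_policy" must return (4,)-shaped obs; every agent mapped to
# "large_obs_policy" must return (8,)-shaped obs.
policies = {
    "small_obs_policy": PolicySpec(observation_space=Box(-1.0, 1.0, (4,)), action_space=Discrete(3)),
    "large_obs_policy": PolicySpec(observation_space=Box(-1.0, 1.0, (8,)), action_space=Discrete(3)),
}

def policy_mapping_fn(agent_id, episode, worker, **kwargs):
    # Made-up rule: low agent ids use the small-obs policy.
    return "small_obs_policy" if agent_id <= 12 else "large_obs_policy"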

What do you mean by sampling from the action space to check if the observation is consistently shaped? Just print self.action_space.sample() to see if it’s correct?

As far as I can see, the observations for all 24 of my agents are the same in reset:

(RolloutWorker pid=1096252) Agent i: 1
(RolloutWorker pid=1096252) obs_type Entity obs_dim Continuous: (100, 24)
(RolloutWorker pid=1096252) obs_type Entity obs_dim Discrete: (100, 5)
(RolloutWorker pid=1096252) obs_type Entity obs_dim N: (1,)
(RolloutWorker pid=1096252) obs_type Item obs_dim Continuous: (170, 16)
(RolloutWorker pid=1096252) obs_type Item obs_dim Discrete: (170, 3)
(RolloutWorker pid=1096252) obs_type Item obs_dim N: (1,)
(RolloutWorker pid=1096252) obs_type Market obs_dim Continuous: (170, 16)
(RolloutWorker pid=1096252) obs_type Market obs_dim Discrete: (170, 3)
(RolloutWorker pid=1096252) obs_type Market obs_dim N: (1,)
(RolloutWorker pid=1096252) obs_type Tile obs_dim Continuous: (225, 4)
(RolloutWorker pid=1096252) obs_type Tile obs_dim Discrete: (225, 3)
(RolloutWorker pid=1096252) obs_type Tile obs_dim N: (1,)

as they are in step:

(RolloutWorker pid=1096252) Agent i: 3
(RolloutWorker pid=1096252) obs_type Entity obs_dim Continuous: (100, 24)
(RolloutWorker pid=1096252) obs_type Entity obs_dim Discrete: (100, 5)
(RolloutWorker pid=1096252) obs_type Entity obs_dim N: (1,)
(RolloutWorker pid=1096252) obs_type Item obs_dim Continuous: (170, 16)
(RolloutWorker pid=1096252) obs_type Item obs_dim Discrete: (170, 3)
(RolloutWorker pid=1096252) obs_type Item obs_dim N: (1,)
(RolloutWorker pid=1096252) obs_type Market obs_dim Continuous: (170, 16)
(RolloutWorker pid=1096252) obs_type Market obs_dim Discrete: (170, 3)
(RolloutWorker pid=1096252) obs_type Market obs_dim N: (1,)
(RolloutWorker pid=1096252) obs_type Tile obs_dim Continuous: (225, 4)
(RolloutWorker pid=1096252) obs_type Tile obs_dim Discrete: (225, 3)
(RolloutWorker pid=1096252) obs_type Tile obs_dim N: (1,)

Hi @cwfparsonson,

I put together a simple example that runs cleanly. I tried to make it similar to what your agent, action, and observation spaces might look like. Hopefully it is of some use to you.
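The original notebook is not reproduced in this thread; the sketch below is only a minimal stand-in illustrating the key-consistency pattern discussed above (the spaces, agent IDs, and episode length are made up):

import gym
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class ToyMultiAgentEnv(MultiAgentEnv):
    """Two agents; agent 2 finishes early to show the terminal-step pattern."""

    def __init__(self, config=None):
        super().__init__()
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (4,), np.float32)
        self.action_space = gym.spaces.Discrete(3)
        self._agent_ids = {1, 2}

    def reset(self):
        self.t = 0
        return {i: self.observation_space.sample() for i in (1, 2)}

    def step(self, action_dict):
        self.t += 1
        episode_over = self.t >= 5
        # Every agent that appears in obs also appears in rew, done, and info.
        obs = {i: self.observation_space.sample() for i in action_dict}
        rew = {i: 0.0 for i in action_dict}
        done = {i: episode_over or (i == 2 and self.t >= 3) for i in action_dict}
        done["__all__"] = episode_over
        info = {i: {} for i in action_dict}
        return obs, rew, done, info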

Many thanks for providing the example @mannyv !

One thing I don’t quite understand from your example is how it is possible for the number of keys in the observation to change between steps.

This is what I thought should happen: If at t=0 (i.e. on env.reset()) I have 24 agents, then env.reset() should return an observation with 24 keys. Each of the 24 agents will take an action, so action_dict will have 24 keys passed to env.step(action_dict). Then, inside env.step(), if 4 of these 24 agents are now done after taking their action, we would expect env.step() at t=1 to return observations only for the agents which are not done, so the obs returned by env.step() at t=1 would now have 20 keys. However, the rew and done dicts returned by env.step() would have 24 keys, since they are returning data for the agents which took actions at the previous step (t=0).

I.e. in the above example, I would expect obs to have 20 keys at t=1 and rew, done to have 24 keys. This seems contradictory to what you said and to your example. Is my understanding incorrect? How is it possible for the number of keys in the observation to change between steps if the above is incorrect?

Also, just to sanity check myself, from ray/agent_collector.py at master · ray-project/ray · GitHub, it looks like the things which could be causing this shape consistency error are:

  • info
  • obs
  • action

I am checking that the shapes of info, obs, and action are always the same, but I am just wondering if I am missing some other component which could be causing the shape error…
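For reference, this is roughly the kind of check I am running on every reset() and step() return (a rough sketch; the helper is mine, not something from RLlib):

from gym.spaces.utils import flatdim, flatten

def check_obs_shapes(observation_space, obs_dict, where=""):
    # Rough sanity check: every agent's obs should flatten to the same length
    # as the declared observation_space. `where` is just a label for the message.
    expected = flatdim(observation_space)
    for agent_id, agent_obs in obs_dict.items():
        got = flatten(observation_space, agent_obs).shape[0]
        assert got == expected, (
            f"{where}: agent {agent_id} obs flattens to {got}, expected {expected}")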

This is where the issue is. Within a single step return, the number of keys does not change. In your example at t=1, even though 4 agents are done, you still provide an obs, rew, done, and info entry for them.

Once RLlib receives a done for an agent, it stops requesting values for it: on the next step call (t=2) those four agent keys will not be in the action dictionary. RLlib will, however, store a terminal observation, action, reward, done, and next_obs for them.

The obs you return when the agent is done becomes the next_obs entry.

The number of agents you return can be different from the number passed into step if, for example, some agents are turn based and do not need actions for the next step.
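Concretely, for your t=1 example the step return would look something like this (just a sketch; live_agents, live_obs, and dummy_obs are placeholders, not real names from your env):

# 20 live agents plus the 4 agents (3, 10, 13, 22) that just became done.
obs = {i: live_obs[i] for i in live_agents}
obs.update({i: dummy_obs for i in (3, 10, 13, 22)})    # terminal obs for the done agents
rew = {i: 0 for i in live_agents}
rew.update({3: -1, 10: -1, 13: -1, 22: -1})
done = {i: False for i in live_agents}
done.update({3: True, 10: True, 13: True, 22: True, "__all__": False})
info = {i: {} for i in obs}                             # same keys as obs
return obs, rew, done, info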

Ah okay. Does the ‘terminal observation’ (the observation returned for a done agent) ever have any meaning? Or, as a hacky workaround, can I just return a dummy obs of zeros (shaped as in observation_space) and be safe? Does any RL algorithm ever make use of such a terminal observation for learning? The only thing I can think of it being useful for is rendering final game positions and external analysis.

As far as I know none of the algorithms in rllib learn from that observation.

For some algorithms, like DQN, it will be fed into the policy network to compute a value, but that value is masked out in the loss with code like this:
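(Roughly the following; this is a paraphrase of the kind of masking RLlib's DQN loss applies, not a verbatim quote, and the variable names may differ.)

# The Q-value computed from the next_obs of a terminal transition is zeroed
# out by the done mask, so whatever terminal observation you returned never
# influences the loss.
q_tp1_best_masked = (1.0 - done_mask) * q_tp1_best
q_t_selected_target = rewards + gamma ** n_step * q_tp1_best_masked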

If you use an LSTM, for example, RLlib will pad all trajectories to the same length, and it pads with all zeros, so there are already cases in the library that use all-zero observations. I think it should be safe to pass in all zeros, but there may be edge cases I am not aware of.

Thanks for all your help @mannyv. Unfortunately, even when I ensure the number of keys in obs, rew, done, and info stays consistent within each step,

~~~ Step 1 ~~~
action_dict keys: 24 dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24])
RLlibMultiAgentTeamBasedEnv obs keys: 24 dict_keys([1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 3, 10, 13, 22])
RLlibMultiAgentTeamBasedEnv rew: 24 {1: 0, 2: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 3: -1, 9: 0, 11: 0, 12: 0, 14: 0, 15: 0, 16: 0, 10: -1, 13: -1, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0, 23: 0, 24: 0, 22: -1}
RLlibMultiAgentTeamBasedEnv done: 25 {1: False, 2: False, 4: False, 5: False, 6: False, 7: False, 8: False, 3: True, 9: False, 11: False, 12: False, 14: False, 15: False, 16: False, 10: True, 13: True, 17: False, 18: False, 19: False, 20: False, 21: False, 23: False, 24: False, 22: True, '__all__': False}
RLlibMultiAgentTeamBasedEnv info keys: 24 dict_keys([1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 3, 10, 13, 22])

~~~ Step 2 ~~~
action_dict keys: 20 dict_keys([1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24])
RLlibMultiAgentTeamBasedEnv obs keys: 20 dict_keys([1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 21, 23, 20, 24])
RLlibMultiAgentTeamBasedEnv rew: 20 {1: 0, 2: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 11: 0, 12: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 21: 0, 23: 0, 20: -1, 24: -1}
RLlibMultiAgentTeamBasedEnv done: 21 {1: False, 2: False, 4: False, 5: False, 6: False, 7: False, 8: False, 9: False, 11: False, 12: False, 14: False, 15: False, 16: False, 17: False, 18: False, 19: False, 21: False, 23: False, 20: True, 24: True, '__all__': False}
RLlibMultiAgentTeamBasedEnv info keys: 20 dict_keys([1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 21, 23, 20, 24])

~~~ Step 3 ~~~
action_dict keys: 18 dict_keys([1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 21, 23])
RLlibMultiAgentTeamBasedEnv obs keys: 18 dict_keys([1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 21, 23])
RLlibMultiAgentTeamBasedEnv rew: 18 {1: 0, 2: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 11: 0, 12: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 21: 0, 23: 0}
RLlibMultiAgentTeamBasedEnv done: 19 {1: False, 2: False, 4: False, 5: False, 6: False, 7: False, 8: False, 9: False, 11: False, 12: False, 14: False, 15: False, 16: False, 17: False, 18: False, 19: False, 21: False, 23: False, '__all__': False}
RLlibMultiAgentTeamBasedEnv info keys: 18 dict_keys([1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 21, 23])

I still get the same error:

File /scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/sampler.py:1041, in _process_observations(worker, base_env, active_episodes, unfiltered_obs, rewards, dones, infos, horizon, multiple_episodes_in_batch, callbacks, soft_horizon, no_done_at_end, observation_fn, sample_collector)
   1032 # If, we are not allowed to pack the next episode into the same
   1033 # SampleBatch (batch_mode=complete_episodes) -> Build the
   1034 # MultiAgentBatch from a single episode and add it to "outputs".
   (...)
   1038 # (to e.g. properly flush and clean up the SampleCollector's buffers),
   1039 # but then discard the entire batch and don't return it.
   1040 if not episode.is_faulty or episode.length > 0:
-> 1041     ma_sample_batch = sample_collector.postprocess_episode(
   1042         episode,
   1043         is_done=is_done or (hit_horizon and not soft_horizon),
   1044         check_dones=check_dones,
   1045         build=episode.is_faulty or not multiple_episodes_in_batch,
   1046     )
   1047 if not episode.is_faulty:
   1048     if ma_sample_batch:

File /scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/collectors/simple_list_collector.py:435, in SimpleListCollector.postprocess_episode(self, episode, is_done, check_dones, build)
    433     pid = self.agent_key_to_policy_id[(eps_id, agent_id)]
    434     policy = self.policy_map[pid]
--> 435     pre_batch = collector.build_for_training(policy.view_requirements)
    436     pre_batches[agent_id] = (policy, pre_batch)
    438 # Apply reward clipping before calling postprocessing functions.

File /scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/ray/rllib/evaluation/collectors/agent_collector.py:395, in AgentCollector.build_for_training(self, view_requirements)
    392 # in some multi-agent cases shifted_data may be an empty list.
    393 # In this case we should just create an empty array and return it.
    394 if shifted_data:
--> 395     shifted_data_np = np.stack(shifted_data, 0)
    396 else:
    397     shifted_data_np = np.array(shifted_data)

File <__array_function__ internals>:180, in stack(*args, **kwargs)

File /scratch/zciccwf/py36/envs/nmmo/lib/python3.9/site-packages/numpy/core/shape_base.py:426, in stack(arrays, axis, out)
    424 shapes = {arr.shape for arr in arrays}
    425 if len(shapes) != 1:
--> 426     raise ValueError('all input arrays must have the same shape')
    428 result_ndim = arrays[0].ndim + 1
    429 axis = normalize_axis_index(axis, result_ndim)

ValueError: all input arrays must have the same shape

I will have to try to work out what else could be causing this.

The actions and observations always seem to be the same shape… My agent just randomly chooses a ‘Move’ action (an integer), and the rest of the actions are set to None. Perhaps trying to pass such dummy None actions is causing some issue? The actions returned by my agent look like this:

rllib_actions returned by agent: [defaultdict(<class 'dict'>, {<class 'nmmo.io.action.Attack'>: {<class 'nmmo.io.action.Style'>: None, <class 'nmmo.io.action.Target'>: None}, <class 'nmmo.io.action.Buy'>: {<class 'nmmo.io.action.Item'>: None}, <class 'nmmo.io.action.Comm'>: {<class 'nmmo.io.action.Token'>: None}, <class 'nmmo.io.action.Move'>: {<class 'nmmo.io.action.Direction'>: <class 'nmmo.io.action.West'>}, <class 'nmmo.io.action.Sell'>: {<class 'nmmo.io.action.Item'>: None, <class 'nmmo.io.action.Price'>: None}, <class 'nmmo.io.action.Use'>: {<class 'nmmo.io.action.Item'>: None}})]

RLlib automatically flattens the obs, so when it is received by the agent it is of the form:

obs of agent: (1, 10939) [[  1.   1.   0. ...  40. 277.  15.]]

This is of course different from the original observation_space Dict I posted above, but I assume that isn't an issue since the flattening is done internally by RLlib.
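As a quick sanity check (a sketch only; gym's flatdim is what I would use here), the declared space can be compared against that flattened length:

from gym.spaces.utils import flatdim

# Should print 10939 if the declared observation_space matches the flattened
# observation length the agent receives.
print(flatdim(env.observation_space))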

@cwfparsonson,

Are you able to share the code your environment or provide a reproduction script?

I think it is probably the Nones. Can you try sending self.action_space.sample() as the actions and see if that helps?

Edit: Just realized you have a custom agent. It’s probably not self.action_space. You might need to get it from the config.
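Something along these lines (a sketch only; exactly where the space lives depends on your agent/wrapper):

# Replace the None sub-actions with a full sample from the action space, so
# every action is complete and consistently shaped.
actions = {agent_id: env.action_space.sample() for agent_id in obs}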

Hi @mannyv,

Sorry for the slow reply, I’ve been bogged down with work recently.

What do you mean by getting the action space from the config?

I’ve put my repo on GitHub (GitHub - cwfparsonson/deep_nmmo) on the rllib_single_agent branch - I’m having a crack at integrating RLlib with the NeurIPS NMMO team-based environment (Files · master · Neural MMO / neurips2022-nmmo · GitLab). I’ve put a copy of my conda nmmo environment in deep_nmmo/environment.yaml at rllib_single_agent · cwfparsonson/deep_nmmo · GitHub, and a setup.py file in deep_nmmo/deep_nmmo at rllib_single_agent · cwfparsonson/deep_nmmo · GitHub to install my custom deep_nmmo library with python setup.py develop. Finally, I have put a notebook in deep_nmmo/contained_neurips2022nmmo_rllib_notebook.ipynb at rllib_single_agent · cwfparsonson/deep_nmmo · GitHub which tries to run the environment and produces the above error.

I suspect it might be tricky to set up and run yourself, but perhaps there is something obvious you can see at a glance that I am doing wrong. Sorry for not being able to put a clean version in a self-contained Colab like you did; I had to do a fair amount of custom wrapping to integrate neurips2022nmmo.TeamBasedEnv with RLlib, so it was tough to port it all into one notebook.

@mannyv FYI I implemented a version which uses action_space.sample() for each action, so the actions must be in the action space and consistently shaped, but I still get the shape error.

Hi @cwfparsonson,

I cloned your repo and set up the environment just now. I cannot spend more time on this today but this is what I am seeing.

The first action is different from the rest of the actions as you can see from the screenshots below. This is likely because of something in your policy but I have not had time to dig into that. Hopefully this gives you a direction to check.

Note the shape differences: () vs (8,)

That’s very strange: I cannot find where any action is set to 0 as in your screenshot, and it is odd that RLlib has automatically flattened the nested action space dict for every action except the first one you found.

Do you know based on that screenshot where that action would have been chosen in the episode? Or is it just sampled randomly so could be anywhere (start, middle, or end)?

I will continue to try to find the source of this.