MultiAgentEnv works with PPO.train() but not tune.Tuner.fit()

How severely does this issue affect your experience of using Ray?

  • Low/Medium: It causes significant difficulty in completing my task, but I can work around it.

I have a basic MultiAgentEnv that runs fine with PPO.train() (or so it appears, I’m still learning Ray) but fails with tune.Tuner.fit(). The error is:

ValueError: The two structures don't have the same nested structure.

First structure: type=ndarray str=[0.1 0. ]

Second structure: type=OrderedDict str=OrderedDict([('agent_0', array([ 8.450692e+17, -5.816952e+16], dtype=float32)), ('agent_1', array([-7.0000346e+17, -9.6193201e+17], dtype=float32)), ('agent_2', array([-5.5688862e+17, 7.0265260e+17], dtype=float32)), ('agent_3', array([6.6857392e+17, 8.4390335e+17], dtype=float32)), ('agent_4', array([2.0008603e+17, 1.2134728e+17], dtype=float32)), ('agent_5', array([3.9633845e+17, 7.5621022e+17], dtype=float32)), ('agent_6', array([ 8.7897234e+17, -3.4141877e+17], dtype=float32)), ('agent_7', array([-3.0604514e+17, 3.4147052e+17], dtype=float32)), ('agent_8', array([-6.072403e+17, -9.624188e+17], dtype=float32)), ('agent_9', array([-4.1622738e+17, -5.9749262e+17], dtype=float32))])

More specifically: Substructure "type=OrderedDict str=OrderedDict([('agent_0', array([ 8.450692e+17, -5.816952e+16], dtype=float32)), ('agent_1', array([-7.0000346e+17, -9.6193201e+17], dtype=float32)), ('agent_2', array([-5.5688862e+17, 7.0265260e+17], dtype=float32)), ('agent_3', array([6.6857392e+17, 8.4390335e+17], dtype=float32)), ('agent_4', array([2.0008603e+17, 1.2134728e+17], dtype=float32)), ('agent_5', array([3.9633845e+17, 7.5621022e+17], dtype=float32)), ('agent_6', array([ 8.7897234e+17, -3.4141877e+17], dtype=float32)), ('agent_7', array([-3.0604514e+17, 3.4147052e+17], dtype=float32)), ('agent_8', array([-6.072403e+17, -9.624188e+17], dtype=float32)), ('agent_9', array([-4.1622738e+17, -5.9749262e+17], dtype=float32))])" is a sequence, while substructure "type=ndarray str=[0.1 0. ]" is not

Entire first structure:

.

Entire second structure:

OrderedDict([('agent_0', .), ('agent_1', .), ('agent_2', .), ('agent_3', .), ('agent_4', .), ('agent_5', .), ('agent_6', .), ('agent_7', .), ('agent_8', .), ('agent_9', .)])

(PPO pid=16357) 2023-12-01 20:39:59,702 ERROR actor_manager.py:500 -- Ray error, taking actor 1 out of service. ray::RolloutWorker.apply() (pid=16362, ip=127.0.0.1, actor_id=9d3d30bad174b049a0aebe7f01000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x1483ff0a0>)

The code is here in a simplified form. Everything up to line 275 is just setup and class definitions; PPO.train() is run on line 309 and tune.Tuner.fit() on line 341. The error seems to indicate that tune.Tuner.fit() is comparing the observation of a single agent against the observations of all agents, but I'm not sure why. Even more confusing, the exact same class works with PPO.train(). I'm not sure what the difference between the two is, but I'm guessing it has something to do with how the MultiAgentEnv is used under the hood. Any help getting it working with tune.Tuner.fit() would be greatly appreciated!
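For context, here is a minimal sketch of how the two runs are set up (MyMultiAgentEnv, RandomAction, and the policy_mapping_fn are placeholders standing in for the actual definitions in the linked script):

    from ray import tune
    from ray.rllib.algorithms.ppo import PPOConfig
    from ray.rllib.policy.policy import PolicySpec

    config = (
        PPOConfig()
        .environment(MyMultiAgentEnv)  # placeholder for the custom MultiAgentEnv
        .multi_agent(
            policies={
                "default_policy": PolicySpec(policy_class=RandomAction),
                "learned": PolicySpec(),
            },
            policy_mapping_fn=lambda agent_id, *args, **kwargs: "learned",
        )
    )

    # Around line 309: build the algorithm and call train() directly; this runs fine.
    algo = config.build()
    algo.train()

    # Around line 341: hand the same config to Tune; this raises the error above.
    tune.Tuner("PPO", param_space=config.to_dict()).fit()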

Solved, sort of.

Policies passed to tune.Tuner.fit() do not infer the observation or action spaces, even if the agent classes define a self.observation_space attribute. In other words, each PolicySpec passed to tune.Tuner.fit() via the param_space dictionary needs explicit observation_space and action_space arguments. In my example:

"policies": {
      "default_policy": PolicySpec(
          policy_class=RandomAction,
          observation_space=gym.spaces.Box(-1e18, 1e18, (2,)), # <-----
          action_space=gym.spaces.Discrete(3), # <-----
      ),
      "learned": PolicySpec(
          config=AlgorithmConfig.overrides(
              model={"use_lstm": True},
              framework_str="torch",
          ),
          observation_space=gym.spaces.Box(-1e18, 1e18, (2,)), # <-----
          action_space=gym.spaces.Discrete(3), # <-----
      ),
  },
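With the spaces spelled out, the same policies dict can be handed to Tune through param_space. A minimal sketch, assuming the config is built with PPOConfig and converted to a dict, with MyMultiAgentEnv standing in for the actual env class:

    from ray import tune
    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment(MyMultiAgentEnv)  # placeholder for the custom MultiAgentEnv
        .multi_agent(
            policies=policies,  # the dict shown above, with explicit spaces
            policy_mapping_fn=lambda agent_id, *args, **kwargs: "learned",
        )
    )

    # With observation_space/action_space set on each PolicySpec, Tune no longer
    # hits the nested-structure mismatch.
    tune.Tuner("PPO", param_space=config.to_dict()).fit()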

While this solves the problem, it does not explain why omitting observation_space and action_space works with PPO.train() but breaks tune.Tuner.fit().