[RLlib] MBMPO error after Training Dynamics Ensemble

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I'm running a custom environment with a multi-dimensional continuous action space using the MBMPO algorithm on RLlib version 2.6.1.
Everything runs well until the end of "Training Dynamics Ensemble", when I get the following error:
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable/trainable.py", line 384, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable/trainable.py", line 381, in train
result = self.step()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/algorithms/algorithm.py", line 792, in step
results, train_iter_ctx = self._run_one_training_iteration()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/algorithms/algorithm.py", line 2813, in _run_one_training_iteration
results = next(self.train_exec_impl)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 786, in next
return next(self.built_iterator)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 911, in apply_flatten
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 814, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 798, in apply_transform
for item in fn(it):
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/algorithms/mbmpo/mbmpo.py", line 541, in inner_adaptation_steps
for samples in itr:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 498, in base_iterator
yield ray.get(futures, timeout=timeout)
File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ray/_private/worker.py", line 2521, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::RolloutWorker.par_iter_next() (pid=16422, ip=172.17.0.2, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f5ac0edd048>)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 1194, in par_iter_next
return next(self.local_it)
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 490, in gen_rollouts
yield self.sample()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 915, in sample
batches = [self.input_reader.next()]
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/sampler.py", line 92, in next
batches = [self.get_data()]
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/sampler.py", line 277, in get_data
item = next(self._env_runner)
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/env_runner_v2.py", line 323, in run
outputs = self.step()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/env_runner_v2.py", line 379, in step
self._base_env.send_actions(actions_to_send)
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/env/vector_env.py", line 464, in send_actions
) = self.vector_env.vector_step(action_vector)
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/env/wrappers/model_vector_env.py", line 156, in vector_step
list(rew_batch),
TypeError: 'int' object is not iterable

My code is:

from ray.rllib.algorithms.mbmpo import MBMPOConfig
from ray.tune.registry import register_env

register_env("custom_env", lambda config: MyEnv(config))
env = MyEnv("")

config = MBMPOConfig()
config = config.resources(num_gpus=1)
config = config.rollouts(num_rollout_workers=8)
config = config.debugging(log_level="DEBUG")

print(config.to_dict())

algo = config.build(env="custom_env")
algo.train()

Any suggestions?
Thanks

@blackpanther The error looks to me as if it originates from your environment (either a vector env itself or automatically wrapped into one by RLlib).

I would take a closer look at the output of your step() function.
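
A quick check along these lines often surfaces the offending type or shape right away. This is only a sketch: it constructs MyEnv the same way as in your script and assumes a gymnasium-style reset()/step() API, so adjust it to your environment.

import numpy as np

# Sanity-check the raw env output outside of RLlib (illustrative only).
env = MyEnv("")  # constructed as in the script above
obs, info = env.reset()
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
print(type(obs), np.shape(obs), type(reward), terminated, truncated, info)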

Many thanks @Lars_Simon_Zehnder for your guidance. It took me a long time to debug because of the complexity added by the multiple threads running during execution.
I noticed that this method (RLlib's model_vector_env.vector_step(), which appears in the traceback above):

def vector_step(self, actions):
    if self.cur_obs is None:
        raise ValueError("Need to reset env first")

    for idx in range(self.num_envs):
        self._timesteps[idx] += 1

    # If discrete, need to one-hot actions
    if isinstance(self.action_space, Discrete):
        act = np.array(actions)
        new_act = np.zeros((act.size, act.max() + 1))
        new_act[np.arange(act.size), act] = 1
        actions = new_act.astype("float32")

    # Batch the TD-model prediction.
    obs_batch = np.stack(self.cur_obs, axis=0)
    action_batch = np.stack(actions, axis=0)
    # Predict the next observation, given previous a) real obs
    # (after a reset), b) predicted obs (any other time).
    next_obs_batch = self.model.predict_model_batches(
        obs_batch, action_batch, device=self.device
    )
    next_obs_batch = np.clip(next_obs_batch, -1000, 1000)

    # Call env's reward function.
    # Note: Each actual env must implement one to output exact rewards.
    rew_batch = self.envs[0].reward(obs_batch, action_batch, next_obs_batch)

    # If env has a `done` method, use it.
    if hasattr(self.envs[0], "done"):
        dones_batch = self.envs[0].done(next_obs_batch)
    # Our sub-environments have timestep limits.
    elif hasattr(self.envs[0], "_max_episode_steps"):
        dones_batch = np.array(
            [
                self._timesteps[idx] >= self.envs[0]._max_episode_steps
                for idx in range(self.num_envs)
            ]
        )
    # Otherwise, assume the episode does not end.
    else:
        dones_batch = np.asarray([False for _ in range(self.num_envs)])
    truncateds_batch = [False for _ in range(self.num_envs)]

    info_batch = [{} for _ in range(self.num_envs)]

    self.cur_obs = next_obs_batch
    # print(next_obs_batch.shape, rew_batch)
    return (
        list(next_obs_batch),
        [rew_batch],  # originally: list(rew_batch)
        list(dones_batch),
        truncateds_batch,
        info_batch,
    ) 

Apparently it expects a list of rewards from the env.reward() method, but my custom environment returns only a single value. It may be that I misunderstand how this method should be implemented; as an experiment, I changed vector_step to return my single value (as shown above), but that caused the program to run until it went out of memory.
It is also strange that in rew_batch = self.envs[0].reward(obs_batch, action_batch, next_obs_batch), action_batch contains a single action, which always yields a single reward.
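
To make the mismatch concrete, this toy snippet (made-up sizes, purely illustrative) reproduces why the traceback ends in 'int' object is not iterable and what vector_step() actually expects back from reward():

import numpy as np

# vector_step() calls list(rew_batch), so reward() has to return one
# reward per sub-environment, not a scalar.
num_envs = 8
scalar_reward = 0                    # what my reward() currently returns
try:
    list(scalar_reward)
except TypeError as e:
    print(e)                         # 'int' object is not iterable

batched_reward = np.zeros(num_envs)  # what vector_step() expects
print(list(batched_reward))          # one reward per sub-environment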

My reward implementation returns one reward for a given synthetic action. Since my environment is sequence-based, it only moves to the next time step when the step method is called. From the MBMPO paper I understand that the algorithm explores synthetic actions as part of its model-based approach, so to get multiple rewards, multiple actions must be sent to the reward method, which is not what I observe while debugging. Could you please clarify?

Thanks

@blackpanther, it is hard to say what exactly exhausts the memory here without debugging it. Maybe a look at the MuJoCo MBMPO environments here gives you an idea of whether the reward function is correctly specified. Either your reward function does not return a batch where one is expected, or you need to run the reward function for all sub-environments.
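
As a rough sketch of that contract (MyEnv and _single_reward are placeholder names, and the per-transition formula is made up; the overall pattern follows the MuJoCo wrapper examples): reward() receives the stacked (num_envs, ...) arrays from vector_step() and must return one reward per row.

import gymnasium as gym
import numpy as np


class MyEnv(gym.Env):
    # ... observation_space, action_space, reset(), step() as in your env ...

    def _single_reward(self, obs, action, obs_next):
        # Placeholder per-transition reward; replace with your real logic.
        return -float(np.sum(np.square(action)))

    def reward(self, obs, action, obs_next):
        # model_vector_env.vector_step() passes (num_envs, dim)-shaped arrays,
        # so the return value must contain one reward per row, not a scalar.
        obs = np.asarray(obs)
        if obs.ndim == 2:
            return np.array(
                [
                    self._single_reward(o, a, o_next)
                    for o, a, o_next in zip(obs, action, obs_next)
                ]
            )
        return self._single_reward(obs, action, obs_next)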

Another point I saw: I am unsure about it, but is setting

new_act = np.zeros((act.size, act.max() + 1)) 

safe? What if the maximum observed action is not the maximum possible action in the Discrete action space?
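
A safer variant could size the one-hot matrix by the action space itself. Something like this hypothetical helper (one_hot_actions is not an RLlib function, just an illustration):

import numpy as np


def one_hot_actions(actions, num_actions):
    # Size the one-hot matrix by the Discrete space's `n` (num_actions)
    # instead of `act.max() + 1`, which shrinks the matrix whenever the
    # largest possible action does not appear in the current batch.
    act = np.asarray(actions)
    new_act = np.zeros((act.size, num_actions), dtype=np.float32)
    new_act[np.arange(act.size), act] = 1.0
    return new_act


# Inside vector_step() this would become (illustrative):
# actions = one_hot_actions(actions, self.action_space.n)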

Thanks again @Lars_Simon_Zehnder, the examples helped a lot, since I could not find any information about how to implement the reward method in the OpenAI documentation. This seems to be the root cause of the error.
It also made me study MBMPO more deeply, and I've realized it will probably not be a good fit for my use case: since my environment is stochastic, the learned dynamics model will also struggle to learn and will likely not perform better than my previous attempts.
Maybe you have a recommendation here. I tried MBMPO after failing to converge with A3C and PPO, guessing that a model-based approach could be more efficient given the complexity of my environment, which is best described as a partially observable MDP. Any suggestions are very welcome!