How severely does this issue affect your experience of using Ray?
- Low: It annoys or frustrates me for a moment.
Hi!
I’m training a policy model with the PPO algorithm. My computer has 8 CPUs and I’m running one rollout worker on each of them.
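My setup looks roughly like this (a simplified sketch: "MyEnv" is a placeholder for my custom environment, and the model/training hyperparameters are omitted):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Simplified sketch of my config; "MyEnv" stands in for my real environment.
config = (
    PPOConfig()
    .environment(env="MyEnv")
    .framework("torch")
    .rollouts(num_rollout_workers=7)  # roughly one rollout worker per CPU core,
                                      # leaving one core for the driver
)
algo = config.build()

for _ in range(10):
    result = algo.train()
```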
One of the rollout workers keeps hitting the same error:
:actor_name:RolloutWorker
2024-07-31 11:35:59,525 ERROR actor_manager.py:182 -- Worker exception caught during `apply()`: operands could not be broadcast together with shapes (1142,) (1141,)
Traceback (most recent call last):
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\utils\actor_manager.py", line 178, in apply
return func(self, *args, **kwargs)
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\execution\rollout_ops.py", line 99, in <lambda>
(lambda w: w.sample())
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 685, in sample
batches = [self.input_reader.next()]
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\evaluation\sampler.py", line 91, in next
batches = [self.get_data()]
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\evaluation\sampler.py", line 273, in get_data
item = next(self._env_runner)
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\evaluation\env_runner_v2.py", line 348, in run
outputs = self.step()
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\evaluation\env_runner_v2.py", line 374, in step
active_envs, to_eval, outputs = self._process_observations(
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\evaluation\env_runner_v2.py", line 703, in _process_observations
sample_batch = self._try_build_truncated_episode_multi_agent_batch(
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\evaluation\env_runner_v2.py", line 1004, in _try_build_truncated_episode_multi_agent_batch
episode.postprocess_episode(batch_builder=batch_builder, is_done=False)
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\evaluation\episode_v2.py", line 320, in postprocess_episode
post_batch = policy.postprocess_trajectory(post_batch, other_batches, self)
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\algorithms\ppo\ppo_torch_policy.py", line 215, in postprocess_trajectory
return compute_gae_for_sample_batch(
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\evaluation\postprocessing.py", line 204, in compute_gae_for_sample_batch
batch = compute_advantages(
File "c:\Users\grhen\anaconda3\envs\eprllib1-2-5\lib\site-packages\ray\rllib\evaluation\postprocessing.py", line 128, in compute_advantages
delta_t = rewards + gamma * vpred_t[1:] - vpred_t[:-1]
ValueError: operands could not be broadcast together with shapes (1142,) (1141,)
After the error, the worker is restarted and training continues, so it doesn’t block me. But I would like to understand why this is happening. Is it something I can fix, or is it a normal error in the training process?
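For context, here is my reading of the line that fails in postprocessing.py. If I understand correctly, `vpred_t` is the fragment’s value predictions with a bootstrap value appended, so it should be exactly one element longer than `rewards`; the traceback suggests that in my case `rewards` ended up one element longer than the value predictions instead. A minimal NumPy sketch of that shape relationship (the lengths are just illustrative, and the concatenation mirrors what I believe `compute_advantages` does):

```python
import numpy as np

gamma = 0.99
T = 5  # illustrative fragment length; in my run it seems to be 1141

# Normal case: one value prediction per collected timestep,
# plus a bootstrap value last_r appended for the truncated episode.
rewards = np.random.randn(T)                              # shape (T,)
vf_preds = np.random.randn(T)                             # shape (T,)
last_r = 0.0
vpred_t = np.concatenate([vf_preds, np.array([last_r])])  # shape (T + 1,)

# The line from compute_advantages() that fails in my run:
delta_t = rewards + gamma * vpred_t[1:] - vpred_t[:-1]    # all operands (T,) -> works

# What seems to happen in my case: rewards is one element longer than vf_preds,
# like the (1142,) vs (1141,) in the traceback.
rewards_bad = np.random.randn(T + 1)
try:
    rewards_bad + gamma * vpred_t[1:] - vpred_t[:-1]
except ValueError as err:
    print(err)  # operands could not be broadcast together with shapes (6,) (5,)
```

If that reading is correct, one of my episode fragments collects one more reward than value prediction, which makes me wonder whether my environment returns an extra step or reward around truncation.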