KeyError: 'obs' In tower 0 on device cpu

Traceback (most recent call last):
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::PPO.train() (pid=285359, ip=130.237.233.123, actor_id=1a4964d041decbd253209b4601000000, repr=PPO)
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py", line 84, in loss
logits, state = model(train_batch)
^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/models/modelv2.py", line 244, in __call__
input_dict["obs"], self.obs_space, self.framework
~~~~~~~~~~^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/policy/sample_batch.py", line 950, in __getitem__
value = dict.__getitem__(self, key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'obs'

The above exception was the direct cause of the following exception:

ray::PPO.train() (pid=285359, ip=130.237.233.123, actor_id=1a4964d041decbd253209b4601000000, repr=PPO)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 328, in train
result = self.step()
^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 878, in step
train_results, train_iter_ctx = self._run_one_training_iteration()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 3156, in _run_one_training_iteration
results = self.training_step()
^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 426, in training_step
return self._training_step_old_and_hybrid_api_stacks()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 601, in _training_step_old_and_hybrid_api_stacks
train_results = multi_gpu_train_one_step(self, train_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/execution/train_ops.py", line 176, in multi_gpu_train_one_step
results = policy.learn_on_loaded_batch(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/policy/torch_policy_v2.py", line 831, in learn_on_loaded_batch
return self.learn_on_batch(batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
return func(self, *a, **k)
^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/policy/torch_policy_v2.py", line 712, in learn_on_batch
grads, fetches = self.compute_gradients(postprocessed_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
return func(self, *a, **k)
^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/policy/torch_policy_v2.py", line 924, in compute_gradients
tower_outputs = self._multi_gpu_parallel_grad_calc([postprocessed_batch])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1421, in _multi_gpu_parallel_grad_calc
raise last_result[0] from last_result[1]
ValueError: obs
tracebackTraceback (most recent call last):
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1336, in _worker
self.loss(model, self.dist_class, sample_batch)
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py", line 84, in loss
logits, state = model(train_batch)
^^^^^^^^^^^^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/models/modelv2.py", line 244, in __call__
input_dict["obs"], self.obs_space, self.framework
~~~~~~~~~~^^^^^^^
File "/scratch/ftonti/miniconda3/envs/rllib/lib/python3.11/site-packages/ray/rllib/policy/sample_batch.py", line 950, in __getitem__
value = dict.__getitem__(self, key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'obs'

In tower 0 on device cpu

I get the above error when running the script on Linux (Debian), but it runs fine on my Mac M2. The environment is the same and the script is exactly the same. :frowning: I have tried a million times but could not solve the issue. Can you give me some help? Thanks a lot!

I have a similar problem. I have an environment that I created and successfully trained with RLlib 2.10.0 on Arch Linux, but after upgrading to RLlib 2.21 it stopped working. I tried every version down to 2.11 and still had the issue; downgrading back to 2.10 solved it.
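(For reference, pinning back is just `pip install "ray[rllib]==2.10.0"` in a pip-managed environment; adjust accordingly if you use conda.)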

I have another environment that I created and trained with 2.10, and it has no problems on 2.21. I think the issue may have something to do with the first environment (the one that won't work) having a Dict nested inside a Dict in its observation space.
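To make "Dict within Dict" concrete, here is a minimal sketch of that kind of nested observation space (the field names and shapes are placeholders, not my actual env):

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class NestedDictEnv(gym.Env):
    """Toy env whose observation space is a Dict nested inside another Dict."""

    def __init__(self, config=None):
        self.observation_space = spaces.Dict(
            {
                "sensors": spaces.Dict(
                    {
                        "position": spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32),
                        "velocity": spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32),
                    }
                ),
                "status": spaces.Discrete(4),
            }
        )
        self.action_space = spaces.Discrete(2)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        # Random observation, zero reward, terminate immediately: only the space layout matters here.
        return self.observation_space.sample(), 0.0, True, False, {}
```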


In Ray 2.30, this can be worked around by either reducing rollout_fragment_length or increasing sample_timeout_s in env_runners(). The error is caused by a timeout while collecting rollout fragments, and either of these changes mitigates it.
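For reference, a minimal sketch of those two settings on a PPOConfig (the environment and the numbers are placeholders, not recommendations):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .env_runners(
        # Smaller fragments make each sampling request finish sooner ...
        rollout_fragment_length=64,
        # ... and/or give the env runners more time before sampling times out.
        sample_timeout_s=180.0,
    )
    .training(train_batch_size=4000)
)
algo = config.build()
```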

Also make sure train_batch_size is evenly divisible, to within about 10%, by rollout_fragment_length (times the number of env runners and envs per runner); you'll see a different validation error if it isn't.
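A back-of-the-envelope illustration of that relationship (plain Python, not an RLlib API; the variable names just mirror the config keys):

```python
# One sampling round collects this many timesteps in total:
rollout_fragment_length = 50
num_env_runners = 4
num_envs_per_env_runner = 2
samples_per_round = rollout_fragment_length * num_env_runners * num_envs_per_env_runner  # 400

# train_batch_size should sit close to a whole multiple of samples_per_round;
# RLlib tolerates a mismatch of roughly 10% of train_batch_size before erroring out.
train_batch_size = 4000
remainder = train_batch_size % samples_per_round
mismatch = min(remainder, samples_per_round - remainder) / train_batch_size
print(mismatch)  # 0.0 here -- 4000 is exactly 10 rounds of 400 timesteps
```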