I am trying to run multiple environments that require GPU resources. My goal is to allocate a small fraction of the GPU (e.g., 0.05) to the learner worker (which holds the policy) and split the remaining fraction evenly among the environment runners. Here is the relevant part of my config:
```python
config.resources(
    num_cpus_per_worker=1,
    # Split the remaining 95% of the GPU evenly across the env runners.
    num_gpus_per_worker=(1 - 0.05) / args.num_env_runners
    if torch.cuda.is_available()
    else 0,
    # Reserve 5% of the GPU for the learner worker.
    num_gpus_per_learner_worker=0.05 if torch.cuda.is_available() else 0,
    num_cpus_per_learner_worker=1,
)
```
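For context, here is a minimal, self-contained sketch of the surrounding setup. The environment name, the runner count, and the 0.05 fraction are placeholders from my own script, not values prescribed by RLlib:

```python
import torch
from ray.rllib.algorithms.ppo import PPOConfig

LEARNER_GPU_FRACTION = 0.05
num_env_runners = 4  # stands in for args.num_env_runners

config = (
    PPOConfig()
    .environment("CartPole-v1")  # placeholder environment
    .env_runners(num_env_runners=num_env_runners)
    .resources(
        num_cpus_per_worker=1,
        # Each runner requests an equal share of the remaining 0.95 GPU,
        # so the total request is 0.95 + 0.05 = 1.0 GPU.
        num_gpus_per_worker=(
            (1 - LEARNER_GPU_FRACTION) / num_env_runners
            if torch.cuda.is_available()
            else 0
        ),
        num_gpus_per_learner_worker=(
            LEARNER_GPU_FRACTION if torch.cuda.is_available() else 0
        ),
        num_cpus_per_learner_worker=1,
    )
)
```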
However, I am encountering the following error:
```
2024-06-20 02:51:31,348 ERROR tune_controller.py:1331 -- Trial task failed for trial PPO_learning_rate_env_bf14a_00000
Traceback (most recent call last):
  File "/home/david/anaconda3/envs/ray/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/david/anaconda3/envs/ray/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  ...
IndexError: list index out of range
(PPO pid=719218) 2024-06-20 02:51:31,344 WARNING rollout_ops.py:115 -- No samples returned from remote workers. If you have a slow environment or model, consider increasing the `sample_timeout_s` or decreasing the `rollout_fragment_length` in `AlgorithmConfig.env_runners()`.
```
I tried increasing `sample_timeout_s`, but it didn't resolve the issue.
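For reference, this is roughly how I raised it (the value 600 is just one of the settings I tried, not a recommendation):

```python
# Give the env runners far longer to return samples per iteration;
# the same warning and IndexError still occur.
config.env_runners(sample_timeout_s=600)
```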
Can anyone help me understand what is causing this error, and how to properly allocate fractional GPU resources between the learner and the environment runners? Any insights or guidance would be greatly appreciated.
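In case it helps with diagnosis, this is the kind of quick sanity check I can run to confirm how much GPU Ray sees (a hypothetical snippet, separate from my training script):

```python
import ray

ray.init()
# On a single-GPU machine I would expect something like
# {'CPU': ..., 'GPU': 1.0, ...} here.
print(ray.cluster_resources())
```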