How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am running out of CUDA memory while training a PPO agent with Ray + RLlib 1.12.1. What confuses me is that it always happens after the first iteration (the run is seeded): the first iteration trains without any problem, then shortly afterwards training fails with a CUDA out-of-memory error. Here is the error trace:
Failure # 1 (occurred at 2022-07-21_17-07-06)
Traceback (most recent call last):
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 901, in get_next_executor_event
future_result = ray.get(ready_future)
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/worker.py", line 1809, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::PPOTrainer.train() (pid=2527837, ip=10.139.202.137, repr=PPOTrainer)
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/tune/trainable.py", line 349, in train
result = self.step()
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1093, in step
raise e
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1074, in step
step_attempt_results = self.step_attempt()
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1155, in step_attempt
step_results = self._exec_plan_or_training_iteration_fn()
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 2174, in _exec_plan_or_training_iteration_fn
results = next(self.train_exec_impl)
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 779, in __next__
return next(self.built_iterator)
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 807, in apply_foreach
for item in it:
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 807, in apply_foreach
for item in it:
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 869, in apply_filter
for item in it:
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 869, in apply_filter
for item in it:
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 807, in apply_foreach
for item in it:
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 807, in apply_foreach
for item in it:
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 815, in apply_foreach
result = fn(item)
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/execution/train_ops.py", line 318, in __call__
num_loaded_samples[policy_id] = self.local_worker.policy_map[
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 531, in load_batch_into_buffer
slices = [slice.to_device(self.devices[i]) for i, slice in enumerate(slices)]
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 531, in <listcomp>
slices = [slice.to_device(self.devices[i]) for i, slice in enumerate(slices)]
File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/policy/sample_batch.py", line 681, in to_device
self[k] = torch.from_numpy(v).to(device)
RuntimeError: CUDA out of memory. Tried to allocate 9.85 GiB (GPU 0; 39.5 GiB total capacity; 31.33 GiB already allocated; 6.59 GiB free; 31.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
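The error message itself suggests setting max_split_size_mb. I have not tried that yet; my assumption (not verified) is that, since the OOM happens inside the PPOTrainer actor rather than in the driver, the allocator option would have to reach the worker processes, for example via runtime_env, roughly like this (the value 128 is just an illustrative number, not a recommendation):

```python
import ray

# Sketch only (my assumption): forward the PyTorch allocator hint from the
# error message to all Ray worker processes. 128 MB is an arbitrary value.
ray.init(
    runtime_env={"env_vars": {"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128"}}
)
```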
I also tried setting rollout_fragment_length=1 (and 16, 32, 100, 200), since the error happens while the sampled batch is being loaded onto the GPU, but the error still occurs. The fact that the first iteration trains successfully (giving me metrics that make sense and all) before it fails is just strange. Is there a different PPO-related config I should change, or anything else I should try? I initially tried changing sgd_minibatch_size before I had properly read the error message, but that did not help either.
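For context, here is a minimal sketch of the kind of config I am tuning, using the standard RLlib 1.12 PPO keys; the environment name and all numbers are illustrative placeholders, not my actual settings:

```python
import ray
from ray import tune

# Illustrative sketch only: env and values are placeholders, not my real config.
config = {
    "env": "CartPole-v1",            # placeholder; my real env is different
    "framework": "torch",
    "num_gpus": 1,
    "num_workers": 4,
    "seed": 0,                       # the run is seeded
    "rollout_fragment_length": 32,   # tried 1, 16, 32, 100, 200
    "train_batch_size": 4000,
    "sgd_minibatch_size": 128,       # also tried reducing this
    "num_sgd_iter": 10,
}

ray.init()
tune.run("PPO", config=config, stop={"training_iteration": 10})
```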
Thanks for any help.