Running out of CUDA memory - sample batch, rollout fragment length is not helping

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I am running out of CUDA memory while training a PPO agent with Ray RLlib 1.12.1. What confuses me is that it always happens after the first iteration (the run is seeded): the first iteration trains without any problem, and shortly afterwards training fails with a CUDA out-of-memory error. Here is the error trace:

Failure # 1 (occurred at 2022-07-21_17-07-06)
Traceback (most recent call last):
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 901, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/worker.py", line 1809, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::PPOTrainer.train() (pid=2527837, ip=10.139.202.137, repr=PPOTrainer)
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/tune/trainable.py", line 349, in train
    result = self.step()
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1093, in step
    raise e
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1074, in step
    step_attempt_results = self.step_attempt()
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1155, in step_attempt
    step_results = self._exec_plan_or_training_iteration_fn()
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 2174, in _exec_plan_or_training_iteration_fn
    results = next(self.train_exec_impl)
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 779, in __next__
    return next(self.built_iterator)
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 807, in apply_foreach
    for item in it:
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 807, in apply_foreach
    for item in it:
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 869, in apply_filter
    for item in it:
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 869, in apply_filter
    for item in it:
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 807, in apply_foreach
    for item in it:
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 807, in apply_foreach
    for item in it:
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/util/iter.py", line 815, in apply_foreach
    result = fn(item)
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/execution/train_ops.py", line 318, in __call__
    num_loaded_samples[policy_id] = self.local_worker.policy_map[
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 531, in load_batch_into_buffer
    slices = [slice.to_device(self.devices[i]) for i, slice in enumerate(slices)]
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 531, in <listcomp>
    slices = [slice.to_device(self.devices[i]) for i, slice in enumerate(slices)]
  File "/path/envs/cuda11-0v2/lib/python3.8/site-packages/ray/rllib/policy/sample_batch.py", line 681, in to_device
    self[k] = torch.from_numpy(v).to(device)
RuntimeError: CUDA out of memory. Tried to allocate 9.85 GiB (GPU 0; 39.5 GiB total capacity; 31.33 GiB already allocated; 6.59 GiB free; 31.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
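
To make the last line of the trace easier to read, here is a minimal sketch that pulls the numbers out of the message. `parse_cuda_oom` is a hypothetical helper written for this thread, not part of PyTorch or Ray, and the regex assumes the message layout shown above.

```python
import re

# Hypothetical helper (not a PyTorch or Ray API): extract the GiB figures
# from a CUDA OOM message laid out like the one in the trace above.
_OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+) GiB "
    r"\(GPU (?P<gpu>\d+); (?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; "
    r"(?P<free>[\d.]+) GiB free; "
    r"(?P<reserved>[\d.]+) GiB reserved"
)

def parse_cuda_oom(message):
    """Return the parsed figures as a dict, or None if the layout differs."""
    match = _OOM_PATTERN.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    gpu = int(fields.pop("gpu"))
    result = {name: float(value) for name, value in fields.items()}
    result["gpu"] = gpu
    return result

message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 9.85 GiB "
    "(GPU 0; 39.5 GiB total capacity; 31.33 GiB already allocated; "
    "6.59 GiB free; 31.66 GiB reserved in total by PyTorch)"
)
stats = parse_cuda_oom(message)
shortfall = stats["requested"] - stats["free"]
print(f"requested {stats['requested']} GiB, free {stats['free']} GiB, "
      f"short by {shortfall:.2f} GiB")
```

The failing call is a single 9.85 GiB request against 6.59 GiB free. Reserved memory (31.66 GiB) is only slightly above allocated memory (31.33 GiB), so the fragmentation hint at the end of the message probably does not apply here.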

I also tried setting rollout_fragment_length=1 (and 16, 32, 100, 200), since the error occurs during batch sampling, but it still happens. What is strange is that it fails only after the first iteration has trained successfully (giving me metrics that make sense). Is there a different PPO config I should change, or anything else I should try? I initially tried changing sgd_minibatch_size before I had read the error message properly, but that did not help either.
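
For reference, the three batch-size settings discussed here can be put side by side. This is a sketch with placeholder values, not the configuration from the failing run. The dict keys are RLlib PPO config names, while `rollout_fragments_per_train_batch` is a hypothetical helper. It assumes, as the `load_batch_into_buffer` frame in the trace suggests, that the batch copied to the GPU is the full train batch, so its size would follow `train_batch_size` rather than `rollout_fragment_length`.

```python
import math

# Placeholder values for illustration only; not the poster's actual settings.
config = {
    "framework": "torch",
    "num_gpus": 1,
    "rollout_fragment_length": 200,  # steps collected per fragment
    "train_batch_size": 4000,        # steps concatenated into one train batch
    "sgd_minibatch_size": 128,       # steps per SGD minibatch
}

def rollout_fragments_per_train_batch(cfg):
    """Hypothetical helper: fragments needed to fill one train batch."""
    return math.ceil(cfg["train_batch_size"] / cfg["rollout_fragment_length"])

# Vary only the fragment length, using the values tried in this post.
for fragment_length in (1, 16, 32, 100, 200):
    cfg = dict(config, rollout_fragment_length=fragment_length)
    print(fragment_length,
          rollout_fragments_per_train_batch(cfg),
          cfg["train_batch_size"])
```

Under that assumption, shrinking the fragment length changes how many fragments are collected, not how many steps end up in the train batch.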

Thanks for any help.

I am having the exact same issue. It always fails on the second training iteration due to additional VRAM allocation. The memory allocation seems to be tied to train_batch_size, but the documentation does not make this clear. The only batch memory allocated on the GPU should be that of sgd_minibatch_size, but the documentation does not give a clear answer on how to configure that.

Ideally, we should be able to have a train_batch_size as large as CPU RAM can hold and only work with sgd_minibatch_size on the GPU.
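
To make that idea concrete, here is a minimal sketch of keeping the full batch on the host and transferring one minibatch at a time. It illustrates the pattern only and is not how RLlib 1.12.1 loads batches; `iter_minibatches` and the `to_device` hook are hypothetical names, and the "device" below is a stand-in that just records transfer sizes.

```python
def iter_minibatches(batch, minibatch_size, to_device):
    """Yield minibatches from a host-resident batch, transferring one at a time.

    batch: dict mapping column name -> sliceable sequence (all the same length).
    to_device: callable applied to each column slice (hypothetical hook).
    """
    num_rows = len(next(iter(batch.values())))
    for start in range(0, num_rows, minibatch_size):
        stop = start + minibatch_size
        yield {name: to_device(column[start:stop])
               for name, column in batch.items()}

# Stand-in "device" that only records how many rows were transferred at once.
transferred = []

def fake_to_device(rows):
    transferred.append(len(rows))
    return rows

host_batch = {"obs": list(range(10)), "actions": list(range(10))}
minibatches = list(
    iter_minibatches(host_batch, minibatch_size=4, to_device=fake_to_device)
)
print(len(minibatches), max(transferred))
```

With PyTorch and NumPy columns, `to_device` could be something like `lambda a: torch.from_numpy(a).to(device)`, the same call that appears in the `sample_batch.py` frame of the trace, so at most one minibatch per column would be resident on the GPU at a time.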

Hi @lucasalavapena @Chris_Graham ,

GPU memory allocation should be tied to model size and train_batch_size/sgd_minibatch_size.
I have not experienced this and cannot find a related issue.
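
As a rough way to check the batch-size part of that, here is a back-of-the-envelope sketch. `batch_bytes` is a hypothetical helper and the column layout is made up for illustration; it ignores model weights, activations, and optimizer state, so it only bounds the raw sample data.

```python
import math

GIB = 1024 ** 3

def batch_bytes(num_samples, columns):
    """Hypothetical estimate: bytes for `num_samples` rows of the given columns.

    columns: dict mapping column name -> (per-sample shape, bytes per element).
    """
    per_sample = sum(
        math.prod(shape) * itemsize for shape, itemsize in columns.values()
    )
    return num_samples * per_sample

# Made-up column layout; substitute the real observation and action shapes.
columns = {
    "obs": ((84, 84, 4), 4),      # float32 image-like observation
    "new_obs": ((84, 84, 4), 4),  # next observation, same shape
    "actions": ((), 8),           # one int64 action
    "rewards": ((), 4),           # one float32 reward
}

for train_batch_size in (4000, 40000):
    size_gib = batch_bytes(train_batch_size, columns) / GIB
    print(train_batch_size, round(size_gib, 2))
```

Comparing such an estimate with the 9.85 GiB allocation in the trace may help tell whether the train batch alone can account for the request.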

  1. Is the issue present on our master branch on your side?
  2. Does system memory usage also double if you don’t use your GPU?

Best,
Artur