RayOutOfMemoryError: More than 95% of the memory is used

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am new to Ray and am trying to run the open-source code at yuta0821/agent57_pytorch. Ray is initialized as shown below:

ray.init(ignore_reinit_error=True, local_mode=False)

Unfortunately, it throws a RayOutOfMemoryError. The error trace is shown below:

$ python main.py
2022-09-01 18:53:48,493	INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
2022-09-01 18:54:08,569	WARNING worker.py:1123 -- The actor or task with ID ffffffffffffffffe61e76de9d02ba3837f27fdc01000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {0.000000/16.000000 CPU, 18.065582 GiB/18.065582 GiB memory, 3.000000/4.000000 GPU, 9.032791 GiB/9.032791 GiB object_store_memory, 1.000000/1.000000 accelerator_type:G, 1.000000/1.000000 node:192.168.102.112}
. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
====================================================================================================
Traceback (most recent call last):
  File "/home/ravi/agent57_pytorch/main.py", line 264, in <module>
    main(parser.parse_args())
  File "/home/ravi/agent57_pytorch/main.py", line 129, in main
    priorities, segments, pid = ray.get(finished[0])
  File "/home/ravi/anaconda/envs/learning/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/ravi/anaconda/envs/learning/lib/python3.9/site-packages/ray/worker.py", line 1495, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::Agent.sync_weights_and_rollout() (pid=26294, ip=192.168.102.112)
  File "python/ray/_raylet.pyx", line 458, in ray._raylet.execute_task
  File "/home/ravi/anaconda/envs/learning/lib/python3.9/site-packages/ray/_private/memory_monitor.py", line 139, in raise_if_low_memory
    raise RayOutOfMemoryError(
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node dellpc is used (30.35 / 30.97 GB). The top 10 memory consumers are:

PID	MEM	COMMAND
26301	7.62GiB	ray::Learner.update_network()
26293	2.47GiB	ray::Agent.sync_weights_and_rollout()
26304	2.46GiB	ray::Agent.sync_weights_and_rollout()
26296	2.12GiB	ray::Agent
26298	1.73GiB	ray::Agent.sync_weights_and_rollout()
26295	1.71GiB	ray::Agent.sync_weights_and_rollout()
26292	1.7GiB	ray::Agent.sync_weights_and_rollout()
26291	1.64GiB	ray::Agent.sync_weights_and_rollout()
26299	1.34GiB	ray::Agent.sync_weights_and_rollout()
26305	0.93GiB	ray::Agent.sync_weights_and_rollout()

In addition, up to 0.09 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
--- To disable OOM exceptions, set RAY_DISABLE_MEMORY_MONITOR=1.
---
What I have tried so far:

  1. I found workarounds reported on Stack Overflow [1] (I cannot post more than 2 links, sorry!).

    • While debugging, I noticed that line 129 causes the error. Therefore, I added auto_garbage_collect() (please see the above hyperlink; a sketch of this helper is shown right after this list) before and after this line, but with no success!
  2. Setting RAY_DISABLE_MEMORY_MONITOR=1 does not help, either.
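
For completeness, below is a minimal sketch of the kind of auto_garbage_collect() helper described in that Stack Overflow workaround. The psutil dependency and the 80% threshold are assumptions based on that answer, not code taken from the repo:

    import gc

    import psutil

    def auto_garbage_collect(pct=80.0):
        # Force a Python garbage-collection pass whenever system RAM usage
        # exceeds `pct` percent; otherwise do nothing.
        if psutil.virtual_memory().percent >= pct:
            gc.collect()

I called this immediately before and after the ray.get(finished[0]) call on line 129, but the error still occurred.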

Version Information

  • Python 3.9.12
  • Ray 1.4.1
  • Torch 1.9.0+cu102 (torch.cuda.is_available() returns True)
  • GCC 7.5.0
  • Ubuntu 18.04.5 LTS
  • Kernel 5.4.0-124-generic

System Information

The machine has an Intel(R) Xeon(R) W-3223 CPU @ 3.50GHz, 32 GB RAM, and 4 x NVIDIA GeForce RTX 2080 Ti cards. Below are the details:

  • CPU Information
    $ lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              16
    On-line CPU(s) list: 0-15
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    NUMA node(s):        1
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               85
    Model name:          Intel(R) Xeon(R) W-3223 CPU @ 3.50GHz
    Stepping:            7
    CPU MHz:             1200.043
    CPU max MHz:         4200.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            7000.00
    Virtualization:      VT-x
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            1024K
    L3 cache:            16896K
    NUMA node0 CPU(s):   0-15
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
    
  • GPU Information
    $ nvidia-smi 
    Thu Sep  1 19:26:13 2022       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  Off  | 00000000:19:00.0 Off |                  N/A |
    | 27%   34C    P8    18W / 250W |     10MiB / 11019MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA GeForce ...  Off  | 00000000:53:00.0 Off |                  N/A |
    | 27%   33C    P8     8W / 250W |    203MiB / 11011MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA GeForce ...  Off  | 00000000:8D:00.0 Off |                  N/A |
    | 27%   33C    P8    14W / 250W |     10MiB / 11019MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA GeForce ...  Off  | 00000000:C7:00.0 Off |                  N/A |
    | 27%   30C    P8    15W / 250W |     10MiB / 11019MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      4660      G   /usr/lib/xorg/Xorg                  4MiB |
    |    0   N/A  N/A     10346      G   /usr/lib/xorg/Xorg                  4MiB |
    |    1   N/A  N/A      4660      G   /usr/lib/xorg/Xorg                 18MiB |
    |    1   N/A  N/A      7529      G   /usr/bin/gnome-shell               60MiB |
    |    1   N/A  N/A     10346      G   /usr/lib/xorg/Xorg                 95MiB |
    |    1   N/A  N/A     10479      G   /usr/bin/gnome-shell               25MiB |
    |    2   N/A  N/A      4660      G   /usr/lib/xorg/Xorg                  4MiB |
    |    2   N/A  N/A     10346      G   /usr/lib/xorg/Xorg                  4MiB |
    |    3   N/A  N/A      4660      G   /usr/lib/xorg/Xorg                  4MiB |
    |    3   N/A  N/A     10346      G   /usr/lib/xorg/Xorg                  4MiB |
    +-----------------------------------------------------------------------------+
    
  • top command
    $ top
    
    top - 17:46:59 up 35 min,  4 users,  load average: 0.17, 0.27, 0.20
    Tasks: 441 total,   1 running, 328 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.8 us,  0.3 sy,  0.0 ni, 98.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem : 32479080 total, 22107904 free,  2806264 used,  7564912 buff/cache
    KiB Swap: 67108860 total, 67108860 free,        0 used. 29168684 avail Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    12359 ravi      20   0   45372   4224   3364 R  11.1  0.0   0:00.03 top
        1 root      20   0  225508   9180   6592 S   0.0  0.0   0:01.94 systemd
        2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
        3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
        4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_par_gp
        6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H-kb
        9 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
       10 root      20   0       0      0      0 S   0.0  0.0   0:00.01 ksoftirqd/0
       11 root      20   0       0      0      0 I   0.0  0.0   0:00.94 rcu_sched
       12 root      rt   0       0      0      0 S   0.0  0.0   0:00.02 migration/0
       13 root     -51   0       0      0      0 S   0.0  0.0   0:00.00 idle_inject/0
       14 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0
       15 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1
       16 root     -51   0       0      0      0 S   0.0  0.0   0:00.00 idle_inject/1
       17 root      rt   0       0      0      0 S   0.0  0.0   0:00.24 migration/1
       18 root      20   0       0      0      0 S   0.0  0.0   0:00.00 ksoftirqd/1
       20 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/1:0H-kb
    

I am looking for workarounds to fix the RayOutOfMemoryError issue. Any help will be highly appreciated!

Any suggestions, please?

It looks like the memory issue is probably due to having too many Agent actors running in parallel. We’re actively working on this type of problem for v2.1 and 2.2, but for now I think the best thing to try would be to run fewer agents in parallel. There are two ways you can do this:

  1. Pass a smaller num_cpus to ray.init, e.g. ray.init(num_cpus=8), even though you have 16 vCPUs available.
  2. (suggested) Modify your actor definitions to request more CPUs. You can do this by modifying this line. A rough sketch of both options follows this list.
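
To make the two options concrete, here is a minimal sketch of what each change could look like. The Agent class name and the use of the @ray.remote decorator are taken from this thread; the exact placement in the repo's code is an assumption, so treat this as an illustration rather than a verified diff:

    import ray

    # Pick ONE of the two options below, not both.

    # Option 1: tell Ray to schedule on only 8 of the 16 vCPUs, so at most
    # 8 one-CPU actors run at the same time.
    ray.init(ignore_reinit_error=True, local_mode=False, num_cpus=8)

    # Option 2 (suggested): keep the default ray.init() and instead make each
    # actor reserve 2 CPUs, which also caps parallelism at 16 / 2 = 8 actors.
    @ray.remote(num_cpus=2)  # the actors currently request 1 CPU each (per the log above)
    class Agent:
        ...  # the repo's Agent implementation stays unchanged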

Thanks, @Stephanie_Wang, for the suggestions. Both of them work for me. I will share more info on them at the end of this response.

Questions

Let me first ask a few questions:

  1. I understand that you advised lowering num_cpus by using ray.init(num_cpus=8). However, you later suggested increasing the number of CPUs requested per actor. How does requesting more CPUs per actor solve the RayOutOfMemoryError? Can you please explain the intuition behind it?
  2. You mentioned “this type of problem for v2.1 and 2.2…”. I assume you are talking about Ray v2.1 and 2.2. I am sorry, but GitHub shows the latest version is Ray-2.0.0. Nevertheless, I am using Ray v1.4.1. I just wanted to make sure we are on the same page!
  3. Is there any need for the workaround, i.e., auto_garbage_collect?

More Info.

You provided two suggestions. Let’s call them Case 1 and Case 2. Please see below:

  1. Case 1: ray.init(num_cpus=8)
    $ python main.py
    2022-09-08 18:09:56,458 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
    2022-09-08 18:10:17,544 WARNING worker.py:1123 -- The actor or task with ID ffffffffffffffff5de1907879f5513234b9a19601000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {0.000000/8.000000 CPU, 15.432520 GiB/15.432520 GiB memory, 3.000000/4.000000 GPU, 7.716260 GiB/7.716260 GiB object_store_memory, 1.000000/1.000000 node:192.168.102.112, 1.000000/1.000000 accelerator_type:G}
    . In total there are 0 pending tasks and 10 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
    
  2. Case 2: ray.init() with default options, but Agent is decorated with @ray.remote(num_cpus=2)
    $ python main.py
    2022-09-08 18:33:58,303 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
    2022-09-08 18:34:18,357 WARNING worker.py:1123 -- The actor or task with ID ffffffffffffffffed49d9c6b60b711bbdc8c3df01000000 cannot be scheduled right now. It requires {CPU: 2.000000} for placement, but this node only has remaining {0.000000/16.000000 CPU, 15.442795 GiB/15.442795 GiB memory, 3.000000/4.000000 GPU, 7.721397 GiB/7.721397 GiB object_store_memory, 1.000000/1.000000 accelerator_type:G, 1.000000/1.000000 node:192.168.102.112}
    . In total there are 0 pending tasks and 9 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
    

Please see below a combined screenshot of two Ray Dashboards.

In Case 2, I see that some of the workers are sitting idle. Is that fine?

Thank you very much.

Glad it worked out for you! Great questions, let me try to answer each:

  1. I understand that you advised lowering num_cpus by using ray.init(num_cpus=8). However, you later suggested increasing the number of CPUs requested per actor. How does requesting more CPUs per actor solve the RayOutOfMemoryError? Can you please explain the intuition behind it?

The goal of the workaround is to reduce the number of actor workers running at the same time, which reduces the overall heap memory footprint. In future Ray releases we will attempt to do this automatically, but for now you’ll need to adjust the resources yourself to make it happen.

The code that you linked requires 1 CPU per actor, which means that on your hardware Ray will run 16 actors in parallel (Ray autodetects the number of vCPUs). So we have two ways to reduce that to 8 actors in parallel:

  1. Override the number of CPUs that Ray thinks is available to 8.
  2. Require twice as many CPUs per actor by passing num_cpus=2.
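
As a quick sanity check (my own sketch, not something from the original thread), you can print the resources Ray autodetected and work out how many actors fit at once under each per-actor CPU request:

    import ray

    ray.init()

    # Total resources Ray detected on this node, e.g. {'CPU': 16.0, 'GPU': 4.0, ...}.
    print(ray.cluster_resources())

    # With 16 CPUs detected:
    #   num_cpus=1 per actor -> up to 16 actors scheduled in parallel
    #   num_cpus=2 per actor -> up to 16 / 2 = 8 actors scheduled in parallel
    cpus = ray.cluster_resources().get("CPU", 0)
    print("max parallel actors with num_cpus=2:", int(cpus // 2))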
  2. You mentioned “this type of problem for v2.1 and 2.2…”. I assume you are talking about Ray v2.1 and 2.2. I am sorry, but GitHub shows the latest version is Ray-2.0.0. Nevertheless, I am using Ray v1.4.1. I just wanted to make sure we are on the same page!

Yes, the latest release is 2.0; 2.1 and 2.2 will most likely be released sometime in 2022.

  3. Is there any need for the workaround, i.e., auto_garbage_collect?

I think most likely not. The scenario that you linked is specifically for garbage-collecting Ray ObjectRefs. The problem there is that Ray ObjectRefs appear small in terms of their Python heap memory footprint, but that’s because their physical memory is stored somewhere else, in Ray’s shared-memory object store. Usually this problem manifests as a lot of unexpected object spilling (in the latest versions of Ray, you should receive a message on the driver about this). The application will also slow down a lot because it’s not releasing ObjectRefs, so Ray has to spill those objects even though the application is no longer using them. However, it’s very unlikely that Ray would crash in this scenario.
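
To illustrate the ObjectRef case (a toy sketch, not code from the repo): the refs themselves take almost no heap memory, while the data they point to sits in the shared-memory object store until the refs are released.

    import numpy as np
    import ray

    ray.init()

    # Each ObjectRef is only a few bytes on the Python heap, but every array it
    # points to (~8 MB of float64 here) lives in Ray's shared-memory object store.
    refs = [ray.put(np.zeros((1000, 1000))) for _ in range(100)]

    # While `refs` is alive, Ray must keep (or spill) all ~800 MB of arrays.
    # Dropping the references lets the object store reclaim that memory.
    del refs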

The problem in this case is that the heap memory usage of the actors is too high. This usually happens because application-level code requires a lot of memory, not because of ObjectRef usage. This problem usually manifests as some Ray processes crashing, which is what you’re seeing here.
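
If you want to confirm which actor method is responsible, one way (again just a sketch using psutil, not something from the repo) is to log each actor process's resident memory from inside the method:

    import os

    import psutil
    import ray

    ray.init()

    @ray.remote
    class Agent:
        def sync_weights_and_rollout(self):
            # Log this actor process's resident (heap) memory before doing work.
            rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024**3
            print(f"[pid {os.getpid()}] RSS before rollout: {rss_gib:.2f} GiB")
            # ... the repo's actual rollout logic would go here ...

    agent = Agent.remote()
    ray.get(agent.sync_weights_and_rollout.remote())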


Also, yes, it should be okay that you have idle workers in Case 2. What’s happening is that Ray is still autodetecting the number of CPUs as 16, and it will pre-start this many worker processes to improve task scheduling time. Ray does this because the common case is that each task and actor requires 1 CPU. But in this case, since each actor requires 2 CPUs, we only need half of the worker processes.

Thanks @Stephanie_Wang

It makes sense now! Thanks for your wonderful explanation. Loving it.