How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am new to Ray and am trying to run the open-source code at yuta0821/agent57_pytorch. Ray is initialized as shown below:

```python
ray.init(ignore_reinit_error=True, local_mode=False)
```
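For context, the Ray 1.x `ray.init` call also accepts explicit resource caps; below is a sketch with placeholder values I made up (they are not what the repository uses), since bounding `num_cpus` limits how many actors run concurrently and `object_store_memory` bounds the shared object store:

```python
import ray

# Placeholder caps for illustration only, not agent57_pytorch's settings.
ray.init(
    ignore_reinit_error=True,
    local_mode=False,
    num_cpus=8,                       # fewer CPU slots => fewer concurrent actors
    object_store_memory=4 * 1024**3,  # cap the plasma object store at 4 GiB
)
```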
Unfortunately, running `main.py` throws a `RayOutOfMemoryError`. The error trace is shown below:
```
$ python main.py
2022-09-01 18:53:48,493 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
2022-09-01 18:54:08,569 WARNING worker.py:1123 -- The actor or task with ID ffffffffffffffffe61e76de9d02ba3837f27fdc01000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {0.000000/16.000000 CPU, 18.065582 GiB/18.065582 GiB memory, 3.000000/4.000000 GPU, 9.032791 GiB/9.032791 GiB object_store_memory, 1.000000/1.000000 accelerator_type:G, 1.000000/1.000000 node:192.168.102.112}. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
====================================================================================================
Traceback (most recent call last):
  File "/home/ravi/agent57_pytorch/main.py", line 264, in <module>
    main(parser.parse_args())
  File "/home/ravi/agent57_pytorch/main.py", line 129, in main
    priorities, segments, pid = ray.get(finished[0])
  File "/home/ravi/anaconda/envs/learning/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/ravi/anaconda/envs/learning/lib/python3.9/site-packages/ray/worker.py", line 1495, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::Agent.sync_weights_and_rollout() (pid=26294, ip=192.168.102.112)
  File "python/ray/_raylet.pyx", line 458, in ray._raylet.execute_task
  File "/home/ravi/anaconda/envs/learning/lib/python3.9/site-packages/ray/_private/memory_monitor.py", line 139, in raise_if_low_memory
    raise RayOutOfMemoryError(
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node dellpc is used (30.35 / 30.97 GB). The top 10 memory consumers are:

PID     MEM      COMMAND
26301   7.62GiB  ray::Learner.update_network()
26293   2.47GiB  ray::Agent.sync_weights_and_rollout()
26304   2.46GiB  ray::Agent.sync_weights_and_rollout()
26296   2.12GiB  ray::Agent
26298   1.73GiB  ray::Agent.sync_weights_and_rollout()
26295   1.71GiB  ray::Agent.sync_weights_and_rollout()
26292   1.7GiB   ray::Agent.sync_weights_and_rollout()
26291   1.64GiB  ray::Agent.sync_weights_and_rollout()
26299   1.34GiB  ray::Agent.sync_weights_and_rollout()
26305   0.93GiB  ray::Agent.sync_weights_and_rollout()

In addition, up to 0.09 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
--- To disable OOM exceptions, set RAY_DISABLE_MEMORY_MONITOR=1.
---
```
I found workarounds reported on Stack Overflow [1] (I cannot post more than 2 links, sorry!):

- While debugging, I noticed that line 129 causes the error. Therefore, I added `auto_garbage_collect()` (please see the above hyperlink) before and after this line, but with no success! A sketch of that helper is shown after this list.
- Setting `RAY_DISABLE_MEMORY_MONITOR=1` does not help, either.
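For reference, a minimal sketch of the helper, assuming it matches the Stack Overflow answer I followed (the 80% threshold is the value I used; adjust as needed):

```python
import gc

import psutil


def auto_garbage_collect(pct: float = 80.0) -> None:
    """Force a garbage-collection pass when system memory usage is high.

    Calls gc.collect() whenever overall RAM usage exceeds `pct` percent,
    as suggested in the Stack Overflow answer linked above.
    """
    if psutil.virtual_memory().percent >= pct:
        gc.collect()
```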
Version Information
- Python 3.9.12
- Ray 1.4.1
- Torch 1.9.0+cu102 (`torch.cuda.is_available()` returns `True`)
- GCC 7.5.0
- Ubuntu 18.04.5 LTS
- Kernel 5.4.0-124-generic
System Information
It is an Intel(R) Xeon(R) W-3223 CPU @ 3.50GHz machine with 32 GB RAM, powered by 4 x NVIDIA GeForce RTX 2080 Ti cards. Below are the details:
- CPU Information

```
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) W-3223 CPU @ 3.50GHz
Stepping:            7
CPU MHz:             1200.043
CPU max MHz:         4200.0000
CPU min MHz:         1200.0000
BogoMIPS:            7000.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            16896K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
```
- GPU Information

```
$ nvidia-smi
Thu Sep  1 19:26:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:19:00.0 Off |                  N/A |
| 27%   34C    P8    18W / 250W |     10MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:53:00.0 Off |                  N/A |
| 27%   33C    P8     8W / 250W |    203MiB / 11011MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:8D:00.0 Off |                  N/A |
| 27%   33C    P8    14W / 250W |     10MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:C7:00.0 Off |                  N/A |
| 27%   30C    P8    15W / 250W |     10MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4660      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A     10346      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      4660      G   /usr/lib/xorg/Xorg                 18MiB |
|    1   N/A  N/A      7529      G   /usr/bin/gnome-shell               60MiB |
|    1   N/A  N/A     10346      G   /usr/lib/xorg/Xorg                 95MiB |
|    1   N/A  N/A     10479      G   /usr/bin/gnome-shell               25MiB |
|    2   N/A  N/A      4660      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A     10346      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      4660      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A     10346      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
```
- `top` output

```
$ top
top - 17:46:59 up 35 min,  4 users,  load average: 0.17, 0.27, 0.20
Tasks: 441 total,   1 running, 328 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us,  0.3 sy,  0.0 ni, 98.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32479080 total, 22107904 free,  2806264 used,  7564912 buff/cache
KiB Swap: 67108860 total, 67108860 free,        0 used. 29168684 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12359 ravi      20   0   45372   4224   3364 R  11.1  0.0   0:00.03 top
    1 root      20   0  225508   9180   6592 S   0.0  0.0   0:01.94 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_par_gp
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H-kb
    9 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
   10 root      20   0       0      0      0 S   0.0  0.0   0:00.01 ksoftirqd/0
   11 root      20   0       0      0      0 I   0.0  0.0   0:00.94 rcu_sched
   12 root      rt   0       0      0      0 S   0.0  0.0   0:00.02 migration/0
   13 root     -51   0       0      0      0 S   0.0  0.0   0:00.00 idle_inject/0
   14 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0
   15 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1
   16 root     -51   0       0      0      0 S   0.0  0.0   0:00.00 idle_inject/1
   17 root      rt   0       0      0      0 S   0.0  0.0   0:00.24 migration/1
   18 root      20   0       0      0      0 S   0.0  0.0   0:00.00 ksoftirqd/1
   20 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/1:0H-kb
```
I am looking for workarounds to fix this `RayOutOfMemoryError`. Any help will be highly appreciated!
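In the meantime, to narrow down which actor grows, I am thinking of having each actor report its own resident memory. A minimal, self-contained sketch (the `MemoryProbe` actor below is a toy stand-in I wrote for illustration, not part of agent57_pytorch):

```python
import os

import psutil
import ray

ray.init(ignore_reinit_error=True)


@ray.remote
class MemoryProbe:
    """Toy actor that reports the resident set size (RSS) of its own process."""

    def rss_gib(self) -> float:
        # RSS of this Ray worker process, in GiB.
        return psutil.Process(os.getpid()).memory_info().rss / 1024**3


probe = MemoryProbe.remote()
print(f"actor RSS: {ray.get(probe.rss_gib.remote()):.2f} GiB")
```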