Proc/Meminfo Error Distributed PPO

arka · May 21, 2021, 1:57pm

Hello,

I am getting the following error sometimes when running distributed PPO:

RayTaskError(FileNotFoundError): ray::RolloutWorker.par_iter_next() (pid=8351, ip=10.2.230.146)
  File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
  File "/home/ubuntu/conda/envs/venv/lib/python3.8/site-packages/ray/memory_monitor.py", line 135, in raise_if_low_memory
    used_gb, total_gb = self.get_memory_usage()
  File "/home/ubuntu/conda/envs/venv/lib/python3.8/site-packages/ray/memory_monitor.py", line 106, in get_memory_usage
    psutil_mem = psutil.virtual_memory()
  File "/home/ubuntu/conda/envs/venv/lib/python3.8/site-packages/ray/thirdparty_files/psutil/__init__.py", line 1983, in virtual_memory
    ret = _psplatform.virtual_memory()
  File "/home/ubuntu/conda/envs/venv/lib/python3.8/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 391, in virtual_memory
    with open_binary('%s/meminfo' % get_procfs_path()) as f:
  File "/home/ubuntu/conda/envs/venv/lib/python3.8/site-packages/ray/thirdparty_files/psutil/_common.py", line 713, in open_binary
    return open(fname, "rb", **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/meminfo'

This has happened while the machines still had plenty of memory available, and sometimes it occurs after only a very short run.

I haven’t been able to find any other mention of this issue here except for https://github.com/ray-project/ray/issues/4474.

Thanks

kai · May 21, 2021, 2:01pm

This is most likely a machine/OS problem, as /proc/meminfo should usually be available. Can you share a bit more about your setup? Is this a cloud instance? Are you running in a cluster? Are you using the Ray cluster launcher? Does Ray run in a docker container or another kind of virtualization?
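One way to check whether /proc/meminfo really becomes unavailable intermittently on an affected node is to poll it outside of Ray. The following is only a minimal sketch using the standard library; the function name `probe_meminfo` and its parameters are hypothetical and not part of Ray or psutil:

```python
import time


def probe_meminfo(path="/proc/meminfo", attempts=100, delay_s=0.1):
    """Try to read `path` repeatedly and collect the attempts that failed.

    Returns a list of (attempt_index, exception) tuples. An empty list
    means every read succeeded.
    """
    failures = []
    for i in range(attempts):
        try:
            # Open in binary mode, like the open_binary() call in the traceback.
            with open(path, "rb") as f:
                f.read()
        except OSError as exc:
            failures.append((i, exc))
        time.sleep(delay_s)
    return failures


if __name__ == "__main__":
    failed = probe_meminfo()
    print(f"{len(failed)} failed reads")
    for i, exc in failed:
        print(f"attempt {i}: {exc!r}")
```

If this reports failures when run as the same user on the same node, the problem is likely outside Ray. If it never fails, the issue may only appear under the load or process limits present during training.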

arka · May 21, 2021, 2:13pm

Yes, it's an AWS EC2 setup using m5.16x instances. We are using the Ray cluster launcher, and Ray isn't running in Docker or anything similar.

rliaw · May 21, 2021, 10:37pm

I think a relevant fix for this was reported a couple months ago:

github.com/ray-project/ray

Memory management can not find /proc/meminfo

opened 10:11AM - 25 Mar 19 UTC

closed 11:56AM - 29 Nov 20 UTC

louiskirsch

stale

### System information
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Ubuntu 18.04.1 LTS
- **Ray installed from (source or binary)**: binary master
- **Ray version**: 0.7-dev1 binary
- **Python version**: 3.6
- **Exact command to reproduce**:

### Describe the problem
Memory management crashes. Running `cat /proc/meminfo` as the same user is successful.

### Source code / logs
2019-03-25 10:33:42,391 ERROR worker.py:1717 -- Possible unhandled error from worker: ray_AgentWorker:update() (pid=28713, host=XXX)
  File "/home/louis/lib/python3.6/site-packages/ray/memory_monitor.py", line 73, in raise_if_low_memory
    used_gb = total_gb - psutil.virtual_memory().available / 1e9
  File "/home/louis/lib/python3.6/site-packages/psutil/__init__.py", line 1946, in virtual_memory
    ret = _psplatform.virtual_memory()
  File "/home/louis/lib/python3.6/site-packages/psutil/_pslinux.py", line 397, in virtual_memory
    with open_binary('%s/meminfo' % get_procfs_path()) as f:
  File "/home/louis/lib/python3.6/site-packages/psutil/_common.py", line 582, in open_binary
    return open(fname, "rb", **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/meminfo'

MattLongshot · May 26, 2021, 4:50pm

Are there any potential downsides to setting RAY_DEBUG_DISABLE_MEMORY_MONITOR=1 to avoid this code path entirely?
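For reference, a minimal sketch of how the variable could be placed in the environment of a launched process. It assumes the memory monitor looks for this variable in the environment of the process where it runs; the helper name `env_with_monitor_disabled` and the `train.py` script in the comment are hypothetical:

```python
import os

MONITOR_FLAG = "RAY_DEBUG_DISABLE_MEMORY_MONITOR"


def env_with_monitor_disabled(base_env=None):
    """Return a copy of the environment with the flag set to "1".

    The input mapping (os.environ by default) is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[MONITOR_FLAG] = "1"
    return env


if __name__ == "__main__":
    env = env_with_monitor_disabled()
    # Pass `env` to whatever starts the process, for example:
    #   subprocess.run(["python", "train.py"], env=env)  # hypothetical script
    print(f"{MONITOR_FLAG}={env[MONITOR_FLAG]}")
```

Since the traceback above comes from a RolloutWorker, the variable would presumably need to be present in the worker processes on every node, not only in the driver script.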
