Multi-node Ray cluster with Docker container and Slurm

Hi all,

I am deploying a multi-node Ray cluster with Slurm on our HPC system, inside a container. The code runs but occasionally errors out with the message shown below. I have confirmed on the HPC that the container has access to the /proc directory, and psutil works fine when I check interactively. I also reverted to an older version of psutil (5.6.0), to no avail. This is incredibly frustrating because the failure seems random and I can't pinpoint when it occurs.
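For anyone who wants to reproduce the /proc check, here is a minimal sketch of the kind of probe I mean. The helper name probe_procfs is hypothetical (it is not part of psutil or Ray), and it only uses the standard library. It opens <procfs_path>/stat in binary mode, which mirrors the open() call that fails in the traceback below, and records the hostname so results from different nodes can be told apart.

```python
import os
import socket


def probe_procfs(procfs_path="/proc"):
    """Report whether <procfs_path>/stat can be opened on this host.

    Hypothetical diagnostic helper; not part of psutil or Ray.
    """
    stat_path = os.path.join(procfs_path, "stat")
    result = {
        "host": socket.gethostname(),
        "pid": os.getpid(),
        "stat_path": stat_path,
        "exists": os.path.exists(stat_path),
        "readable": False,
        "error": None,
    }
    try:
        # Open in binary mode, like the failing open(fname, "rb") call.
        with open(stat_path, "rb") as f:
            f.readline()
        result["readable"] = True
    except OSError as exc:
        # Keep the exception text so a sporadic failure can be logged.
        result["error"] = repr(exc)
    return result


if __name__ == "__main__":
    print(probe_procfs())
```

Running this once per node inside the container (for example through srun), or calling it at the start of each parallel task, might show whether /proc/stat is missing only on some nodes or only at some moments. That is a guess at a useful diagnostic, not a confirmed fix.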

The structure of the code:

  • Loops over a set of cases.
  • Parallelizes each case across all CPUs on all nodes.
  • Saves an output file for each case.
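To make that structure concrete, here is a simplified stand-in, not my actual code. It uses concurrent.futures from the standard library where the real code uses Ray to spread work across nodes, so that the sketch runs anywhere. The names process_item, run_case, and run_all, the JSON output format, and the file naming are all assumptions made for illustration.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def process_item(case_id, item):
    # Placeholder for the real per-item computation.
    return case_id * item


def run_case(case_id, items, out_dir, pool):
    # Fan the work for one case out across the pool, then gather the results.
    results = list(pool.map(lambda item: process_item(case_id, item), items))
    # Save one output file per case.
    out_path = os.path.join(out_dir, f"case_{case_id}.json")
    with open(out_path, "w") as f:
        json.dump(results, f)
    return out_path


def run_all(case_ids, items, out_dir, max_workers=4):
    # Cases run one after another; only the work inside a case is parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [run_case(c, items, out_dir, pool) for c in case_ids]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_all([1, 2, 3], range(5), tmp))
```

The point of the sketch is the ordering: each case finishes and writes its file before the next one starts, which is why the outputs of cases completed before the error are intact.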

The errors below happen sporadically: sometimes after a few cases are complete, sometimes in the middle of a case, and sometimes not at all. All cases completed before the error have outputs that look fine.

Thanks in advance and I can give more info as needed.

ERROR LOG:

(raylet) During handling of the above exception, another exception occurred:
(raylet)
(raylet) Traceback (most recent call last):
(raylet)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_pslinux.py", line 318, in
(raylet)     set_scputimes_ntuple("/proc")
(raylet)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_common.py", line 299, in wrapper
(raylet)     ret = cache[key] = fun(*args, **kwargs)
(raylet)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_pslinux.py", line 285, in set_scputimes_ntuple
(raylet)     with open_binary('%s/stat' % procfs_path) as f:
(raylet)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_common.py", line 586, in open_binary
(raylet)     return open(fname, "rb", **kwargs)
(raylet) FileNotFoundError: [Errno 2] No such file or directory: '/proc/stat'
(raylet) /src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via pip install 'ray[default]'. Please update your install command.
(raylet)   "update your install command.", FutureWarning)
(raylet, ip=10.32.211.114) Traceback (most recent call last):
(raylet, ip=10.32.211.114)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_common.py", line 297, in wrapper
(raylet, ip=10.32.211.114)     return cache[key]
(raylet, ip=10.32.211.114) KeyError: (('/proc',), frozenset())
(raylet, ip=10.32.211.114)
(raylet, ip=10.32.211.114) During handling of the above exception, another exception occurred:
(raylet, ip=10.32.211.114)
(raylet, ip=10.32.211.114) Traceback (most recent call last):
(raylet, ip=10.32.211.114)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_pslinux.py", line 318, in
(raylet, ip=10.32.211.114)     set_scputimes_ntuple("/proc")
(raylet, ip=10.32.211.114)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_common.py", line 299, in wrapper
(raylet, ip=10.32.211.114)     ret = cache[key] = fun(*args, **kwargs)
(raylet, ip=10.32.211.114)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_pslinux.py", line 285, in set_scputimes_ntuple
(raylet, ip=10.32.211.114)     with open_binary('%s/stat' % procfs_path) as f:
(raylet, ip=10.32.211.114)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_common.py", line 586, in open_binary
(raylet, ip=10.32.211.114)     return open(fname, "rb", **kwargs)
(raylet, ip=10.32.211.114) FileNotFoundError: [Errno 2] No such file or directory: '/proc/stat'
(raylet, ip=10.32.211.114) /src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via pip install 'ray[default]'. Please update your install command.
(raylet, ip=10.32.211.114)   "update your install command.", FutureWarning)
(raylet) Traceback (most recent call last):
(raylet)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_common.py", line 297, in wrapper
(raylet)     return cache[key]
(raylet) KeyError: (('/proc',), frozenset())
(raylet)
(raylet) During handling of the above exception, another exception occurred:
(raylet)
(raylet) Traceback (most recent call last):
(raylet)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_pslinux.py", line 318, in
(raylet)     set_scputimes_ntuple("/proc")
(raylet)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_common.py", line 299, in wrapper
(raylet)     ret = cache[key] = fun(*args, **kwargs)
(raylet)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_pslinux.py", line 285, in set_scputimes_ntuple
(raylet)     with open_binary('%s/stat' % procfs_path) as f:
(raylet)   File "/src/vas_monorepo/container/venv/sx_rf_interference/lib/python3.7/site-packages/psutil/_common.py", line 586, in open_binary
(raylet)     return open(fname, "rb", **kwargs)