How severely does this issue affect your experience of using Ray?
- Low: It annoys or frustrates me for a moment.
I am using Ray on a new HPC system as a beta tester (so there may be configuration issues). After some trial and error, I am able to successfully launch a cluster, assign CPUs, etc. However, I’d like to shut down the cluster cleanly, so that I can also see error codes returned by my program, etc.
I launch ray start on the head and worker nodes with the option --block. I reserve 1 core to run all other commands, including the main Python script (which dispatches tasks) and finally the ray stop command.
My problem is that ray stop (executed on the head node) leads to a large number of errors being reported, in both single-node and multi-node configurations. On the head node:
2022-04-20 13:53:02,964 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:53:02,965 ERR scripts.py:902 -- gcs_server [exit code=0]
2022-04-20 13:53:02,965 ERR scripts.py:902 -- ray_client_server [exit code=15]
2022-04-20 13:53:02,965 ERR scripts.py:902 -- log_monitor [exit code=-15]
2022-04-20 13:53:02,965 ERR scripts.py:910 -- Remaining processes will be killed.
From 3 worker nodes:
2022-04-20 13:54:42,479 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:54:42,479 ERR scripts.py:902 -- raylet [exit code=1]
2022-04-20 13:54:42,479 ERR scripts.py:910 -- Remaining processes will be killed.
2022-04-20 13:54:59,926 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:54:59,926 ERR scripts.py:902 -- raylet [exit code=1]
2022-04-20 13:54:59,926 ERR scripts.py:910 -- Remaining processes will be killed.
2022-04-20 13:55:00,328 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:55:00,328 ERR scripts.py:902 -- raylet [exit code=1]
2022-04-20 13:55:00,328 ERR scripts.py:910 -- Remaining processes will be killed.
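As a reading aid for the exit codes above, here is a minimal sketch of how I interpret them. It assumes the values printed by scripts.py follow the Python subprocess convention, where a negative code -N means the process was terminated by signal N (so -15 would be SIGTERM). The helper name describe_exit_code is my own and not part of Ray.

```shell
#!/usr/bin/env bash
# describe_exit_code is a hypothetical helper, not part of Ray.
# Assumption: a negative code -N means "terminated by signal N",
# as in Python's subprocess return codes.
describe_exit_code() {
    code=$1
    if [ "$code" -lt 0 ]; then
        # Strip the leading minus sign to recover the signal number.
        echo "terminated by signal ${code#-}"
    elif [ "$code" -eq 0 ]; then
        echo "exited normally"
    else
        echo "exited with status $code"
    fi
}

describe_exit_code -15
describe_exit_code 0
describe_exit_code 1
```

Under that assumption, log_monitor [exit code=-15] would be a process that received SIGTERM, while raylet [exit code=1] is an ordinary non-zero exit.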
I have now resorted to calling ray start within a script that contains the following construction to suppress errors being forwarded to Slurm:
eval "${@:2}" || echo "Exit with error code $? (suppressed)"
It’s not very satisfying, and forces me to block all exit codes. Does anyone know what might be causing this?
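For reference, a slightly narrower variant of the same workaround might look like the sketch below. The function name run_suppressed is hypothetical, and the commented usage (including main.py) is only an assumed layout, not my actual job script. The idea is to suppress the status of the blocking ray start process only, so the main script's own exit code can still reach Slurm. It calls "$@" directly instead of eval, which avoids re-parsing the arguments.

```shell
#!/usr/bin/env bash
# run_suppressed is a hypothetical helper: it runs the given command,
# reports a non-zero exit status, and always returns 0 to the caller.
run_suppressed() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "Exit with error code $status (suppressed)"
    fi
    return 0
}

# Assumed usage (commented out; file names are placeholders):
#   run_suppressed ray start --head --block &
#   python main.py; main_status=$?
#   ray stop
#   exit "$main_status"
```

This does not explain why the subprocesses are reported as exiting unexpectedly; it only limits how much is hidden.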
Thank you!