Ray on Slurm: shutdown throws errors

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I am using Ray on a new HPC system as a beta tester (so there may be configuration issues). After some trial and error, I can successfully launch a cluster, assign CPUs, and so on. However, I would like to shut the cluster down cleanly, so that I can also see the exit codes returned by my program.

I launch ray start on the head and worker nodes with the --block option. I reserve one core to run all other commands, including the main Python script (which dispatches the tasks) and, finally, the ray stop command.
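For reference, my batch script is roughly along the lines of the sketch below; the node counts, port, CPU numbers, and driver script name are placeholders, not my exact setup.

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16

# Placeholder values throughout; the real script differs in the details.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
port=6379

# Head node: one core is held back for the driver script and for ray stop.
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --port=$port --num-cpus=15 --block &

# Worker nodes connect to the head, also with --block.
srun --nodes=3 --ntasks=3 --exclude="$head_node" \
    ray start --address="$head_node:$port" --num-cpus=16 --block &

sleep 10                      # crude wait for the cluster to come up
python my_driver.py           # dispatches the Ray tasks
ray stop                      # this is where the errors below appear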

My problem is that ray stop (executed on the head node) leads to a large number of errors being reported, in both single-node and multi-node configurations. On the head node:

2022-04-20 13:53:02,964 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:53:02,965 ERR scripts.py:902 -- gcs_server [exit code=0]
2022-04-20 13:53:02,965 ERR scripts.py:902 -- ray_client_server [exit code=15]
2022-04-20 13:53:02,965 ERR scripts.py:902 -- log_monitor [exit code=-15]
2022-04-20 13:53:02,965 ERR scripts.py:910 -- Remaining processes will be killed.

From 3 worker nodes:

2022-04-20 13:54:42,479 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:54:42,479 ERR scripts.py:902 -- raylet [exit code=1]
2022-04-20 13:54:42,479 ERR scripts.py:910 -- Remaining processes will be killed.
2022-04-20 13:54:59,926 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:54:59,926 ERR scripts.py:902 -- raylet [exit code=1]
2022-04-20 13:54:59,926 ERR scripts.py:910 -- Remaining processes will be killed.
2022-04-20 13:55:00,328 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:55:00,328 ERR scripts.py:902 -- raylet [exit code=1]
2022-04-20 13:55:00,328 ERR scripts.py:910 -- Remaining processes will be killed.

I have now resorted to calling ray start from a wrapper script that contains the following construction, so that the errors are not forwarded to Slurm:

eval "${@:2}" || echo "Exit with error code $? (suppressed)"

It is not very satisfying, and it forces me to mask all exit codes; what I would really like is to suppress only the shutdown noise while preserving my driver's exit code, roughly as in the sketch below. Does anyone know what might be causing this?
Thank you!
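A rough, untested sketch of what I mean (the driver script name is the same placeholder as above):

#!/bin/bash
# Run the driver and remember its exit code.
python my_driver.py
driver_rc=$?

# Shut the cluster down; log, but do not propagate, any error from ray stop.
ray stop || echo "ray stop exited with code $? (ignored)"

# Slurm sees the driver's status, not the shutdown noise.
exit $driver_rc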

Starting and stopping ray on the login node (with a non-blocking ray start command) does not result in such errors.
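For comparison, the login-node sequence that exits cleanly is simply (same placeholders as above):

ray start --head --port=6379
python my_driver.py
ray stop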

Does anyone have any suggestions?

@Alex, can you please help address this?

Any thoughts, @tupui?

Sorry, I don't know right now. So far I have only played with the basics on a SLURM cluster, and I did not see such issues.

OK, thanks for looking. When I have more time, I may run a few more tests, such as starting and stopping within the same script, stopping when the head node is started without --block, etc.