Ray on Slurm: shutdown throws errors

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I am using Ray on a new HPC system as a beta tester (so there may be configuration issues). After some trial and error, I am able to successfully launch a cluster, assign CPUs, etc. However, I’d like to shut down the cluster cleanly, so that I can also see error codes returned by my program, etc.

I launch ray start on the head and worker nodes with the --block option. I reserve one core to run all other commands, including the main Python script (which dispatches tasks) and, finally, the ray stop command.
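
For context, the job script roughly follows the standard Ray-on-Slurm pattern. A simplified sketch (resource directives are elided; hostnames, ports, and my_script.py are illustrative placeholders):

  #!/bin/bash
  nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
  nodes_array=($nodes)
  head_node=${nodes_array[0]}

  # Start the head node in blocking mode, in the background
  srun --nodes=1 --ntasks=1 -w "$head_node" \
      ray start --head --port=6379 --block &
  sleep 10

  # Start one blocking worker per remaining node
  for node in "${nodes_array[@]:1}"; do
      srun --nodes=1 --ntasks=1 -w "$node" \
          ray start --address="$head_node:6379" --block &
  done

  # The driver and the shutdown run on the reserved core
  python my_script.py
  ray stop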

My problem is that ray stop (executed on the head node) leads to a large number of errors being reported, both in single-node and multi-node configurations. On the head node:

2022-04-20 13:53:02,964 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:53:02,965 ERR scripts.py:902 -- gcs_server [exit code=0]
2022-04-20 13:53:02,965 ERR scripts.py:902 -- ray_client_server [exit code=15]
2022-04-20 13:53:02,965 ERR scripts.py:902 -- log_monitor [exit code=-15]
2022-04-20 13:53:02,965 ERR scripts.py:910 -- Remaining processes will be killed.

From 3 worker nodes:

2022-04-20 13:54:42,479 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:54:42,479 ERR scripts.py:902 -- raylet [exit code=1]
2022-04-20 13:54:42,479 ERR scripts.py:910 -- Remaining processes will be killed.
2022-04-20 13:54:59,926 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:54:59,926 ERR scripts.py:902 -- raylet [exit code=1]
2022-04-20 13:54:59,926 ERR scripts.py:910 -- Remaining processes will be killed.
2022-04-20 13:55:00,328 ERR scripts.py:898 -- Some Ray subprcesses exited unexpectedly:
2022-04-20 13:55:00,328 ERR scripts.py:902 -- raylet [exit code=1]
2022-04-20 13:55:00,328 ERR scripts.py:910 -- Remaining processes will be killed.

I have now resorted to calling ray start within a script that contains the following construction to suppress errors from being forwarded to Slurm:

  # run the wrapped command; report, but swallow, any non-zero exit status
  eval "${@:2}" || echo "Exit with error code $? (suppressed)"

It’s not very satisfying, since it forces me to suppress all exit codes. Does anyone know what might be causing this?
Thank you!

Starting and stopping Ray on the login node (with a non-blocking ray start command) does not result in such errors.

Does anyone have any suggestions?

@Alex, can you please help address this?

Any thoughts, @tupui?

Sorry, I don’t know right now. So far I have only played with the basics on a SLURM cluster, and I did not see such issues.

OK, thanks for looking. When I have more time, I may run a few more tests, such as starting/stopping within the same script, stopping when the head node was started in non-blocking mode, etc.

Hi all, I just did some more elementary testing by running a head node process and a worker node process on the same machine, in two different terminals. The problem seems to lie with the use of the --block option for ray start. The process that is started in blocking mode exits while throwing errors:

Some Ray subprcesses exited unexpectedly:
  reaper [exit code=-15]
  gcs_server [exit code=0]
  ray_client_server [exit code=15]
  raylet [exit code=0]
  log_monitor [exit code=-15]

Remaining processes will be killed.

This happens both if I start the head process using --block and run ray stop in the other terminal, and vice versa.
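
For concreteness, the single-machine reproduction is essentially:

  # Terminal 1: start a head node in blocking mode
  ray start --head --block

  # Terminal 2: stop the local cluster; the blocked process in
  # terminal 1 then exits with the errors shown above
  ray stop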

Can you do ray stop --force after running ray stop?

That does not help: ray stop has already terminated all processes, so ray stop --force does not do anything. Running ray stop --force directly does interrupt the head node, though, with different exit codes:

Some Ray subprcesses exited unexpectedly:
  reaper [exit code=-9]
  gcs_server [exit code=-9]
  monitor [exit code=-9]
  ray_client_server [exit code=-9]
  dashboard [exit code=-9]
  raylet [exit code=-9]
  log_monitor [exit code=-9]

Remaining processes will be killed.

I see, it sounds like your main frustration is just with the output, and it shouldn’t be affecting anything? I’m not sure of any workaround, but I filed an issue to track this on GitHub: [cli][usability] ray stop prints errors during graceful shutdown (ray-project/ray#25518).

Yes, as far as I can tell it does not affect the calculations. But the errors, besides making the logs a little messy, also pushed me to suppress all exit codes. That’s not ideal (especially for the Ray launcher scripts that I want to share with others), because it increases the likelihood that I suppress useful error messages related to other code issues.
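
In the meantime, a less blunt workaround might be to suppress only the exit status of the Ray commands and let the driver’s exit code propagate to Slurm. A rough, untested sketch (wrapper.sh is a hypothetical name):

  #!/bin/bash
  # wrapper.sh: forward the wrapped command's exit code to Slurm, except
  # for ray commands, whose non-zero exits are reported but suppressed
  "$@"
  rc=$?
  if [ "$1" = "ray" ] && [ "$rc" -ne 0 ]; then
      echo "ray command exited with code $rc (suppressed)"
      exit 0
  fi
  exit "$rc"

It would be invoked as, e.g., wrapper.sh ray stop or wrapper.sh python my_script.py.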

Thanks for filing the issue. I have clarified there that I encounter this only when launching Ray in --block mode.

@yic, can you take a look?

Let me find someone to help with this case. We’ll move the discussion to the GitHub issue.

@simontindemans looks like someone is working on the GitHub issue. I’ll mark this question as resolved, feel free to open a new question if you encounter other issues.

Brilliant! Thank you all for starting the process to resolve this.
