Ray worker dies with SYSTEM_ERROR_EXIT

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi, I am running Ray on a Slurm cluster where I start a lot of tasks and in turn cancel them quite often. A task takes on average about 20 seconds to finish, and I might cancel a task only a few seconds after starting it. I catch the ray.cancel inside the task as a KeyboardInterrupt and then do some cleanup.
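For illustration, here is a stripped-down sketch of that pattern; the task body, timings, and return values are made up and not from my real code:

```python
import time
import ray

ray.init()

# With the default force=False, ray.cancel() raises a KeyboardInterrupt inside a
# task that is already executing, which the task body can catch for cleanup.
@ray.remote
def long_task():
    try:
        for _ in range(20):
            time.sleep(1)              # ~20 s of work
        return "finished"
    except KeyboardInterrupt:
        # cleanup would go here; returning normally makes the result visible
        # to ray.wait()/ray.get() in the driver
        return "cancelled"

ref = long_task.remote()
time.sleep(2)                          # let the task start
ray.cancel(ref)                        # default force=False -> KeyboardInterrupt in the task

try:
    print(ray.get(ref))                # "cancelled" if the interrupt was delivered
except ray.exceptions.TaskCancelledError:
    # raised instead if the cancel landed before the task started executing
    print("task cancelled before it started")
```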

My issue is that sometimes, when I cancel a task, the cancellation does not seem to succeed. In particular, the KeyboardInterrupt is not received by the task. It seems like the worker the task is running on just dies silently. I can find the following output in the GCS log:

[2022-06-27 18:19:50,327 W 139409 139409] (gcs_server) gcs_worker_manager.cc:43: Reporting worker exit, worker id = 9e39c872f30f95b8c1083d1f78f21ef366bb09034f1369544492c934, node id = 34453e0c12fe101f86f3c7cf33a6f02d83182c6da8c8ea2be7d59a9f, address = 10.10.70.122, exit_type = SYSTEM_ERROR_EXIT. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2022-06-27 18:19:50,327 W 139409 139409] (gcs_server) gcs_actor_manager.cc:848: Worker 9e39c872f30f95b8c1083d1f78f21ef366bb09034f1369544492c934 on node 34453e0c12fe101f86f3c7cf33a6f02d83182c6da8c8ea2be7d59a9f exits, type=SYSTEM_ERROR_EXIT, has creation_task_exception = 0

In addition, the worker logs show no more activity after the cancel.

Is there a way to at least catch these inactive workers at runtime? It would be fine if I had to restart the task for this…

Thank you!

Hi @elo, depending on the state of the task, there might be a race condition between the cancel signal and the worker termination signal. Also, I'm not sure I fully understand your way of canceling tasks:

I catch the ray.cancel inside the task as a KeyboardInterrupt and then do some cleanup.

Do you have a simple repro script for the issues you are encountering?

Hi, thank you for your answer. I am not sure about the race condition; I send the cancel signal well before the task would be done. What may happen is: in the task that gets cancelled, I call a remote method on an actor to send data over. Could receiving the cancel signal while that remote call is in flight lead to this issue?
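To make the scenario concrete, here is a hypothetical sketch of what I mean; Collector, producer, and the timings are placeholders, not my real code:

```python
import time
import ray

ray.init()

# Sketch of the scenario in question: the task is blocked on a call to an actor
# when the KeyboardInterrupt from ray.cancel() arrives.
@ray.remote
class Collector:
    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

@ray.remote
def producer(collector):
    try:
        for item in range(100):
            # Whether the interrupt is delivered cleanly while this ray.get()
            # on the actor call is blocking is exactly the open question here.
            ray.get(collector.add.remote(item))
            time.sleep(0.2)
        return "done"
    except KeyboardInterrupt:
        return "cancelled while sending data"

sink = Collector.remote()
ref = producer.remote(sink)
time.sleep(1)                          # let the producer get going
ray.cancel(ref)                        # may land while producer blocks on the actor call
print(ray.get(ref))                    # result retrieval works as in the first sketch
```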

The way I am canceling tasks is: I have a monitor function running as a Ray actor that periodically checks some conditions. If these are fulfilled, it sends a cancel signal through ray.cancel to the task. In the task itself, I catch the signal as a KeyboardInterrupt, do some processing and then return. The return is important since, in a main loop, I am using ray.wait to get the tasks' results and decide which tasks to run next. I hope that makes sense.
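A rough, simplified sketch of this setup; all names are placeholders, and the monitor's real condition check is replaced by a fixed delay:

```python
import time
import ray

ray.init()

@ray.remote
class Monitor:
    def watch(self, refs, delay_s=2.0):
        # placeholder for "periodically check some conditions"
        time.sleep(delay_s)
        # the ObjectRef arrives inside a list so Ray hands over the reference
        # itself instead of waiting for the task's result
        ray.cancel(refs[0])            # the task sees this as a KeyboardInterrupt

@ray.remote
def work():
    try:
        for _ in range(20):
            time.sleep(1)              # ~20 s of work
        return "finished"
    except KeyboardInterrupt:
        return "cancelled"             # returning lets ray.wait() pick up a result

ref = work.remote()
monitor = Monitor.remote()
monitor.watch.remote([ref])            # note the list around the ref (see comment above)

ready, _ = ray.wait([ref])             # main loop: decide what to run next from `ready`
print(ray.get(ready[0]))
```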

Unfortunately, I do not have a simple repro script since, as described above, the setup is rather complex. I will try to put one together, but I am not sure that will work, since the error only occurs very rarely on the cluster.

Is there any way I could check for the dead worker within the script and then just restart the task?

Thank you

If you are using non-actor tasks, you can rely on Ray’s internal task retries. It should actually restart the task automatically if the original worker dies. You can find more information about this feature here in the docs.
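For example, something along these lines; this is a sketch, and I'm assuming ray.exceptions.WorkerCrashedError is what surfaces once the retries are exhausted:

```python
import ray

ray.init()

# max_retries (3 is already the default for non-actor tasks) makes Ray resubmit
# the task if the worker it ran on dies with a system error. Application-level
# exceptions are only retried if retry_exceptions=True is also set.
@ray.remote(max_retries=3)
def flaky_task(x):
    # real work goes here
    return x * 2

ref = flaky_task.remote(21)
try:
    result = ray.get(ref)
except ray.exceptions.WorkerCrashedError:
    # Assumption: surfaced once the retry budget is exhausted and the worker
    # still died; at this point the task could be resubmitted manually.
    result = ray.get(flaky_task.remote(21))
print(result)
```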