Detached actors: how to detect "A worker died or was killed"

Medium

Hello,

We have a simulation environment built on Ray detached actors (workers) that consume tasks handed out by a single detached Ray actor (the scheduler). Results are queued back to the program (the submitter) that submitted the job to the scheduler. The submitter can be launched multiple times, which means we lose the Ray-generated logs from the detached actors (there is already a GitHub issue about this).

Some of our code is written in C++, and simulations occasionally die from segfaults. The problem is how to catch these “A worker died or was killed while executing” errors in the submitter so that I can act on them.

Is there any way to get a callback for “A worker died or was killed while” after ray.init()? Maybe I should start polling the restarts of the detached actors, but that does not sound like a good solution.

The submitter does not launch the Ray cluster; separate software manages it.

Any suggestions are appreciated, thanks

I should start polling the restarts of the detached actors, but that does not sound like a good solution.

I think polling is the best solution for now (several Ray libraries, for example Train and Serve, use a similar approach).
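
For what it's worth, here is a minimal sketch of what such a polling loop could look like in the submitter. The helper names (`detect_restarts`, `poll_restarts`, `ray_restart_counts`) are made up for illustration. The Ray-specific part assumes that `ray.util.state.list_actors(detail=True)` in your Ray version returns objects exposing `name` and `num_restarts`; please check that against the version you run. Also note that a restart count only grows for actors created with a non-zero `max_restarts`; otherwise a crashed actor simply ends up dead.

```python
import time
from typing import Callable, Dict, Optional


def detect_restarts(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Return {actor_name: new_restarts} for actors whose restart count grew.

    Actors missing from the previous snapshot are ignored, so the first
    time an actor is seen only establishes its baseline.
    """
    return {
        name: count - previous[name]
        for name, count in current.items()
        if name in previous and count > previous[name]
    }


def poll_restarts(
    fetch_counts: Callable[[], Dict[str, int]],
    on_restart: Callable[[str, int], None],
    interval_s: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll restart counts and call on_restart(name, delta) on every increase."""
    previous = fetch_counts()  # baseline snapshot, no callbacks fired for it
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(interval_s)
        current = fetch_counts()
        for name, delta in detect_restarts(previous, current).items():
            on_restart(name, delta)
        previous = current
        polls += 1


def ray_restart_counts() -> Dict[str, int]:
    """Hypothetical fetcher built on the Ray state API.

    Assumption: list_actors(detail=True) returns entries with `name` and
    `num_restarts` attributes. Verify this for your Ray version.
    """
    from ray.util.state import list_actors  # imported lazily, needs a running cluster

    counts = {}
    for actor in list_actors(detail=True):
        if actor.name:  # only named actors can be matched across polls
            counts[actor.name] = int(actor.num_restarts or 0)
    return counts
```

In the submitter you would run something like `poll_restarts(ray_restart_counts, my_handler)` in a background thread after `ray.init()`, where `my_handler` is whatever reaction you need (resubmitting the job, alerting, and so on). Keeping the fetcher separate from the loop means you can test the detection logic without a cluster.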