Detecting whether a detached actor was restarted due to OOM

Medium

Hello,
I’m currently detecting whether detached actors were restarted by checking NumRestarts from ray._private.state.actors.

I would now like to tell whether the actor was killed because it ran out of memory, so that the software can signal to users whether the problem was memory related or a bug (e.g. a segfault).

Is there a simple way to do this during construction of the killed detached actor? I found DeathCause, but when I parsed it, it contained no text:

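  from google.protobuf.json_format import MessageToJson  # plus the usual `import json` and `import ray`
  actor_id = ray.get_runtime_context().get_actor_id()  # ID of the actor this code runs in
  death_cause = ray._private.state.actors(actor_id)["DeathCause"]  # protobuf message
  msg = json.loads(MessageToJson(death_cause))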

Unfortunately, this is not possible with the max_restarts API. You should probably create an issue.

Alternatively, if you want to achieve this now, you can track the health of your actors yourself via ray.get(actor.__ray_ready__.remote()), catch the death cause when the call fails, and restart the actors manually.
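A rough sketch of that manual approach, in case it helps. It assumes the actor is created with max_restarts=0 so that Ray does not restart it on its own. The names `supervise`, `make_actor`, `on_death`, and `classify_failure` are hypothetical helpers, not Ray APIs, and the substrings in `OOM_HINTS` are only a guess at what the error text contains, so check them against the messages your Ray version actually produces.

```python
import time
from typing import Callable

# Assumption: these substrings are illustrative guesses, not an official list.
OOM_HINTS = ("out of memory", "low on memory", "oom-kill")


def classify_failure(message: str) -> str:
    """Return "oom" if the error text looks memory related, else "crash"."""
    text = message.lower()
    return "oom" if any(hint in text for hint in OOM_HINTS) else "crash"


def supervise(
    make_actor: Callable[[], object],
    on_death: Callable[[str], None],
    poll_s: float = 5.0,
    max_restarts: int = 3,
) -> None:
    """Poll an actor, report why it died, and recreate it by hand.

    make_actor is a hypothetical factory that returns a fresh actor handle.
    The loop only ends by re-raising once max_restarts is exhausted.
    """
    # Imported here so classify_failure can be used without Ray installed.
    import ray
    from ray.exceptions import RayActorError

    actor = make_actor()
    restarts = 0
    while True:
        try:
            # Succeeds while the actor is alive; raises once it has died.
            ray.get(actor.__ray_ready__.remote())
        except RayActorError as err:
            on_death(classify_failure(str(err)))
            if restarts >= max_restarts:
                raise
            actor = make_actor()  # manual restart
            restarts += 1
        time.sleep(poll_s)
```

Matching on the error text is fragile, so treat the classification as a hint for the user rather than a reliable diagnosis.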
