Restart of raylet


When a raylet crashes, does Ray currently restart it so that the workers on that node can continue executing the computation? That could lead to better cluster utilization and shorter job completion times.


Hi @asm582, the raylet will be restarted if you are using the Ray cluster launcher (see Ray Cluster Overview — Ray v1.4.1), which monitors the health of the Ray cluster.

Hi @simon-mo, thanks. Does this mean the computation continues on the same node, with only the raylet process being restarted, rather than the entire computation being re-run on a different node?

In Ray 1.4, I do not see the raylet restarted after I kill it manually. Can you please confirm?

Hi @asm582, the raylet should be restarted; if you have a reproduction, please post it on GitHub issues. We generally checkpoint your computation to disk in case a raylet or worker crashes.

Computation is retried, so it is not guaranteed to run on the same node that crashed. Also, if your workload uses the plasma store, you need to enable object_reconstruction. For more details about fault tolerance, see Fault Tolerance — Ray v2.0.0.dev0.
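
To make the "retried, but not necessarily on the same node" point concrete, here is a toy model in plain Python. It is only a sketch of the retry semantics described above, not Ray's scheduler: the names run_with_retries, NodeDied, and square_on, and the node labels, are all made up for illustration. In Ray itself, as far as I know, the per-task retry count is set with the max_retries option of @ray.remote.

```python
class NodeDied(Exception):
    """Hypothetical signal that the node running a task went away."""


def run_with_retries(task, nodes, max_retries=3):
    """Run `task` on the first live node, moving to another node on failure.

    Toy model only: a real scheduler picks nodes by resources and load.
    Returns (result, placements), where placements lists every node tried.
    """
    live = list(nodes)
    placements = []
    for _ in range(max_retries + 1):  # first attempt plus max_retries retries
        if not live:
            break
        node = live[0]
        placements.append(node)
        try:
            return task(node), placements
        except NodeDied:
            live.remove(node)  # treat the node as gone; retry elsewhere
    raise RuntimeError(f"task failed; nodes tried: {placements}")


def square_on(node, x=7, crashed=frozenset({"node-a"})):
    """Pretend task: fails if it is placed on a crashed node."""
    if node in crashed:
        raise NodeDied(node)
    return x * x


result, placements = run_with_retries(square_on, ["node-a", "node-b", "node-c"])
print(result, placements)  # 49 ['node-a', 'node-b']
```

The task is first placed on node-a, fails there, and completes on node-b, which is the behavior described above: the work finishes, but on a different node than the one that crashed.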