Task retry and job termination behavior after signals

Trying to understand task fault behavior using the code from
Retry behavior is as expected with os.exit(N) for several values tested, but if sleep is extended and the remote task dies because of a signal the entire job is terminated.

Are causes for premature job termination documented anywhere?


cc @sangcho do you have context on this one?

My context is trying to test fault tolerance for an application submitting a number of tasks to a single remote function, using ray1.7. I naively tried killing an instance of the remote function. A more complete description of what happens:
1 using sigterm, the task is immediate returned to the pending ray.get() with no retry
2 usng sigkill, all running tasks stop immediately and the driver quits
3 killing a raylet caused the contained task to be retried

There is more to the story with regards to GPUs. In scenario 2 the GPUs are left with things still running in them and the ray cluster has to be restarted to clean them up. I should add that this GPU observation has not been carefully reproduced yet.

Can you provide a code example? I might not 100% understand your workload now (especially submitting a number of tasks to a single remote function, is a bit unclear to me).

See first code sample at link referenced beginning of this topic. The only modifications were to the exit RC and extending the sleep to give time to do kills.