Task retry and job termination behavior after signals

eddie · December 5, 2021, 11:12pm

Trying to understand task fault behavior using the code from
https://docs.ray.io/en/latest/fault-tolerance.html?highlight=fault
Retry behavior is as expected with os._exit(N) for the several values I tested, but if the sleep is extended and the remote task dies because of a signal, the entire job is terminated.

Are causes for premature job termination documented anywhere?

Thanks,
Eddie

Chen_Shen · December 7, 2021, 6:25pm

cc @sangcho do you have context on this one?

eddie · December 7, 2021, 7:54pm

My context: I am trying to test fault tolerance for an application that submits a number of tasks to a single remote function, using Ray 1.7. I naively tried killing an instance of the remote function. A more complete description of what happens:
1. Using SIGTERM, the task is immediately returned to the pending ray.get() with no retry.
2. Using SIGKILL, all running tasks stop immediately and the driver quits.
3. Killing a raylet caused the task it contained to be retried.

There is more to the story with regard to GPUs. In scenario 2, the GPUs are left with things still running on them, and the Ray cluster has to be restarted to clean them up. I should add that this GPU observation has not been carefully reproduced yet.
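
For anyone comparing the exit-code case with the signal cases in the list above: the two look different to a parent process at the OS level, and that can be checked without Ray. Below is a minimal, stdlib-only sketch, assuming a POSIX system. It says nothing about how Ray itself reacts to either case, and the helper names run_child and describe are made up for illustration.

```python
import signal
import subprocess
import sys
import time


def run_child(exit_code=None, kill_signal=None):
    """Start a child Python process and return its return code.

    If exit_code is given, the child calls os._exit(exit_code) itself.
    Otherwise the child sleeps, and the parent sends kill_signal to it.
    """
    if exit_code is not None:
        child_src = f"import os; os._exit({exit_code})"
    else:
        child_src = "import time; time.sleep(60)"
    proc = subprocess.Popen([sys.executable, "-c", child_src])
    if kill_signal is not None:
        time.sleep(0.5)  # give the child a moment to start
        proc.send_signal(kill_signal)
    return proc.wait()


def describe(returncode):
    """Turn a Popen return code into a readable cause of death."""
    if returncode < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"


if __name__ == "__main__":
    print(describe(run_child(exit_code=3)))
    print(describe(run_child(kill_signal=signal.SIGTERM)))
    print(describe(run_child(kill_signal=signal.SIGKILL)))
```

A process that calls os._exit(N) reports a non-negative return code, while one killed by a signal reports the negated signal number, so a supervisor can tell the two apart if it chooses to.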

sangcho · December 10, 2021, 1:00am

Can you provide a code example? I might not fully understand your workload yet (in particular, "submitting a number of tasks to a single remote function" is a bit unclear to me).

eddie · December 10, 2021, 1:29pm

See the first code sample at the link referenced at the beginning of this topic. The only modifications were changing the exit return code and extending the sleep to give me time to do the kills.
