Questions about fault tolerance in a Ray cluster

jaez · December 15, 2021, 5:54pm

Hello, I have two questions about fault tolerance in a Ray cluster:

1. When a worker node fails, how are the tasks owned by workers on that node rescheduled onto other running nodes?
2. When the head node fails, do worker nodes continue scheduling and running the tasks they own, or do they stop because they can no longer talk to the GCS? If the latter, is there a timeout for how long they wait for the GCS?

Thank you!
