Hello, I have two questions about fault tolerance in a Ray cluster:
- When a worker node fails, how do the tasks owned by workers on that node get rescheduled onto other running nodes?
- When the head node fails, do worker nodes continue scheduling and running the tasks they own, or do they stop because they can no longer talk to the GCS? If they stop, is there a timeout that controls how long they wait for the GCS?
Thank you!