How severely does this issue affect your experience of using Ray?
Medium: We are evaluating Ray before using it.
When using Ray for distributed multiprocessing

Suppose one task invocation runs in parallel across several workers on different nodes. While the task is running, I have the following questions:
If the Ray worker process is killed on the node where the task is running,
or if the worker node is removed from the cluster:
Will the running task on that node be redistributed to other nodes and recover?
Will the client program still get the final results successfully, or will it just receive an exception?
If one more Ray worker node is added to the cluster while the task is running, will the task invocation be rescheduled to the newly added node? For example, if a task is running on 3 worker nodes but has not finished, and a new worker node is added, will the task be redistributed to the new node?
If not, is there a way to handle these cases manually with Ray's existing functionality?
> If the Ray worker process is killed on the node where the task is running,
> or if the worker node is removed from the cluster:
> Will the running task on that node be redistributed to other nodes and recover?
> Will the client program still get the final results successfully, or will it just receive an exception?
Ray has a built-in fault tolerance mechanism for tasks (by default a failed task is retried 3 times). See Task Fault Tolerance — Ray 2.3.0 for more details.
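A minimal sketch of how that looks, using a hypothetical `process_chunk` task (not from this thread): the `max_retries` option on `@ray.remote` controls how many times Ray re-runs a task whose worker process dies.

```python
import ray

ray.init()

# Hypothetical toy task, for illustration only. max_retries controls
# how many times Ray re-runs a task whose worker process dies; the
# default is already 3, shown here explicitly.
@ray.remote(max_retries=3)
def process_chunk(chunk_id: int) -> int:
    return chunk_id * 2  # placeholder for real work

ref = process_chunk.remote(1)
try:
    # ray.get blocks; retries after a worker death happen transparently.
    print(ray.get(ref))
except ray.exceptions.RayError as err:
    # Raised only after all retries are exhausted.
    print(f"Task ultimately failed: {err}")
```

So the client program only sees an exception once the retry budget is exhausted; otherwise it receives the final result as usual.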
> If one more Ray worker node is added to the cluster while the task is running, will the task invocation be rescheduled to the newly added node? For example, if a task is running on 3 worker nodes but has not finished, and a new worker node is added, will the task be redistributed to the new node?
An already running task will not be redistributed. You can manually cancel the task with ray.cancel and re-submit it.
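A sketch of that cancel-and-resubmit pattern, assuming a hypothetical `long_running` task (not from this thread):

```python
import ray

# Hypothetical long-running task used only to illustrate the pattern.
@ray.remote
def long_running(x):
    import time
    time.sleep(3600)
    return x

ref = long_running.remote(1)

# force=True interrupts a task that has already started executing
# (by killing its worker); cancelled tasks are not retried.
ray.cancel(ref, force=True)
try:
    ray.get(ref)
except ray.exceptions.TaskCancelledError:
    pass  # cancelled before it started executing
except ray.exceptions.WorkerCrashedError:
    pass  # force-cancelled while it was executing

# Re-submit: the scheduler is now free to place the new invocation on
# any node, including ones that joined the cluster in the meantime.
new_ref = long_running.remote(1)
```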
In our scenario, nodes are allocated and removed dynamically by a resource management system. Our own program interacts with it and can handle node addition and removal. We'd like to evaluate Ray to extend our program's ability to run tasks in parallel internally, instead of using centralized Python multiprocessing.
I think Ray can be a good fit. Even though running tasks cannot be redistributed, if they fail due to node removal they will be automatically retried on other nodes. Also, when new nodes are added, pending tasks will be able to run on them as well.
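One way to take advantage of this, sketched below with a hypothetical `work` task (my assumption, not something stated in this thread), is to split the workload into many small tasks: failed tasks are retried on surviving nodes, and pending tasks can be scheduled onto nodes that join mid-run.

```python
import ray

ray.init()

# Hypothetical fine-grained task; many small tasks let the scheduler
# spread pending work onto newly added nodes.
@ray.remote(max_retries=3)
def work(item):
    return item * item  # placeholder computation

refs = [work.remote(i) for i in range(1000)]
results = ray.get(refs)  # completes despite node churn, via retries
```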
@tarjintor Let us know how you are progressing. @jjyao has provided some insights into the questions.
Would you consider this question resolved?
Thanks for your interest in Ray.