Questions about using Ray as distributed multiprocessing.Pool

jaslip · March 3, 2023, 5:15am

How severely does this issue affect your experience of using Ray?

Medium: We are evaluating Ray before using it.

When using Ray as distributed multiprocessing
Suppose one task invocation runs in parallel on several workers on different nodes. While the task is running, I have the following questions:

If the Ray worker process is killed on the node where the task is running,
or if the worker node is removed from the cluster:
Will the task running on that node be redistributed to other nodes and recover?
Will the client program still get the final results, or will it just receive an exception?
If one more Ray worker node is added to the cluster while the task is running, will the task invocation be rescheduled to the newly added node? For example, if the task is running on 3 worker nodes and has not finished, and a new worker node is added, will the task be redistributed to that node?
If not, is there a way to handle these cases manually with Ray’s existing functionality?

jjyao · March 3, 2023, 5:12pm

If the Ray worker process is killed on the node where the task is running,
or if the worker node is removed from the cluster:
Will the task running on that node be redistributed to other nodes and recover?
Will the client program still get the final results, or will it just receive an exception?

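Ray has a built-in fault tolerance mechanism for tasks: by default, a task that fails because of a system failure (such as a killed worker process or a removed node) is retried up to 3 times. See the Task Fault Tolerance page in the Ray 2.3.0 documentation for more details.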

If one more Ray worker node is added to the cluster while the task is running, will the task invocation be rescheduled to the newly added node? For example, if the task is running on 3 worker nodes and has not finished, and a new worker node is added, will the task be redistributed to that node?

An already running task will not be redistributed. You can manually cancel the task with ray.cancel and then re-submit it.

jjyao · March 3, 2023, 5:13pm

Also, I would like to know why you want to redistribute a running task.

jaslip · March 4, 2023, 1:26pm

Because in our scenario, nodes are dynamically allocated and removed by the resource management system. Our own program interacts with it and can handle node addition and removal. We’d like to evaluate Ray as a way to extend our program so that it can run parallel tasks internally, instead of using centralized Python multiprocessing.

jjyao · March 6, 2023, 5:37pm

I think Ray can be a good fit. Even though running tasks cannot be redistributed, if they fail due to node removal they will be automatically retried on other nodes. Also, when new nodes are added, pending tasks will be able to run on them as well.

tarjintor · March 7, 2023, 9:11am

Can the Ray head node also be removed, or only the worker nodes?

jjyao · March 14, 2023, 4:12pm

The Ray head node cannot be removed; otherwise the Ray cluster will fail.

Jules_Damji · March 14, 2023, 4:20pm

@tarjintor Let us know how you are progressing. @jjyao has provided some insights into the questions.
Would you consider this question resolved?
Thanks for your interest in Ray.
