Handling Exceptions from list of tasks using ray.get

Javier_Bosch · June 16, 2021, 10:32pm

What is the correct way of handling an exception from a list of tasks executed with ray.get?

import ray

ray.init()


@ray.remote(max_retries=5)
def f(i):
    try:
        # save() is my own function that writes dataset i to disk
        save(i)
    except Exception:
        raise Exception

x = ray.get([f.remote(i) for i in range(20)])

Here, any one of the f tasks could raise an exception, but I would like the other tasks to complete.

A use case for this would be saving data to disk. If saving a dataset fails in one of the task workers, then a solution would be to retry the function.

What would be the best way to handle this problem? Also, I am not referring to a WorkerCrashedError, although I guess a raised Exception can be thought of as the task inevitably crashing.

I have tried:

from ray.exceptions import RayError

try:
    x = ray.get([f.remote(i) for i in range(20)])
except RayError as e:
    print(e.pid)

However, I only ever get back the pid of one of the tasks that fails, not multiple pids, when more than one task raises an Exception.

sangcho · June 16, 2021, 11:33pm

Have you tried using ray.wait? E.g.,

# Each ray.wait call returns at most `num_returns` refs in `ready`
unready = [f.remote(i) for i in range(20)]
while unready:
    ready, unready = ray.wait(unready, num_returns=1)
    try:
        ray.get(ready)
    except Exception as e:
        # do whatever you want with the failed task here
        pass

Javier_Bosch · June 16, 2021, 11:50pm

I am not sure this would solve the problem.

If an exception occurs, would the task be assigned to unready? If so, this would trigger an infinite loop.

Otherwise, I am unsure how to gracefully catch errors from one or more tasks in a call to ray.get(list_of_tasks).

sangcho · June 17, 2021, 12:04am

If an exception occurs, would the task be assigned to unready? If so, this would trigger an infinite loop.

No. If the object ref contains an exception, it is considered ready. (It is ready, but its content is the error; that is an implementation detail.)

Otherwise, I am unsure how to gracefully catch errors from one or more tasks in a call to ray.get(list_of_tasks).

As I showed in the code block in my earlier reply, setting num_returns=1 ensures that ready always has length 1, so you can handle each object ref individually this way.

Javier_Bosch · June 17, 2021, 12:16am

Wouldn’t this approach defeat the benefits of parallelism?

sangcho · June 17, 2021, 12:29am

ray.wait returns the first future that becomes ready, and ray.get is guaranteed to return right away because ray.wait has already made the object available locally. It is a standard pattern in Ray. Also note that num_returns is 1 by default (so this is no different from a regular ray.wait call). Which part do you think defeats the benefit of parallelism?

Javier_Bosch · June 17, 2021, 6:18am

I was thinking num_returns=1

sangcho · June 17, 2021, 6:32am

num_returns is 1 by default, so this works the same way as a regular ray.wait call. Note that all tasks are already running in parallel in the background; ray.wait just brings 1 “ready” object to the local node.
