How to find if an ObjectRef failed without an expensive ray.get() call?

ray_user_1234 · June 8, 2021, 8:44pm

Hi all,

I have a question about checking ObjectRefs for failure before calling ray.get(). The use case is that I may have many outstanding ObjectRefs, and if any of them fail I want to exit my program immediately. However, these ObjectRefs are intermediate values, and I’d rather not call ray.get() on the head node, because the contents are pretty hefty. My understanding is that ray.get() would incur the I/O cost of moving those values to the head node. Ideally I would be able to determine whether any failed and, if not, invoke more Ray remote functions to consume the intermediate ObjectRefs.

It looks like maybe in the past you could use ray.error_info?

Maybe I could call ObjectRef.as_future and leverage the exception check on the future (Futures — Python 3.9.5 documentation)? Or is the exception only present after an expensive ray.get() call?
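
For reference, here is a minimal sketch of the exception check I mean, using a plain asyncio future from the standard library. It only shows how Future.exception() behaves on its own. It does not show whether a future obtained from an ObjectRef carries the error without the value being fetched first, which is exactly what I'm unsure about. The demo function and the futures in it are made up for illustration.

```python
import asyncio


async def demo():
    loop = asyncio.get_running_loop()

    # A future that finished with an error.
    failed = loop.create_future()
    failed.set_exception(ValueError("boom"))

    # A future that finished normally.
    ok = loop.create_future()
    ok.set_result(42)

    # A future that has not finished yet.
    pending = loop.create_future()

    results = {
        # exception() returns the stored exception instead of raising it.
        "failed": failed.exception(),
        # For a successful future, exception() returns None.
        "ok": ok.exception(),
    }
    try:
        # On an unfinished future, exception() raises InvalidStateError.
        pending.exception()
    except asyncio.InvalidStateError:
        results["pending"] = "not done yet"
    pending.cancel()
    return results


results = asyncio.run(demo())
print(results)
```

So the check itself never returns the result value, but it does require the future to be done.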

Tomas · June 17, 2021, 4:59am

Hi,

I have a similar problem. I am trying to invoke a sequence of ray.remote calls and only issue the next set if all the previous calls succeeded. I use finished = ray.wait(..., fetch_local=False) to wait for completion but then I’d like to check (in an inexpensive way) if all the elements of finished ran without an exception.
How can I go about doing that with ray?
Thanks,

Tom

shiranbi · September 14, 2022, 5:31pm

Hi
I have the same issue Tomas mentioned. Can anyone help us out?
Thanks

Stephanie_Wang · September 14, 2022, 11:14pm

Unfortunately there’s no built-in API to do this, but I think the ray.wait method is on the right track. To avoid transferring the data back to the driver, you could try submitting a second round of no-op tasks and make sure that these complete, like so:

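import ray

@ray.remote
def noop(x):
    # Ray resolves x before running this task, so it fails if the upstream task failed.
    return None

# expensive_task stands in for your own @ray.remote function.
xs = [expensive_task.remote() for _ in range(10)]
# ray.wait returns a single ref by default, so ask for all of them.
finished, _ = ray.wait(xs, num_returns=len(xs), fetch_local=False)
for x in finished:
    try:
        # Only the no-op's tiny return value is fetched to the driver.
        ray.get(noop.remote(x))
    except ray.exceptions.RayTaskError:
        print("Task", x, "failed")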

shiranbi · September 15, 2022, 3:11am

@Stephanie_Wang Would it make sense to request an extension to the ray.wait method, or does that go completely against its design?

Stephanie_Wang · September 19, 2022, 6:22pm

It would probably be better to add a new API call to check whether a particular ObjectRef has errored without blocking the caller. Feel free to request this on GitHub!
