How to find if an ObjectRef failed without an expensive ray.get() call?

Hi all,

I have a question about checking ObjectRefs for failure before calling ray.get(). The use case is that I may have many outstanding ObjectRefs, and if any of them fail I want to exit my program immediately. However, these ObjectRefs are intermediate values, and I’d rather not call ray.get() on the head node, because the contents are pretty hefty. My understanding is that ray.get() would incur the IO cost of moving those values to the head node. Ideally I would be able to determine whether any failed and, if not, invoke more Ray remote functions to consume the intermediate ObjectRefs.

It looks like maybe in the past you could use ray.error_info?

Maybe I could call ObjectRef.as_future and use the exception check on the resulting future (Futures — Python 3.9.5 documentation)? Or is the exception only available after an expensive ray.get() call?

Hi,

I have a similar problem. I am trying to invoke a sequence of ray.remote calls and only issue the next set if all the previous calls succeeded. I use finished, _ = ray.wait(..., fetch_local=False) to wait for completion, but then I’d like to check (in an inexpensive way) whether all the elements of finished ran without an exception.
How can I go about doing that with Ray?
Thanks,

Tom

Hi
I have the same issue Tomas mentioned. Can anyone help us out?
Thanks

Unfortunately there’s no built-in API to do this, but I think the ray.wait method is on the right track. To avoid transferring the data back to the driver, you could try submitting a second round of no-op tasks and make sure that these complete, like so:

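import ray

@ray.remote
def noop(x):
  # The body is irrelevant; the task only runs if its argument x resolved without error.
  return

# expensive_task is assumed to be an existing @ray.remote function.
xs = [expensive_task.remote() for _ in range(10)]
# num_returns defaults to 1, so ask for all refs; fetch_local=False keeps the data off the driver.
finished, _ = ray.wait(xs, num_returns=len(xs), fetch_local=False)
for x in finished:
  try:
    # If the task that produced x failed, noop inherits the error and ray.get re-raises it.
    ray.get(noop.remote(x))
  except ray.exceptions.RayTaskError as e:
    print("Task", x, "failed:", e)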

@Stephanie_Wang Would it make sense to request an extension to the ray.wait method, or does that go completely against the concept?

It would probably be better to add a new API call to check whether a particular ObjectRef has errored without blocking the caller. Feel free to request this on GitHub!