Fetch for object reference timed out because no locations were found for the object

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have a pipeline of cascading actors in which each actor's output references are passed to the next actor, and the final outputs are waited on. The error below occurs sporadically in a long-running pipeline.

ray.exceptions.RayTaskError: ray::Undistortion.undistort() (pid=48319, ip=10.40.40.110, actor_id=13ee0523ecd038ab42d60dc302000000, repr=Undistort)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object 49722a863e8b479f01138d7466c34018589fa72d0200000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

Fetch for object 49722a863e8b479f01138d7466c34018589fa72d0200000001000000 timed out because no locations were found for the object. This may indicate a system-level bug.

How can I debug or resolve this error?

Environment details

Ubuntu: 20.04
python: 3.8.10
ray: 2.8.0

Hey Buva, TPM @ Anyscale here… a couple of follow-up questions.

By cascading process actors do you mean you have your actors create more actors? Can you share a repro script (follow up: how sporadic and how long running)?

If it’s practical for you, can you grab the latest Ray version and see if it still reproduces?

By cascading processes I meant a sequence of operations performed by individual actors on a piece of data, where each actor's output reference is fed as input to the next actor in the sequence. All the actors are created once at the beginning of the pipeline by a main supervising actor, which also handles passing the object/output references between actors.

I'm not sure whether a reproducible script is possible, as the original pipeline involves much more complex cascading and parallel connections in the data processing flow, but I will try to reproduce the error with a simpler pipeline that has fewer actor processes and a short run duration.

The error appears very randomly: the 1st instance came after running the pipeline for 7 days, the 2nd after 1 hour, and the 3rd after 1 day. The pipeline has also had a chance to run for more than 7 days (approx. 2 weeks) at a stretch without this error appearing. This is another reason I am unsure about creating a reproducible script.

I will check whether it reproduces with the latest Ray version, but that may take some time, depending on the amount of changes needed to the whole pipeline and the sporadic nature of the error.