Plasma Object Ownership, Actors and ObjectLostError

I am currently on a single node, using actors to run predictions through a predict method. In this function I batch my data for a set of N actors (each with a loaded model) to predict on a list of [(filename1, inputs1), (filename2, inputs2)].

For each prediction I append the results in the same manner and return:

return [(file, ray.put(result)) for (file, result) in zip([filename1, filename2], [y1, y2])]

These tasks are created inside a function. When the function finishes, it collects the results of all predictions from the set of N actors and returns the combined list of (filename, object_reference) pairs, which is then passed to another function to be saved to file in parallel.

I keep on getting:

ObjectLostError: Object c37ebc3b266087738b206d90405dd14436e13cba1400000004000000 is lost due to node failure.

After a lot of debugging, I found that if I also return the list of actors from the function, I do not get the error, which suggests that the references are somehow tied to the actors.
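One way to picture this observation is a toy model in plain Python. It assumes that the process calling ray.put owns the resulting object, and that the object becomes unreadable once its owner exits. The ToyObjectStore class and its methods are invented for illustration and are not part of Ray.

```python
class ToyObjectStore:
    """Toy stand-in for an object store that tracks which worker owns each object."""

    def __init__(self):
        self._objects = {}  # ref -> (owner, value)
        self._next_ref = 0

    def put(self, owner, value):
        # Record the value together with the worker that created it.
        ref = self._next_ref
        self._next_ref += 1
        self._objects[ref] = (owner, value)
        return ref

    def get(self, ref):
        if ref not in self._objects:
            raise LookupError(f"object {ref} is lost: its owner has exited")
        return self._objects[ref][1]

    def owner_exited(self, owner):
        # Assumption of this model: every object owned by the exiting worker is dropped.
        self._objects = {
            ref: (o, v) for ref, (o, v) in self._objects.items() if o != owner
        }


store = ToyObjectStore()
ref = store.put(owner="actor-1", value=[0.1, 0.9])
print(store.get(ref))          # readable while actor-1 is alive
store.owner_exited("actor-1")  # models the last actor handle going out of scope
try:
    store.get(ref)
except LookupError as err:
    print(err)
```

In this picture, returning the actors alongside the references keeps the owners alive, so the references stay readable. Whether Ray behaves exactly this way in my case is an assumption on my part.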

I also tried wrapping the object references in lists to keep the references alive when I return them from the function, but no luck.

I am not sure if this is a bug or not, but it seems counterintuitive.

I believe this is the cause of the issue/error posted in this thread: ray.exceptions.ObjectLostError: Object xxx is lost due to node failure

Hi @Javier_Bosch, do you have reproduction code that you can share? Even pseudocode would probably help to see if there's anything obvious.

Which Ray version are you using?

I think I might know why this happens, but could you explain your script in code? I am not sure I understand it correctly:

@ray.remote
class Actor:
    def predict(self, batch):
        # do something
        return [(file, ray.put(result))....]

So I understand this is your actor code, but I am still a little confused about this part:

These tasks are created inside a function. When the function finishes, it collects the results of all predictions from the set of N actors and returns the combined list of (filename, object_reference) pairs, which is then passed to another function to be saved to file in parallel.

Hi @kai, I am using version 1.4. I had this problem with 1.2 and upgraded to 1.4, but the error remained. I believe I have a minimal reproducible example. Sometimes it fails, sometimes it doesn't, and I am not sure why. I will try to find some time to post my code soon.