Ray internally deleting object store objects while references still persist

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Frames stored in the object store are sometimes getting internally freed by Ray (while references to them still persist), resulting in: “Failed to retrieve object. The object has already been deleted by the reference counting protocol. This should not happen.”

Flow:

  1. The “FF” actor gets frame objects from a remote task, “get”.
  2. “get” puts the frame into the object store.
  3. The “FF” actor passes these frames to “Gen” (a generator actor).
  4. The “FF” actor gets processed frame objects back from the generator and passes each one to the “Printer” actor.
  5. The “Printer” actor tries to read the frame from the object store.

Issue:
Sometimes the “Printer” actor fails to read the object from the store because it has already been cleared by Ray.

import ray
import asyncio
from typing import List
import numpy as np
import ray._private.internal_api

class Frame:
    # `frame` holds a ray.ObjectRef pointing at the actual ndarray in the object store.
    frame: ray.ObjectRef = None

    def clear_frame(self) -> None:
        if isinstance(self.frame, ray.ObjectRef):
            # Explicitly free the underlying object from the object store.
            ray._private.internal_api.free(self.frame)
        else:
            del self.frame

@ray.remote
class Gen:

    async def process(self, l: List[ray.ObjectRef]):
        # Resolve the refs returned by get() and drop the Nones.
        l = await asyncio.gather(*l)
        l: List[Frame] = list(filter(None, l))

        for frame in l:
            if np.random.uniform() <= 0.5:
                await asyncio.sleep(0.01)
                # Free the incoming frame and replace it with a freshly put one.
                frame.clear_frame()
                frame.frame = ray.put(np.zeros((1,1,3)))
                yield frame
                continue
            frame.clear_frame()

@ray.remote
def get():
    f = None
    if np.random.uniform() <= 0.5:
        f = Frame()
        f.frame = ray.put(np.zeros((1,1,3)))
    return f

@ray.remote
class Printer:

    async def process(self, f: Frame) -> None:
        try:
            await asyncio.sleep(3)
            # This read intermittently fails: the object has already been freed.
            _ = await f.frame
            f.clear_frame()
        except Exception as ex:
            print(ex)

@ray.remote
class FF:

    def __init__(self, pr: Printer) -> None:
        self.pr = pr

    async def start(self):
        g = Gen.remote()
        while True:
            l = [get.remote() for _ in range(5)]
            # Stream processed frames out of the generator actor.
            async for ref in g.process.remote(l):
                try:
                    f: Frame = await ref
                    self.pr.process.remote(f)
                except Exception as ex:
                    print(ex)
            await asyncio.sleep(1)


async def main():
    pr = Printer.remote()
    ff = FF.remote(pr)
    await ff.start.remote()

asyncio.run(main())

Sample Output:

(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db060000007fe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000abe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000bee3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000fbe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db0600000010e4f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.

I am also facing a similar issue in our pipeline. @jjyao, can you please check the given script and see if you can reproduce the issue? I think it can help to resolve the following: Error while fetching data from Object Store

This smells like something that could be rather serious; can you please log a bug on GitHub?

Can you also provide the Ray and Python versions that you are on?

Hey @Sam_Chan, I’m using Python 3.8 & Ray 2.10.

Thanks. In the GH ticket, can you supply a repro script? And can you also upgrade to Python 3.9+ and the latest Ray (at least 2.9.3 onwards) and reconfirm the repro?

Sure!! I’ll try to reproduce it with Python 3.9 and Ray 2.9.3+,
and I’ll raise an issue on GitHub.
Thanks


Have you ever found the issue? I am still struggling with the same issue, and I can’t make sense of it at all.

Could you clarify which specific issue you are referring to? There are many types of issues discussed in Ray’s forums, such as actor creation failures, serialization errors, or cluster connection problems. If you can provide the error message or a brief description of your problem, I can point you to relevant solutions or troubleshooting steps.

For example, if you are struggling with actor creation or debugging failures, Ray provides guidance on debugging application failures, intentional system failures, and unintended system failures, including how to interpret error messages and use logs for troubleshooting. See the official Debugging Failures guide for more details. Let me know if you want more detail or if your issue is different.

@RunLLM I have a bunch of very similar processing steps done in Ray remote tasks.

One of those remote tasks runs on another node because it needs a GPU.

Anyway, it processes fine, but after the remote task returns its result and I pass the future into the next processing step, Ray can’t seem to find the object anymore, and I get: The object has already been deleted by the reference counting protocol. This should not happen.

The object I return has custom __getstate__ and __setstate__ methods that convert one of the attributes (a Modin DataFrame) into a list of Ray ObjectRefs; __setstate__ then recreates the Modin DataFrame from that list of refs. This was needed so that it is still a Modin DataFrame after deserialisation.
The exact same thing happens in all the other processing steps, which work just fine; only this one step has the issue.

It seems something is wrong with the reference counting, and it is really heavily blocking me.
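
For context, the custom serialisation looks roughly like this (a simplified sketch, not my exact code, and assuming Modin’s unwrap_partitions/from_partitions helpers; the class and attribute names are placeholders):

import modin.pandas as mpd
from modin.distributed.dataframe.pandas import unwrap_partitions, from_partitions

class Result:
    def __init__(self, df: mpd.DataFrame, meta: dict):
        self.df = df      # Modin DataFrame backed by Ray partitions
        self.meta = meta

    def __getstate__(self):
        state = self.__dict__.copy()
        # Replace the Modin DataFrame with the ObjectRefs of its partitions
        # so the object can be pickled and shipped between workers.
        state["df"] = unwrap_partitions(self.df, axis=0)
        return state

    def __setstate__(self, state):
        # Rebuild a Modin DataFrame from the partition ObjectRefs so the
        # attribute is still a Modin DataFrame after deserialisation.
        state["df"] = from_partitions(state["df"], axis=0)
        self.__dict__.update(state)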

The error “The object has already been deleted by the reference counting protocol. This should not happen.” is a ReferenceCountingAssertionError in Ray, and it indicates that the object was deleted from the object store while there was still a reference to it. This can happen if the distributed reference counting protocol is not correctly tracking all references, especially when custom serialization/deserialization logic is involved or when ObjectRefs are passed through intermediate workers or actors that may die or lose references before the final consumer gets them.

According to ReferenceCountingAssertionError docs and this Ray issue, this can occur if an intermediate worker (or actor) that holds an ObjectRef dies or fails to notify its parent about new child refs, or if custom serialization logic (like your custom __getstate__/__setstate__) does not properly preserve all references. The workaround is to ensure that ObjectRefs are passed directly from the owner to the final consumer, and to avoid passing them through intermediate nodes that may exit or lose references. Also, double-check that your custom serialization logic does not drop or overwrite ObjectRefs in a way that breaks Ray’s reference counting.
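
For illustration only (the task names below are made up, not from your pipeline), here is a minimal sketch of passing an ObjectRef straight from its owner to the final consumer:

import ray
import numpy as np

ray.init(ignore_reinit_error=True)

@ray.remote
def consumer(arr: np.ndarray) -> float:
    # Ray resolves the top-level ObjectRef argument to its value for us.
    return float(arr.sum())

# The driver owns the object; keeping `ref` in scope until the consumer has
# finished means the reference count never drops to zero prematurely.
ref = ray.put(np.ones((100, 100)))
print(ray.get(consumer.remote(ref)))  # ref goes straight owner -> consumer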

Would you like a step-by-step breakdown of how to debug or refactor your code to avoid this issue?

@RunLLM Sure, give me the debugging steps one by one.

But I repeat: in all other cases the same class is used, with the same custom serialisation.
Also, I don’t seem to see any intermediate actor or task, at least none I can spare.

Certainly! Here’s a step-by-step debugging approach, tailored to your scenario:

  1. Reproduce the Error in Minimal Code: Isolate the problematic step in a minimal script, using the same class and serialization logic. This helps confirm if the issue is specific to the GPU node or the data flow in that step (Ray issue #18456).

  2. Check ObjectRef Lifetimes: Ensure that all ObjectRefs (especially those in your custom __getstate__/__setstate__) are not being deleted or going out of scope before the next processing step consumes them. Use ray.util.state.summarize_objects() to inspect live references and see if the object is still present after the GPU step (see the sketch after this list).

  3. Direct Passing of ObjectRefs: Make sure the ObjectRefs are passed directly from the producing task to the consuming task, without being routed through actors or tasks that might terminate early or lose references (Ray issue #18456).

  4. Check for Node/Worker Failures: Review Ray logs and the dashboard for any worker or node failures, especially on the GPU node. If a worker dies before reference updates propagate, the object may be deleted prematurely.

  5. Validate Custom Serialization: Double-check your custom __getstate__ and __setstate__ to ensure all ObjectRefs are correctly serialized and deserialized, and that no references are dropped or replaced with new ones unexpectedly.

  6. Test with No Custom Serialization: Temporarily remove or bypass your custom serialization logic to see if the problem persists. If it disappears, the issue is likely in the custom logic.

  7. Ray Version: Ensure you are using a recent Ray version, as some reference counting bugs have been fixed in newer releases (Ray issue #31421).
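
As a rough illustration of step 2 (names like suspect_ref are placeholders, and this assumes the Ray state API available in recent releases):

import ray
from ray.util.state import list_objects, summarize_objects

ray.init(ignore_reinit_error=True)

# High-level view: how many objects exist in the cluster, grouped by callsite.
print(summarize_objects())

# Detailed view: look up one ObjectRef by its hex ID to check whether it still
# has live references after the GPU step.
suspect_ref = ray.put(b"example payload")  # placeholder object
for row in list_objects(filters=[("object_id", "=", suspect_ref.hex())]):
    print(row.object_id, row.reference_type, row.task_status)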

Would you like more detail on any of these steps, or help with a code example for a specific check?

@RunLLM The refs are running through a few actors and tasks.

driver {
    some obj
    obj = remote_task.remote(obj) {
        actor = RemoteActor.remote(obj)
        actor.some_processing.remote()
        obj = ray.get(actor.get_data.remote())
        postprocess(obj)
        return obj
    }
    other_remote_task.remote(obj)  # dies here
}

The actors and tasks before the failure don’t fail and all exit gracefully, AFAIK.

No, dropping the custom serialisation does not work, since it wrongly changes the type from a Modin DataFrame to a plain pandas DataFrame, which is very bad for very large data.

Given your flow, the error is likely due to Ray’s distributed reference counting protocol: if an ObjectRef is passed through intermediate actors/tasks and one of them dies or loses the reference before the final consumer, the object can be deleted prematurely, resulting in the “object has already been deleted” error (Ray issue #18456). Even if actors/tasks exit “gracefully,” if the reference is not properly propagated or is lost before the final use, this can still occur.

To debug, ensure that every actor/task in the chain maintains a reference to the ObjectRef until all downstream consumers are done. Avoid deleting or letting go of the ObjectRef in any intermediate step before the final processing is complete. Also, double-check that your custom serialization for Modin DataFrames preserves all ObjectRefs and does not inadvertently drop or replace them.
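
As a hedged sketch (all names hypothetical, only loosely following your flow), holding the reference in the driver until the downstream consumer has finished could look like this:

import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def gpu_step(data):
    # Stand-in for the processing done on the GPU node.
    return data

@ray.remote
def next_step(data):
    # Final consumer; Ray resolves the ObjectRef argument to its value.
    return data

obj_ref = gpu_step.remote({"payload": 42})   # the driver owns the result ref
final_ref = next_step.remote(obj_ref)        # pass the ref directly onward
result = ray.get(final_ref)                  # obj_ref stays alive until here
del obj_ref                                  # only drop it after consumption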

Would you like a code example or more detail on how to ensure references are held correctly in such a workflow?
