Ray internally deletes an object store object while a reference to it still persists

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Frames stored in the object store are sometimes freed internally by Ray even though a reference to them still exists, which results in: “Failed to retrieve object. The object has already been deleted by the reference counting protocol. This should not happen.”

Flow:

  1. “FF” actor gets frame objects from a remote task “get”
  2. “get” puts the frame into the object store
  3. “FF” actor passes these frames to “Gen” (generator)
  4. “FF” actor gets processed frame-objects from the generator and passes each frame object to “Printer” actor.
  5. “Printer” actor tries to read the frame from the object store.

Issue:
Sometimes the “Printer” actor fails to read the object from the store because Ray has already cleared it.

import ray
import asyncio
from typing import List
import numpy as np
import ray._private.internal_api

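class Frame:
    # Holds either an in-memory array or a ray.ObjectRef to one in the object store.
    frame: np.ndarray = None

    def clear_frame(self) -> None:
        if isinstance(self.frame, ray.ObjectRef):
            # Private, low-level Ray API: asks the object stores to free this object.
            ray._private.internal_api.free(self.frame)
        else:
            # Plain in-memory value: just drop the instance attribute.
            del self.frame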

@ray.remote
class Gen:

    async def process(self, l: List[ray.ObjectRef]):
        l = await asyncio.gather(*l)
        l: List[Frame] = list(filter(None, l))

        for frame in l:
            if np.random.uniform() <= 0.5:
                await asyncio.sleep(0.01)
                frame.clear_frame()
                frame.frame = ray.put(np.zeros((1,1,3)))
                yield frame
                continue
            frame.clear_frame()

@ray.remote
def get():
    f = None
    if np.random.uniform() <= 0.5:
        f = Frame()
        f.frame = ray.put(np.zeros((1,1,3)))
    return f

@ray.remote
class Printer:

    async def process(self, f: Frame) -> None:
        try:
            await asyncio.sleep(3)
            _ = await f.frame
            f.clear_frame()
        except Exception as ex:
            print(ex)

@ray.remote
class FF:

    def __init__(self, pr: Printer) -> None:
        self.pr = pr

    async def start(self):
        g = Gen.remote()
        while True:
            l = [get.remote() for _ in range(5)]
            async for ref in g.process.remote(l):
                try:
                    f: Frame = await ref
                    self.pr.process.remote(f)
                except Exception as ex:
                    print(ex)
            await asyncio.sleep(1)


async def main():
    pr = Printer.remote()
    ff = FF.remote(pr)
    await ff.start.remote()

asyncio.run(main())
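
One way to narrow this down, offered as a sketch and not as a confirmed fix: swap in a variant of Frame.clear_frame that never calls the private free() and only drops its own reference, leaving cleanup to Ray's reference counting. Whether the manual free() call contributes to the error is an assumption to be tested. The Payload class below is a hypothetical stand-in so the sketch runs without a Ray cluster; in the script above, frame.frame would hold a ray.ObjectRef instead.

```python
import gc
import weakref
from typing import Any, Optional


class Frame:
    """Hypothetical variant of the script's Frame that never calls free()."""

    # Either an in-memory array or a ray.ObjectRef (typed loosely on purpose).
    frame: Optional[Any] = None

    def clear_frame(self) -> None:
        # Drop only this holder's reference. For a ray.ObjectRef, Ray's
        # reference counting is then responsible for reclaiming the object
        # once no other references remain (assumption: no manual free needed).
        self.frame = None


class Payload:
    """Stand-in for a frame payload; not part of the original script."""


f = Frame()
f.frame = Payload()
probe = weakref.ref(f.frame)  # lets us observe when the payload is reclaimed

f.clear_frame()
gc.collect()

# Reports whether the payload was reclaimed after the last reference was dropped.
print("payload reclaimed:", probe() is None)
```

If the error disappears with this variant, that points at the interaction between the manual free() and references still held elsewhere; if it persists, the manual free() is less likely to be the cause.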

Sample Output:

(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db060000007fe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000abe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000bee3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000fbe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db0600000010e4f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
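
The log above suggests setting RAY_record_ref_creation_sites=1 to see where each ObjectRef was created. A minimal sketch of doing that from the driver script follows; the helper name enable_ref_creation_sites is made up for illustration, and it is an assumption that setting the variable in the driver's environment before ray.init() is enough for a locally started cluster. For a cluster started with `ray start`, the message says the variable must also be set there.

```python
import os


def enable_ref_creation_sites(environ=None):
    """Set the flag named in Ray's error message (hypothetical helper).

    Call this before ray.init() so the setting is already in the environment
    when Ray starts.
    """
    target = os.environ if environ is None else environ
    target["RAY_record_ref_creation_sites"] = "1"
    return target


enable_ref_creation_sites()

# Afterwards, in the driver:
#   import ray
#   ray.init()
```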

I am also facing a similar issue in our pipeline. @jjyao, can you please check the given script and see if you can reproduce the issue? I guess it could also help to resolve the following: Error while fetching data from Object Store

This smells like something that could be rather serious; can you please file a bug on GitHub?

Can you also provide the Ray and Python versions that you are on?
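
A small sketch for collecting those versions in one go, using only the standard library; the function name report_versions and the choice to include numpy are just for illustration. Packages that are not installed are reported as such.

```python
import platform
from importlib import metadata


def report_versions(packages=("ray", "numpy")):
    """Return the Python version plus the installed version of each package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


print(report_versions())
```

Pasting the printed dictionary into the bug report gives the exact versions without any guesswork.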

Hey @Sam_Chan, I’m using Python 3.8 and Ray 2.10.

Thanks. In the GH ticket, can you supply a repro script? Can you also upgrade to Python 3.9+ and the latest Ray (or at least 2.9.3 onwards) and reconfirm the repro?

Sure! I’ll try to reproduce it with Python 3.9 and Ray 2.9.3+, and I’ll raise an issue on GitHub.
Thanks
