Ray internally deleting object store objects while references still persist

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Frames stored in the object store are sometimes getting internally freed by Ray (while references to them still persist), resulting in: “Failed to retrieve object. The object has already been deleted by the reference counting protocol. This should not happen.”

Flow:

  1. The “FF” actor gets frame objects from a remote task, “get”.
  2. “get” puts the frame into the object store.
  3. The “FF” actor passes these frames to “Gen” (a generator actor).
  4. The “FF” actor gets processed frame objects back from the generator and passes each one to the “Printer” actor.
  5. The “Printer” actor tries to read the frame from the object store.

Issue:
Sometimes the “Printer” actor fails to read the object from the store because it has already been cleared by Ray.

import ray
import asyncio
from typing import List
import numpy as np
import ray._private.internal_api

class Frame:
    # `frame` holds a ray.ObjectRef pointing at the actual ndarray in the object store.
    frame: ray.ObjectRef = None

    def clear_frame(self) -> None:
        if isinstance(self.frame, ray.ObjectRef):
            # Explicitly free the underlying object from the object store.
            ray._private.internal_api.free(self.frame)
        else:
            del self.frame

@ray.remote
class Gen:

    async def process(self, l: List[ray.ObjectRef]):
        # Resolve the refs returned by get() and drop the Nones.
        l = await asyncio.gather(*l)
        l: List[Frame] = list(filter(None, l))

        for frame in l:
            if np.random.uniform() <= 0.5:
                await asyncio.sleep(0.01)
                # Free the incoming frame and replace it with a freshly put one.
                frame.clear_frame()
                frame.frame = ray.put(np.zeros((1,1,3)))
                yield frame
                continue
            frame.clear_frame()

@ray.remote
def get():
    f = None
    if np.random.uniform() <= 0.5:
        f = Frame()
        f.frame = ray.put(np.zeros((1,1,3)))
    return f

@ray.remote
class Printer:

    async def process(self, f: Frame) -> None:
        try:
            await asyncio.sleep(3)
            # This read intermittently fails: the object has already been freed.
            _ = await f.frame
            f.clear_frame()
        except Exception as ex:
            print(ex)

@ray.remote
class FF:

    def __init__(self, pr: Printer) -> None:
        self.pr = pr

    async def start(self):
        g = Gen.remote()
        while True:
            l = [get.remote() for _ in range(5)]
            # Stream processed frames out of the generator actor.
            async for ref in g.process.remote(l):
                try:
                    f: Frame = await ref
                    self.pr.process.remote(f)
                except Exception as ex:
                    print(ex)
            await asyncio.sleep(1)


async def main():
    pr = Printer.remote()
    ff = FF.remote(pr)
    await ff.start.remote()

asyncio.run(main())

Sample Output:

(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db060000007fe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000abe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000bee3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000fbe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db0600000010e4f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653) 
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.

I am also facing a similar issue in our pipeline. @jjyao, can you please check the given script and see if you can reproduce the issue? I think it can help to resolve the following: Error while fetching data from Object Store

This smells like something that could be rather serious; can you please log a bug on GitHub?

Can you also provide the Ray and Python versions that you are on?

Hey @Sam_Chan, I’m using Python 3.8 & Ray 2.10.

Thanks. In the GH ticket, can you supply a repro script? And can you also upgrade to Python 3.9+ and the latest Ray (at least 2.9.3 onwards) and reconfirm the repro?

Sure!! I’ll try to reproduce it with Python 3.9 and Ray 2.9.3+,
and I’ll raise an issue on GitHub.
Thanks


Have you ever found the issue? I am still struggling with the same issue, and I can’t make sense of it at all.

Could you clarify which specific issue you are referring to? There are many types of issues discussed in Ray’s forums, such as actor creation failures, serialization errors, or cluster connection problems. If you can provide the error message or a brief description of your problem, I can point you to relevant solutions or troubleshooting steps.

For example, if you are struggling with actor creation or debugging failures, Ray provides guidance on debugging application failures, intentional system failures, and unintended system failures, including how to interpret error messages and use logs for troubleshooting. See the official Debugging Failures guide for more details. Let me know if you want more detail or if your issue is different.

@RunLLM I have a bunch of very similar processing steps done in Ray remote tasks.

One of those remote tasks runs on another node because it needs a GPU.

Anyway, it processes fine, but after the remote task returns its result and I pass the future into the next processing step, Ray can’t seem to find the object anymore, and I get: The object has already been deleted by the reference counting protocol. This should not happen.

The object I return has custom __getstate__ and __setstate__ methods that convert one of the attributes (a Modin DataFrame) into a list of Ray ObjectRefs; __setstate__ then recreates the Modin DataFrame from that list of refs. This was needed so that it is still a Modin DataFrame after deserialisation.
The exact same thing happens in all the other processing steps, which work just fine; only this one step has the issue.

It seems something is wrong with the reference counting, and it is really heavily blocking me.
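
For context, the custom serialisation looks roughly like this (a simplified sketch, not my exact code, and assuming Modin’s unwrap_partitions/from_partitions helpers; the class and attribute names are placeholders):

import modin.pandas as mpd
from modin.distributed.dataframe.pandas import unwrap_partitions, from_partitions

class Result:
    def __init__(self, df: mpd.DataFrame, meta: dict):
        self.df = df      # Modin DataFrame backed by Ray partitions
        self.meta = meta

    def __getstate__(self):
        state = self.__dict__.copy()
        # Replace the Modin DataFrame with the ObjectRefs of its partitions
        # so the object can be pickled and shipped between workers.
        state["df"] = unwrap_partitions(self.df, axis=0)
        return state

    def __setstate__(self, state):
        # Rebuild a Modin DataFrame from the partition ObjectRefs so the
        # attribute is still a Modin DataFrame after deserialisation.
        state["df"] = from_partitions(state["df"], axis=0)
        self.__dict__.update(state)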

The error “The object has already been deleted by the reference counting protocol. This should not happen.” is a ReferenceCountingAssertionError in Ray, and it indicates that the object was deleted from the object store while there was still a reference to it. This can happen if the distributed reference counting protocol is not correctly tracking all references, especially when custom serialization/deserialization logic is involved or when ObjectRefs are passed through intermediate workers or actors that may die or lose references before the final consumer gets them.

According to ReferenceCountingAssertionError docs and this Ray issue, this can occur if an intermediate worker (or actor) that holds an ObjectRef dies or fails to notify its parent about new child refs, or if custom serialization logic (like your custom __getstate__/__setstate__) does not properly preserve all references. The workaround is to ensure that ObjectRefs are passed directly from the owner to the final consumer, and to avoid passing them through intermediate nodes that may exit or lose references. Also, double-check that your custom serialization logic does not drop or overwrite ObjectRefs in a way that breaks Ray’s reference counting.
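
For illustration only (the task names below are made up, not from your pipeline), here is a minimal sketch of passing an ObjectRef straight from its owner to the final consumer:

import ray
import numpy as np

ray.init(ignore_reinit_error=True)

@ray.remote
def consumer(arr: np.ndarray) -> float:
    # Ray resolves the top-level ObjectRef argument to its value for us.
    return float(arr.sum())

# The driver owns the object; keeping `ref` in scope until the consumer has
# finished means the reference count never drops to zero prematurely.
ref = ray.put(np.ones((100, 100)))
print(ray.get(consumer.remote(ref)))  # ref goes straight owner -> consumer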

Would you like a step-by-step breakdown of how to debug or refactor your code to avoid this issue?

@RunLLM Sure, give me the debugging steps one by one.

But I repeat: in all other cases the same class is used, with the same custom serialisation.
Also, I don’t seem to see any intermediate actor or task, at least none I can spare.

Certainly! Here’s a step-by-step debugging approach, tailored to your scenario:

  1. Reproduce the Error in Minimal Code: Isolate the problematic step in a minimal script, using the same class and serialization logic. This helps confirm if the issue is specific to the GPU node or the data flow in that step (Ray issue #18456).

  2. Check ObjectRef Lifetimes: Ensure that all ObjectRefs (especially those in your custom __getstate__/__setstate__) are not being deleted or going out of scope before the next processing step consumes them. Use ray.util.state.summarize_objects() to inspect live references and see if the object is still present after the GPU step (see the sketch after this list).

  3. Direct Passing of ObjectRefs: Make sure the ObjectRefs are passed directly from the producing task to the consuming task, without being routed through actors or tasks that might terminate early or lose references (Ray issue #18456).

  4. Check for Node/Worker Failures: Review Ray logs and the dashboard for any worker or node failures, especially on the GPU node. If a worker dies before reference updates propagate, the object may be deleted prematurely.

  5. Validate Custom Serialization: Double-check your custom __getstate__ and __setstate__ to ensure all ObjectRefs are correctly serialized and deserialized, and that no references are dropped or replaced with new ones unexpectedly.

  6. Test with No Custom Serialization: Temporarily remove or bypass your custom serialization logic to see if the problem persists. If it disappears, the issue is likely in the custom logic.

  7. Ray Version: Ensure you are using a recent Ray version, as some reference counting bugs have been fixed in newer releases (Ray issue #31421).
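
As a rough illustration of step 2 (names like suspect_ref are placeholders, and this assumes the Ray state API available in recent releases):

import ray
from ray.util.state import list_objects, summarize_objects

ray.init(ignore_reinit_error=True)

# High-level view: how many objects exist in the cluster, grouped by callsite.
print(summarize_objects())

# Detailed view: look up one ObjectRef by its hex ID to check whether it still
# has live references after the GPU step.
suspect_ref = ray.put(b"example payload")  # placeholder object
for row in list_objects(filters=[("object_id", "=", suspect_ref.hex())]):
    print(row.object_id, row.reference_type, row.task_status)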

Would you like more detail on any of these steps, or help with a code example for a specific check?

@RunLLM The refs are running through a few actors and tasks.

driver {
    some obj
    obj = remote_task.remote(obj) {
        actor = RemoteActor.remote(obj)
        actor.some_processing.remote()
        obj = ray.get(actor.get_data.remote())
        postprocess(obj)
        return obj
    }
    other_remote_task.remote(obj)  # dies here
}

The actors and tasks before the failure don’t fail and all exit gracefully, AFAIK.

No, dropping the custom serialisation does not work, since it wrongly changes the type from a Modin DataFrame to a plain pandas DataFrame, which is very bad for very large data.

Given your flow, the error is likely due to Ray’s distributed reference counting protocol: if an ObjectRef is passed through intermediate actors/tasks and one of them dies or loses the reference before the final consumer, the object can be deleted prematurely, resulting in the “object has already been deleted” error (Ray issue #18456). Even if actors/tasks exit “gracefully,” if the reference is not properly propagated or is lost before the final use, this can still occur.

To debug, ensure that every actor/task in the chain maintains a reference to the ObjectRef until all downstream consumers are done. Avoid deleting or letting go of the ObjectRef in any intermediate step before the final processing is complete. Also, double-check that your custom serialization for Modin DataFrames preserves all ObjectRefs and does not inadvertently drop or replace them.
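
As a hedged sketch (all names hypothetical, only loosely following your flow), holding the reference in the driver until the downstream consumer has finished could look like this:

import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def gpu_step(data):
    # Stand-in for the processing done on the GPU node.
    return data

@ray.remote
def next_step(data):
    # Final consumer; Ray resolves the ObjectRef argument to its value.
    return data

obj_ref = gpu_step.remote({"payload": 42})   # the driver owns the result ref
final_ref = next_step.remote(obj_ref)        # pass the ref directly onward
result = ray.get(final_ref)                  # obj_ref stays alive until here
del obj_ref                                  # only drop it after consumption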

Would you like a code example or more detail on how to ensure references are held correctly in such a workflow?
