How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Frames stored in the object store are sometimes freed internally by Ray while references to them still exist, resulting in: “Failed to retrieve object. The object has already been deleted by the reference counting protocol. This should not happen.”
Flow:
- “FF” actor gets frame objects from a remote task “get”
- “get” puts the frame into the object store
- “FF” actor passes these frames to “Gen” (generator)
- “FF” actor gets the processed frame objects back from the generator and passes each one to the “Printer” actor.
- “Printer” actor tries to read the frame from the object store.
Issue:
Sometimes the “Printer” actor fails to read the object from the store because Ray has already cleared it.
import ray
import asyncio
from typing import List
import numpy as np
import ray._private.internal_api


class Frame:
    frame: np.ndarray = None

    def clear_frame(self) -> None:
        if isinstance(self.frame, ray.ObjectRef):
            ray._private.internal_api.free(self.frame)
        else:
            del self.frame


@ray.remote
class Gen:
    async def process(self, l: List[ray.ObjectRef]):
        l = await asyncio.gather(*l)
        l: List[Frame] = list(filter(None, l))
        for frame in l:
            if np.random.uniform() <= 0.5:
                await asyncio.sleep(0.01)
                frame.clear_frame()
                frame.frame = ray.put(np.zeros((1,1,3)))
                yield frame
                continue
            frame.clear_frame()


@ray.remote
def get():
    f = None
    if np.random.uniform() <= 0.5:
        f = Frame()
        f.frame = ray.put(np.zeros((1,1,3)))
    return f


@ray.remote
class Printer:
    async def process(self, f: Frame) -> None:
        try:
            await asyncio.sleep(3)
            _ = await f.frame
            f.clear_frame()
        except Exception as ex:
            print(ex)


@ray.remote
class FF:
    def __init__(self, pr: Printer) -> None:
        self.pr = pr

    async def start(self):
        g = Gen.remote()
        while True:
            l = [get.remote() for _ in range(5)]
            async for ref in g.process.remote(l):
                try:
                    f: Frame = await ref
                    self.pr.process.remote(f)
                except Exception as ex:
                    print(ex)
            await asyncio.sleep(1)


async def main():
    pr = Printer.remote()
    ff = FF.remote(pr)
    await ff.start.remote()


asyncio.run(main())
Sample Output:
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db060000007fe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653)
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000abe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653)
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000bee3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653)
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db06000000fbe3f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653)
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
(Printer pid=653) Failed to retrieve object 00f777fc1bcbd0f2af58bd9c345a9854b6b6d3db0600000010e4f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(Printer pid=653)
(Printer pid=653) The object has already been deleted by the reference counting protocol. This should not happen.
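To narrow this down, the log's own hint about `RAY_record_ref_creation_sites` can be followed, and the failing object IDs can be compared with each other. The sketch below is only a diagnostic aid, not a fix. The helper names (`enable_ref_creation_sites`, `failed_object_ids`, `shared_prefix`) are hypothetical, and it assumes that setting the variable in the driver process before Ray starts is enough for a local, single-machine run.

```python
import os
import re
from typing import List

# Matches the object ID in lines such as
# "(Printer pid=653) Failed to retrieve object 00f7...f505. To see ..."
_FAILED_OBJECT = re.compile(r"Failed to retrieve object ([0-9a-f]+)")


def enable_ref_creation_sites() -> None:
    """Set the variable named in the error message.

    Assumption: this must run before Ray is initialized (explicitly via
    ray.init() or implicitly via the first .remote() call).
    """
    os.environ["RAY_record_ref_creation_sites"] = "1"


def failed_object_ids(log_text: str) -> List[str]:
    """Return every object ID reported as failed, in order of appearance."""
    return _FAILED_OBJECT.findall(log_text)


def shared_prefix(object_ids: List[str]) -> str:
    """Return the longest leading substring common to all given IDs."""
    return os.path.commonprefix(object_ids)
```

Applied to the sample output above, all five failing IDs share their first 48 hex characters and differ only in the last 8, which suggests (but does not prove) that the lost objects were all created by the same worker or actor.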