ray.exceptions.OwnerDiedError: Failed to retrieve object

mmuru · November 25, 2021, 4:29am

I'm using the daily Ray build on a Mac with a manually started cluster (1 head node and 1 worker node), and I tried to load a 40M-row dataset (ray.data.dataset). The data loads, but the following errors are logged:

(DataLoadWorker pid=15003) [2021-11-24 17:20:05,789 E 15003 626393] core_worker.h:1110: Mismatched WorkerID: ignoring RPC for previous worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff, current worker ID: f56704bbf468f00ffacadf47bf7254d5fbd8d25a019e6d7c2a2331e5
(DataLoadWorker pid=15003) [2021-11-24 17:20:05,818 E 15003 626393] core_worker.h:1110: Mismatched WorkerID: ignoring RPC for previous worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff, current worker ID: f56704bbf468f00ffacadf47bf7254d5fbd8d25a019e6d7c2a2331e5

When I try to access the data, it throws the following exception:

ERROR: ray::DataLoadWorker.get_pa_table() (pid=15004, ip=192.168.1.69, repr=<data_load_worker.DataLoadWorker object at 0x119848ed0>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.OwnerDiedError: Failed to retrieve object ffffffffffffffffffffffffffffffffffffffff01000000b4000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

It works if I run it in local mode (single node), and it also works for smaller workloads in cluster mode.

Can someone please explain what RAY_record_ref_creation_sites=1 means? How do I fix this issue?

mmuru · November 25, 2021, 3:52pm

@sangcho or someone from the Ray team: could you please give feedback on this issue? Thanks.

sangcho · November 26, 2021, 2:43pm

Setting the environment variable RAY_record_ref_creation_sites=1 makes Ray print the call site (where the object was created). For example, if you start Ray with ray start:

RAY_record_ref_creation_sites=1 ray start --head
RAY_record_ref_creation_sites=1 python [driver]

then you can see where the object was created (this is disabled by default for performance reasons, since it incurs high overhead).
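If you would rather launch the driver from Python than from a shell, the same idea can be sketched roughly as below. This is only an illustration, not an official Ray helper: `env_with_ref_creation_sites` and `run_driver` are made-up names, and `driver_path` stands for your own script that calls `ray.init()`. The only thing taken from this thread is the variable name itself.

```python
import os
import subprocess
import sys

# Flag name taken from the error message in this thread.
FLAG = "RAY_record_ref_creation_sites"


def env_with_ref_creation_sites(base_env=None):
    """Return a copy of base_env (default: os.environ) with the flag set to "1"."""
    env = dict(os.environ if base_env is None else base_env)
    env[FLAG] = "1"
    return env


def run_driver(driver_path):
    """Run a driver script in a child process that has the flag set.

    driver_path is a hypothetical path to your own script that calls ray.init().
    """
    return subprocess.run(
        [sys.executable, driver_path],
        env=env_with_ref_creation_sites(),
        check=True,
    )
```

Calling `run_driver("my_driver.py")` (with a script name of your own) only covers the driver side. The `ray start` processes on the head and worker nodes still need the variable set when they are started, as the error message says.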

So you are saying you see this issue when you use ray.dataset, right? In that case, it is most likely a bug. Would you mind creating an issue on Ray's GitHub page?

mmuru · November 26, 2021, 2:57pm

@sangcho: Thanks for the clarification. Yes, I will create the issue.

daquang · July 7, 2022, 12:32am

Hi, I was wondering if this issue ever got resolved. I seem to have a similar issue, as detailed here: Local object store on worker nodes not working, worker plasma stays at 0%
