ray.exceptions.OwnerDiedError: Failed to retrieve object

I'm using the daily Ray build on a Mac with a manually started cluster (1 head node and 1 worker node), and I tried to load a dataset of 40M rows (ray.data.dataset). The data gets loaded, but the following errors are logged:

(DataLoadWorker pid=15003) [2021-11-24 17:20:05,789 E 15003 626393] core_worker.h:1110: Mismatched WorkerID: ignoring RPC for previous worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff, current worker ID: f56704bbf468f00ffacadf47bf7254d5fbd8d25a019e6d7c2a2331e5
(DataLoadWorker pid=15003) [2021-11-24 17:20:05,818 E 15003 626393] core_worker.h:1110: Mismatched WorkerID: ignoring RPC for previous worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff, current worker ID: f56704bbf468f00ffacadf47bf7254d5fbd8d25a019e6d7c2a2331e5

When I try to access the data, it throws the following exception:

ERROR: ray::DataLoadWorker.get_pa_table() (pid=15004, ip=192.168.1.69, repr=<data_load_worker.DataLoadWorker object at 0x119848ed0>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.OwnerDiedError: Failed to retrieve object ffffffffffffffffffffffffffffffffffffffff01000000b4000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

It works if I run it in local mode (single node), and it also works for smaller workloads in cluster mode.

Can someone please explain what RAY_record_ref_creation_sites=1 means? And how do I fix this issue?

@sangcho or someone from the Ray team: could you please provide feedback on this issue? Thanks.

Setting the environment variable RAY_record_ref_creation_sites=1 makes Ray record the call site of each object (where the object is created) so that it can be printed. For example, if you start Ray with ray start:

RAY_record_ref_creation_sites=1 ray start --head
RAY_record_ref_creation_sites=1 python [driver]

Then you can see where the object was created. It is disabled by default for performance reasons, since it incurs high overhead.
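If you would rather launch the driver from Python than from the shell, here is a minimal sketch of the same idea. The helper names (env_with_ref_sites, run_driver) and the script path are hypothetical and not part of Ray; the only thing taken from the error message is the variable name. The cluster processes still need the variable set at ray start, as in the commands above.

```python
import os
import subprocess
import sys

# Variable name taken from the OwnerDiedError message.
REF_SITES_VAR = "RAY_record_ref_creation_sites"


def env_with_ref_sites(base=None):
    """Return a copy of the environment with call-site recording enabled.

    `base` defaults to the current process environment; it is never mutated.
    """
    env = dict(os.environ if base is None else base)
    env[REF_SITES_VAR] = "1"
    return env


def run_driver(script_path):
    """Run a driver script (hypothetical path) with the variable set.

    Equivalent to: RAY_record_ref_creation_sites=1 python <script_path>
    """
    return subprocess.run(
        [sys.executable, script_path],
        env=env_with_ref_sites(),
        check=True,
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical wrapper file name): python run_with_ref_sites.py my_driver.py
    run_driver(sys.argv[1])
```

Since recording is expensive, I would only enable it while debugging and start the cluster and driver without it afterwards.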

So you are saying you see this issue when you use ray.data, right? In that case, it is very likely a bug. Would you mind creating an issue on Ray's GitHub page?

@sangcho: Thanks for the clarification. Yes, I will create the issue.

Hi, I was wondering whether this issue ever got resolved. I seem to have a similar issue, as detailed here: "Local object store on worker nodes not working, worker plasma stays at 0%"