1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.23
- Python version: 3.10
- OS:
- Cloud/Infrastructure: managed k8s
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: MapBatches(BatchPredictor) should run for all the blocks.
- Actual: The job fails abruptly with ObjectFetchTimedOutError after running for about 6 hours. Each Parquet file is approximately 14 MB, and there are a large number of files (approximately 17K).
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 4625, in materialize
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/exceptions.py", line 86, in handle_trace
2025-06-07 08:21:45,736 INFO cli.py:83 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object 1ae8e0c85369413affffffffffffffffffffffff0400000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(MapWorker(MapBatches(BatchPredictor)) pid=9532, ) length of batch 256 [repeated 7x across cluster]
Fetch for object 1ae8e0c85369413affffffffffffffffffffffff0400000002000000 timed out because no locations were found for the object. This may indicate a system-level bug.
2025-06-07 08:21:45,736 ERR cli.py:70 -- ---------------------------------------------
 raise e.with_traceback(None) from SystemException()
 copy._plan.execute(force_read=True)
ray.exceptions.RayTaskError(ObjectFetchTimedOutError): ray::MapBatches(BatchPredictor)() (pid=9288, ip=, actor_id=6a0d3d23a3c4bbc5c91b323804000000, repr=MapWorker(MapBatches(BatchPredictor)))
 At least one of the input arguments for this task could not be computed:
 File "/app/inference.py", line 233, in <module>
 raise e.with_traceback(None) from SystemException()
 copy._plan.execute(force_read=True)
The above exception was the direct cause of the following exception:
2025-06-07 08:21:45,735 ERR cli.py:68 -- ---------------------------------------------
ray.exceptions.RayTaskError(ObjectFetchTimedOutError): ray::MapBatches(BatchPredictor)() (pid=9288, ip=, actor_id=6a0d3d23a3c4bbc5c91b323804000000, repr=MapWorker(MapBatches(BatchPredictor)))
 (ds.map_batches(BatchPredictor,
- Write: 0 active, 0 queued, [cpu: 0.0, objects: 0.0B]: 100%|█████████▉| 304/305 [6:07:43<01:40, 100.38s/it]
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 2818, in write_parquet
 self.write_datasink(
(MapWorker(MapBatches(BatchPredictor)) pid=9532, ip=) length of batch 256 [repeated 7x across cluster]
2025-06-07 08:21:40,093 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
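
The error message above suggests setting RAY_record_ref_creation_sites=1 to see where the missing ObjectRef was created. Below is a minimal sketch of how that could be done from the driver script before the next run. The helper name `enable_ref_creation_sites` is hypothetical and not part of Ray; it only sets the environment variable named in the log. Per the message, the same variable would also need to be exported in the environment where `ray start` runs on each node, which this sketch does not cover.

```python
import os


def enable_ref_creation_sites(env=None):
    """Set the variable named in the Ray error message.

    `env` defaults to os.environ; a plain dict can be passed instead
    (for example, to build the environment for a subprocess).
    Hypothetical helper for illustration, not a Ray API.
    """
    target = os.environ if env is None else env
    # Value "1" is taken from the error message text.
    target["RAY_record_ref_creation_sites"] = "1"
    return target


# Intended usage in the driver (assumption: called before ray.init()):
#   enable_ref_creation_sites()
#   ray.init()
```

With the variable set on both the driver and the nodes, the next failure should include the Python call site that created the object, which can help narrow down which stage produced the block that could not be located.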