Ray version is 2.6.3, using ray.data.Dataset.map.
The job died with the Ray exception below:
Traceback (most recent call last):
File “pipeline_big_data.py”, line 128, in <module>
for row in ds.iter_rows():
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/iterator.py”, line 237, in iter_rows
for batch in self.iter_batches(**iter_batch_args):
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/iterator.py”, line 189, in iter_batches
yield from iter_batches(
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/iter_batches.py”, line 176, in iter_batches
next_batch = next(async_batch_iter)
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/util.py”, line 289, in make_async_gen
raise next_item
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/util.py”, line 266, in execute_computation
for item in fn(thread_safe_generator):
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/iter_batches.py”, line 167, in _async_iter_batches
yield from extract_data_from_batch(batch_iter)
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/util.py”, line 210, in extract_data_from_batch
for batch in batch_iter:
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/iter_batches.py”, line 306, in restore_original_order
for batch in batch_iter:
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/iter_batches.py”, line 218, in threadpool_computations_format_collate
yield from formatted_batch_iter
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/util.py”, line 158, in format_batches
for batch in block_iter:
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/util.py”, line 117, in blocks_to_batches
for block in block_iter:
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/util.py”, line 54, in resolve_block_refs
for block_ref in block_ref_iter:
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/iter_batches.py”, line 254, in prefetch_batches_locally
for block_ref, metadata in block_ref_iter:
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/block_batching/util.py”, line 246, in next
return next(self.it)
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py”, line 51, in execute_to_legacy_block_iterator
for bundle in bundle_iter:
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces.py”, line 548, in next
return self.get_next()
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py”, line 129, in get_next
raise item
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py”, line 187, in run
while self._scheduling_loop_step(self._topology) and not self._shutdown:
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py”, line 235, in _scheduling_loop_step
process_completed_tasks(topology)
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor_state.py”, line 333, in process_completed_tasks
op.notify_work_completed(ref)
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py”, line 219, in notify_work_completed
task.output = self._map_ref_to_ref_bundle(ref)
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_operator.py”, line 357, in _map_ref_to_ref_bundle
all_refs = list(ray.get(ref))
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/_private/auto_init_hook.py”, line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py”, line 103, in wrapper
return func(*args, **kwargs)
File “/opt/conda/envs/myenv/lib/python3.8/site-packages/ray/_private/worker.py”, line 2526, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
The log from my Ray driver's worker node is below. It seems to hit a network error when submitting the job, but the task is not retried:
[2024-03-12 21:24:32,523 I 101 671] task_manager.cc:831: task 04eef53e7d8b29573e5f449cfb8db4d80a93a1c805000000 retries left: 0, oom retries left: 0, task failed due to oom: 0
[2024-03-12 21:24:32,523 I 101 671] task_manager.cc:847: No retries left for task 04eef53e7d8b29573e5f449cfb8db4d80a93a1c805000000, not going to resubmit.
[2024-03-12 21:24:32,523 I 101 671] direct_actor_task_submitter.cc:563: PushActorTask failed because of network error, this task will be stashed away and waiting for Death info from GCS, task_id=04eef53e7d8b29573e5f449cfb8db4d80a93a1c805000000, wait_queue_size=1
[2024-03-12 21:24:32,526 I 101 671] actor_manager.cc:214: received notification on actor, state: DEAD, actor_id: 3e5f449cfb8db4d80a93a1c805000000, ip address: 21.29.121.163, port: 10026, worker_id: afc4b052cc8eeb7afec8041ba5f937e97f06d8f6e5a421a0942e64f4, raylet_id: 96461e5de9e94bef3f4fb84cb561b33225280b2e041c30ce80ecf372, num_restarts: 0, death context type=ActorDiedErrorContext
[2024-03-12 21:24:32,526 I 101 671] direct_actor_task_submitter.cc:286: Failing pending tasks for actor 3e5f449cfb8db4d80a93a1c805000000 because the actor is already dead.
[2024-03-12 21:24:32,526 I 101 671] task_manager.cc:899: Task failed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.dashboard.modules.job.job_manager, class_name=JobSupervisor, function_name=run, function_hash=}, task_id=04eef53e7d8b29573e5f449cfb8db4d80a93a1c805000000, task_name=JobSupervisor.run, job_id=05000000, num_args=4, num_returns=1, depth=1, attempt_number=0, actor_task_spec={actor_id=3e5f449cfb8db4d80a93a1c805000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff05000000, actor_counter=0}
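To make the log easier to triage, here is a rough sketch of how these lines could be scanned for tasks that had no retries left and for actors reported as DEAD. The regular expressions only assume the line format pasted above (Ray 2.6.3) and may not match other versions. The `summarize` helper and `SAMPLE` list are names made up for this illustration and are not part of Ray.

```python
import re

# Patterns assume the log format shown in this post (Ray 2.6.3); other versions may differ.
RETRY_RE = re.compile(
    r"task (?P<task_id>[0-9a-f]+) retries left: (?P<retries>-?\d+), "
    r"oom retries left: (?P<oom_retries>-?\d+)"
)
DEAD_ACTOR_RE = re.compile(
    r"state: DEAD, actor_id: (?P<actor_id>[0-9a-f]+), "
    r"ip address: (?P<ip>[\d.]+), port: (?P<port>\d+)"
)


def summarize(log_lines):
    """Collect task IDs with zero retries left and actors reported as DEAD."""
    exhausted, dead_actors = [], []
    for line in log_lines:
        m = RETRY_RE.search(line)
        if m and int(m["retries"]) == 0:
            exhausted.append(m["task_id"])
        m = DEAD_ACTOR_RE.search(line)
        if m:
            dead_actors.append((m["actor_id"], m["ip"], int(m["port"])))
    return exhausted, dead_actors


# Two lines copied from the log above; the second is cut off after the port field.
SAMPLE = [
    "[2024-03-12 21:24:32,523 I 101 671] task_manager.cc:831: task "
    "04eef53e7d8b29573e5f449cfb8db4d80a93a1c805000000 retries left: 0, "
    "oom retries left: 0, task failed due to oom: 0",
    "[2024-03-12 21:24:32,526 I 101 671] actor_manager.cc:214: received "
    "notification on actor, state: DEAD, actor_id: "
    "3e5f449cfb8db4d80a93a1c805000000, ip address: 21.29.121.163, port: 10026",
]

if __name__ == "__main__":
    exhausted, dead_actors = summarize(SAMPLE)
    print("tasks with no retries left:", exhausted)
    print("dead actors:", dead_actors)
```

On the lines above, this picks out task 04eef53e7d8b29573e5f449cfb8db4d80a93a1c805000000 (the JobSupervisor.run task in the last log line) and actor 3e5f449cfb8db4d80a93a1c805000000, which matches what the log says: the task had 0 retries left when the actor was reported dead.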