Description:
The job appears to be hanging with the following symptoms:
Dashboard Status:
- Upstream operators of the Map operator have completed
- The input queue is empty
- 6 blocks remain unprocessed (seemingly assigned to actors)
- 6 actors remain in the “alive” state

Investigation Details:
- Actor stack traces show the actors are idle
- This condition has persisted for 4 hours
- All actors are alive but not processing any blocks (see the cross-check sketch below)
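The idle-actor observation can be cross-checked with the Ray state API. Below is a rough sketch, not the exact commands I ran; it assumes `ray.util.state` (Ray 2.7+) and field names such as `state` and `actor_id`, which may differ slightly between Ray versions:

```python
# Sketch: cross-check alive actors against tasks currently running on them.
# Assumes ray.util.state (Ray >= 2.7) and the field names used below.
from ray.util.state import list_actors, list_tasks

alive_actors = list_actors(filters=[("state", "=", "ALIVE")])
running_tasks = list_tasks(filters=[("state", "=", "RUNNING")])

# Map actor_id -> names of tasks currently running on that actor.
tasks_by_actor = {}
for t in running_tasks:
    if t.actor_id:
        tasks_by_actor.setdefault(t.actor_id, []).append(t.name)

for a in alive_actors:
    running = tasks_by_actor.get(a.actor_id, [])
    print(a.class_name, a.actor_id, "running:", running or "nothing")
```

In the hanging job, every Map actor prints "nothing", matching what the dashboard and the stack traces show.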
Hypothesis:
Actors may lose their input blocks after a failure and restart: the restarted actors remain alive but appear to have lost their assigned work.
Question:
Is it possible that actors lose their input parameters (e.g., input blocks) during failure recovery, leading to this hanging state?
Additional Context:
This appears to be a case where the execution state becomes inconsistent after actor restarts: the actors are alive but make no progress on their assigned blocks.
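The real job cannot be shared, but it is roughly the shape sketched below: a Ray Data actor-pool `map_batches` stage. The dataset, batch size, and the crash injection are illustrative only; the sketch is just meant to probe whether blocks in flight on a crashed actor get rescheduled.

```python
# Rough sketch of the suspected failure mode (illustrative, not the real job):
# an actor-pool map_batches stage where one actor process crashes mid-run.
import os
import ray
import ray.data

class Mapper:
    def __call__(self, batch):
        # Crude one-shot fault injection (per node): crash this actor process
        # once, then let the restarted actor run normally. The open question is
        # whether the blocks that were in flight on the crashed actor are
        # rescheduled or silently dropped.
        marker = "/tmp/mapper_crashed_once"
        if not os.path.exists(marker):
            open(marker, "w").close()
            os._exit(1)
        return batch

ray.init()
ds = ray.data.range(100_000).map_batches(
    Mapper,
    concurrency=6,   # mirrors the 6 actors seen on the dashboard (recent Ray Data API)
    batch_size=1024,
)
print(ds.count())    # a hang here would match the reported symptom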
@rliaw Sorry, the job is quite complex and difficult to share completely. I've observed some abnormal phenomena but haven't found the root cause.
Using `ray memory`, I discovered some objects in the USED_BY_PENDING_TASK and PENDING_NODE_ASSIGNMENT states, but there are no corresponding pending tasks or actors in the system:
```
2250246  Driver  disabled  FINISHED                 2  31969459.0 B  USED_BY_PENDING_TASK  f0ac7566e69f760affffffffffffffffffffffff0400000002000000
2250246  Driver  disabled  PENDING_NODE_ASSIGNMENT  2  33114529.0 B  LOCAL_REFERENCE       bc65baec1b8684b74c9efedabb9cc4d7960105960400000002000000
```
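To dig into those entries, a cross-check of the pinned objects against the task list looks roughly like the sketch below. It assumes `ray.util.state.list_objects`/`list_tasks` and field names such as `reference_type`, `task_status`, and `call_site`; adjust for the Ray version in use.

```python
# Sketch: find objects pinned as inputs of a "pending" task and check whether
# any matching non-finished task actually exists. Field names are assumptions.
from ray.util.state import list_objects, list_tasks

pinned = [
    o for o in list_objects(limit=10_000)
    if o.reference_type == "USED_BY_PENDING_TASK"
    or o.task_status == "PENDING_NODE_ASSIGNMENT"
]
for o in pinned:
    print(o.object_id, o.reference_type, o.task_status, o.call_site)

# If these objects really are inputs of a pending task, a corresponding task
# should show up here; in the hanging job, nothing does.
unfinished = [
    t for t in list_tasks(limit=10_000)
    if t.state not in ("FINISHED", "FAILED")
]
print("non-finished tasks:", len(unfinished))
```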
Some logs:
```
[2025-04-03 10:27:35,060 D 2250246 2267355] task_manager.cc:502: Generator bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000 still has lineage in scope, try again later
[2025-04-03 10:27:45,062 D 2250246 2267355] core_worker.cc:3487: TryDelObjectRefStream from generator_ids_pending_deletion_ object_id=bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000
[2025-04-03 10:27:45,062 D 2250246 2267355] task_manager.cc:502: Generator bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000 still has lineage in scope, try again later
[2025-04-03 10:27:55,065 D 2250246 2267355] core_worker.cc:3487: TryDelObjectRefStream from generator_ids_pending_deletion_ object_id=bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000
[2025-04-03 10:27:55,065 D 2250246 2267355] task_manager.cc:502: Generator bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000 still has lineage in scope, try again later
[2025-04-03 10:28:05,068 D 2250246 2267355] core_worker.cc:3487: TryDelObjectRefStream from generator_ids_pending_deletion_ object_id=bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000
[2025-04-03 10:28:05,068 D 2250246 2267355] task_manager.cc:502: Generator bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000 still has lineage in scope, try again later
[2025-04-03 10:28:15,070 D 2250246 2267355] core_worker.cc:3487: TryDelObjectRefStream from generator_ids_pending_deletion_ object_id=bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000
```
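The TryDelObjectRefStream / "still has lineage in scope" pair repeats every ~10 seconds, so the driver seems stuck retrying deletion of a streaming-generator ref that lineage keeps in scope. A rough sketch of watching whether that ref is ever released (again assuming `ray.util.state.list_objects` and its `object_id`/`reference_type`/`task_status` fields):

```python
# Sketch: poll the state API to see whether the stuck generator ref
# (the object id from the logs above) is ever released.
import time
from ray.util.state import list_objects

STUCK_ID = "bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000"

while True:
    matches = [o for o in list_objects(limit=10_000) if o.object_id == STUCK_ID]
    if not matches:
        print("ref released")
        break
    for o in matches:
        print(time.strftime("%H:%M:%S"), o.reference_type, o.task_status)
    time.sleep(30)
```

So far the ref never clears, which is consistent with the retry loop in the driver log.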