Ray Data job hangs

Description:
The job appears to be hanging with the following symptoms:

  1. Dashboard status:
     • Upstream operators of the Map operator have completed
     • The input queue is empty
     • 6 blocks remain unprocessed (seemingly assigned to actors)
     • 6 actors remain in the "alive" state
  2. Investigation details:
     • Actor stack traces show the actors are idle
     • This condition has persisted for 4 hours
     • All actors are alive but none of them is processing anything
  3. Hypothesis:
     Actors may lose their input blocks after a failure and restart: they come back alive but appear to have lost their assigned work.

Question:
Is it possible that actors lose their input parameters (e.g., input blocks) during failure recovery, leading to this hanging state?

Additional Context:
This appears to be a case where the execution state becomes inconsistent after actor restarts: the actors are alive but make no progress on their assigned blocks.

Hi @dragongu, can you provide a reproducible example? It would also be great to get one on the latest Ray release.

@rliaw Sorry, the job is quite complex and difficult to share completely. I've observed some abnormal behavior but haven't found the root cause yet.
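
The map stage looks roughly like the sketch below. Everything in it (the UDF, data sizes, and the fault injection) is a simplified placeholder rather than the real code; the random worker exit just stands in for whatever caused the real actors to fail and restart:

```python
# Simplified placeholder for the real pipeline: the UDF, data sizes, and the
# fault injection are illustrative only, not the actual job.
import os
import random

import ray


class Mapper:
    """Actor-based map UDF; Ray Data runs it in an actor pool."""

    def __call__(self, batch):
        if random.random() < 0.001:
            os._exit(1)  # simulate an abrupt worker crash -> actor restart
        batch["id"] = batch["id"] * 2
        return batch


ray.init()
ds = (
    ray.data.range(1_000_000)
    # 6 concurrent actors, matching the 6 "alive" actors seen in the dashboard.
    .map_batches(Mapper, concurrency=6, batch_size=1024)
)
ds.materialize()  # the real job hangs here with 6 blocks left unprocessed
```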

Using `ray memory`, I discovered some objects in `USED_BY_PENDING_TASK` and `PENDING_NODE_ASSIGNMENT` states, but there are no corresponding pending tasks or actors in the system.

```
2250246  Driver  disabled  FINISHED                 2  31969459.0 B  USED_BY_PENDING_TASK  f0ac7566e69f760affffffffffffffffffffffff0400000002000000
2250246  Driver  disabled  PENDING_NODE_ASSIGNMENT  2  33114529.0 B  LOCAL_REFERENCE       bc65baec1b8684b74c9efedabb9cc4d7960105960400000002000000
```
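
To double-check that nothing is actually pending on these objects, I cross-checked with the state API. A sketch of that check is below (assuming the Ray 2.x `ray.util.state` API; run it on the head node or pass an address):

```python
# Cross-check sketch (assumes the Ray 2.x state API); the object ref is the
# PENDING_NODE_ASSIGNMENT / LOCAL_REFERENCE entry reported by `ray memory`.
from ray.util.state import list_actors, list_objects, list_tasks

ref = "bc65baec1b8684b74c9efedabb9cc4d7960105960400000002000000"

# The object is still tracked by the cluster...
print(list_objects(filters=[("object_id", "=", ref)]))

# ...but there is no unfinished task that could be its pending consumer,
print(list_tasks(filters=[("state", "!=", "FINISHED")]))

# and the only live actors are the 6 idle map actors from the dashboard.
print(list_actors(filters=[("state", "=", "ALIVE")]))
```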

Some logs (the same pair of messages repeats roughly every 10 seconds):

```
[2025-04-03 10:27:35,060 D 2250246 2267355] task_manager.cc:502: Generator bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000 still has lineage in scope, try again later
[2025-04-03 10:27:45,062 D 2250246 2267355] core_worker.cc:3487: TryDelObjectRefStream from generator_ids_pending_deletion_ object_id=bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000
[2025-04-03 10:27:45,062 D 2250246 2267355] task_manager.cc:502: Generator bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000 still has lineage in scope, try again later
[2025-04-03 10:27:55,065 D 2250246 2267355] core_worker.cc:3487: TryDelObjectRefStream from generator_ids_pending_deletion_ object_id=bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000
[2025-04-03 10:27:55,065 D 2250246 2267355] task_manager.cc:502: Generator bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000 still has lineage in scope, try again later
[2025-04-03 10:28:05,068 D 2250246 2267355] core_worker.cc:3487: TryDelObjectRefStream from generator_ids_pending_deletion_ object_id=bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000
[2025-04-03 10:28:05,068 D 2250246 2267355] task_manager.cc:502: Generator bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000 still has lineage in scope, try again later
[2025-04-03 10:28:15,070 D 2250246 2267355] core_worker.cc:3487: TryDelObjectRefStream from generator_ids_pending_deletion_ object_id=bc65baec1b8684b74c9efedabb9cc4d7960105960400000001000000
```