[Core] Task Status Check Failure in Ray Data Job with Preempted Workers

dragongu · April 21, 2025, 11:05am

When running a simple Ray Data job where worker pods are frequently preempted, the following error occurs:

task_manager.cc:1412: Check failed: it->second.GetStatus() == rpc::TaskStatus::PENDING_NODE_ASSIGNMENT
task ID = 6ff6fe559f63b1b8b015cbfb3a695db0935ce25820000000 status = 1

It appears to be a multi-threaded state modification issue, but I haven’t located the exact line of code.
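To make the hypothesis concrete, here is a toy Python model of how a status check like this can trip when two code paths touch the same task entry. It is not Ray's code: `ToyTaskManager`, its methods, and the enum values are hypothetical, and the interleaving shown (a preemption handler resetting the task before submission is recorded) is only an assumption, not a confirmed root cause.

```python
import enum


class TaskStatus(enum.Enum):
    # Hypothetical subset of states. Names echo the error message;
    # the numeric values are illustrative and not taken from Ray.
    PENDING_ARGS_AVAIL = enum.auto()
    PENDING_NODE_ASSIGNMENT = enum.auto()
    SUBMITTED_TO_WORKER = enum.auto()


class ToyTaskManager:
    """Toy stand-in for a task table; not Ray's TaskManager."""

    def __init__(self):
        self._status = {}

    def add_pending(self, task_id):
        # Task is waiting to be assigned to a node.
        self._status[task_id] = TaskStatus.PENDING_NODE_ASSIGNMENT

    def on_node_preempted(self, task_id):
        # Assumed competing path: a failure handler moves the task
        # back to an earlier state so it can be retried.
        self._status[task_id] = TaskStatus.PENDING_ARGS_AVAIL

    def mark_submitted(self, task_id):
        current = self._status[task_id]
        # Analogue of the failing check: the caller expects the task
        # to still be in PENDING_NODE_ASSIGNMENT.
        if current is not TaskStatus.PENDING_NODE_ASSIGNMENT:
            raise RuntimeError(
                f"Check failed: task {task_id} status = {current.name}"
            )
        self._status[task_id] = TaskStatus.SUBMITTED_TO_WORKER

    def status(self, task_id):
        return self._status[task_id]


if __name__ == "__main__":
    tm = ToyTaskManager()
    tm.add_pending("t1")
    tm.on_node_preempted("t1")  # competing handler runs first
    try:
        tm.mark_submitted("t1")
    except RuntimeError as err:
        print(err)
```

If the real failure follows this pattern, the useful thing to look for in the logs is which event changed the task's status between scheduling and submission.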

Versions

2.43.0

christina · April 22, 2025, 10:47pm

Hi dragongu, do you have a way to reproduce this? There could be a few different reasons this is happening. Besides this error, are there any other errors? How are the resources (CPU / GPU / memory) for the tasks you're running? Are they sufficient?

dragongu · April 23, 2025, 2:43am

Yes, it is reproducible, even with abundant resources. For a related report, see GitHub issue #52530 in ray-project/ray: "[Core] task_manager.cc:1416: Check failed: it->second.GetStatus() == rpc::TaskStatus::PENDING_NODE_ASSIGNMENT".

