[Core] Task Status Check Failure in Ray Data Job with Preempted Workers

When running a simple Ray Data job where worker pods are frequently preempted, the following error occurs:

task_manager.cc:1412: Check failed: it->second.GetStatus() == rpc::TaskStatus::PENDING_NODE_ASSIGNMENT
task ID = 6ff6fe559f63b1b8b015cbfb3a695db0935ce25820000000 status = 1

This looks like a race condition in which multiple threads modify the task's state concurrently, but I haven't located the exact line of code.
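To make the suspected failure mode concrete, here is a minimal toy sketch of how a status check like the one above could trip when a second code path moves a task's state between two steps. Everything in it is hypothetical: `ToyTaskManager`, its methods, the `TaskStatus` values, and the assumed "retry after preemption resets the state" path are illustrations only, not Ray's actual implementation or the confirmed root cause. The interleaving is simulated sequentially so the outcome is deterministic.

```python
import enum


class TaskStatus(enum.Enum):
    # Hypothetical stand-in; values do NOT mirror Ray's rpc::TaskStatus numbering.
    PENDING_ARGS_AVAIL = enum.auto()
    PENDING_NODE_ASSIGNMENT = enum.auto()
    SUBMITTED_TO_WORKER = enum.auto()


class ToyTaskManager:
    """Toy task table guarded by a status invariant (not Ray's TaskManager)."""

    def __init__(self):
        self.status = {}

    def add_pending(self, task_id):
        self.status[task_id] = TaskStatus.PENDING_ARGS_AVAIL

    def mark_dependencies_resolved(self, task_id):
        self.status[task_id] = TaskStatus.PENDING_NODE_ASSIGNMENT

    def retry_after_preemption(self, task_id):
        # Assumption: a retry path resets the task to its initial state.
        self.status[task_id] = TaskStatus.PENDING_ARGS_AVAIL

    def mark_submitted(self, task_id):
        # Analogue of the failing check: the caller assumes no other
        # code path changed the status since dependencies were resolved.
        current = self.status[task_id]
        if current is not TaskStatus.PENDING_NODE_ASSIGNMENT:
            raise RuntimeError(
                f"Check failed: task {task_id} status = {current.name}"
            )
        self.status[task_id] = TaskStatus.SUBMITTED_TO_WORKER


def run_interleaving(preempted):
    """Run the toy sequence, optionally injecting the preemption retry."""
    manager = ToyTaskManager()
    manager.add_pending("t1")
    manager.mark_dependencies_resolved("t1")
    if preempted:
        # Simulates another thread handling a worker preemption
        # between dependency resolution and submission.
        manager.retry_after_preemption("t1")
    try:
        manager.mark_submitted("t1")
    except RuntimeError as exc:
        return str(exc)
    return manager.status["t1"].name


print(run_interleaving(preempted=False))
print(run_interleaving(preempted=True))
```

Without the injected retry the task reaches `SUBMITTED_TO_WORKER`; with it, the check fails because the status was changed underneath the caller. If the real bug has this shape, the fix would involve either serializing the two paths or tolerating the earlier state in the check, but that is speculation until the offending code path is identified.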

Versions

2.43.0

Hi dragongu, do you have a way to reproduce this? There could be a few different reasons this is happening. Besides this error, are there any other errors? Also, how are the resources (CPU / GPU / memory) for the tasks you're running? Are they sufficient?

It is reproducible even with abundant resources. For a related issue, see ray-project/ray issue #52530: [Core] task_manager.cc:1416: Check failed: it->second.GetStatus() == rpc::TaskStatus::PENDING_NODE_ASSIGNMENT