When running a simple Ray Data job where worker pods are frequently preempted, the following error occurs:
task_manager.cc:1412: Check failed: it->second.GetStatus() == rpc::TaskStatus::PENDING_NODE_ASSIGNMENT
task ID = 6ff6fe559f63b1b8b015cbfb3a695db0935ce25820000000 status = 1
This looks like a race condition in which multiple threads modify the task's state concurrently, so the status check fails, but I have not yet located the exact line of code responsible.
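To illustrate the kind of interleaving I suspect, here is a minimal sketch in plain Python. It does not use Ray and is not taken from task_manager.cc. The `TaskTable` class, the `CheckFailed` exception, the `replay_interleaving` helper, and the status names are all hypothetical stand-ins, and the two "threads" are replayed sequentially so the failure is reproducible. It only shows how a check-then-act sequence can trip a status assertion when a preemption handler resets the task in between.

```python
import threading
from enum import Enum, auto


class TaskStatus(Enum):
    # Hypothetical stand-ins; the numeric values do NOT match Ray's rpc::TaskStatus.
    PENDING_ARGS_AVAIL = auto()
    PENDING_NODE_ASSIGNMENT = auto()
    SUBMITTED_TO_WORKER = auto()


class CheckFailed(RuntimeError):
    """Raised when a status precondition does not hold (stand-in for a fatal check)."""


class TaskTable:
    """Toy task table. Each method is locked, but sequences of calls are not atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def add(self, task_id):
        with self._lock:
            self._status[task_id] = TaskStatus.PENDING_NODE_ASSIGNMENT

    def get_status(self, task_id):
        with self._lock:
            return self._status[task_id]

    def reset_for_retry(self, task_id):
        # Assumed behaviour: a preempted node sends the task back to an earlier state.
        with self._lock:
            self._status[task_id] = TaskStatus.PENDING_ARGS_AVAIL

    def mark_submitted(self, task_id):
        with self._lock:
            status = self._status[task_id]
            if status is not TaskStatus.PENDING_NODE_ASSIGNMENT:
                raise CheckFailed(f"task ID = {task_id} status = {status.name}")
            self._status[task_id] = TaskStatus.SUBMITTED_TO_WORKER


def replay_interleaving(preempt):
    """Replay the suspected ordering step by step; return the error text or None."""
    table = TaskTable()
    table.add("task-1")

    # Step 1 (submitter thread): decides to submit based on the status it sees now.
    observed = table.get_status("task-1")

    # Step 2 (failure-handler thread): a preemption resets the task before step 3.
    if preempt:
        table.reset_for_retry("task-1")

    # Step 3 (submitter thread): acts on the now possibly stale decision.
    try:
        if observed is TaskStatus.PENDING_NODE_ASSIGNMENT:
            table.mark_submitted("task-1")
    except CheckFailed as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(replay_interleaving(preempt=False))
    print(replay_interleaving(preempt=True))
```

In this sketch the check only fails when the reset lands between the read in step 1 and the write in step 3, which would be consistent with the error appearing mainly when pods are preempted frequently. Whether the real code path has this shape is an assumption on my part.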
Versions
2.43.0