When does a `Worker` fail to set `core_worker`?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Dear Ray community,
I am getting the following error:

(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff809f99822304543a1e3cced901000000 Worker ID: d653a51c0223abe8aa902ab8067201e7f2fcbc8c6b89fcbe93150737 Node ID: 16d6c7e905ea8267d00a2779373ed4e0a2e17bd874f8b0b801c93033 Worker IP address: 127.0.0.1 Worker port: 62559 Worker PID: 23061 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(TorchTrainer pid=23052) Worker 1 has failed.
(RayTrainWorker pid=23062) [rank1]: Traceback (most recent call last):
(RayTrainWorker pid=23062) [rank1]:   File "python/ray/_raylet.pyx", line 2251, in ray._raylet.task_execution_handler
(RayTrainWorker pid=23062) [rank1]:   File "python/ray/_raylet.pyx", line 2082, in ray._raylet.execute_task_with_cancellation_handler
(RayTrainWorker pid=23062) [rank1]: AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=23062)
(RayTrainWorker pid=23062) [rank1]: During handling of the above exception, another exception occurred:
(RayTrainWorker pid=23062)
(RayTrainWorker pid=23062) [rank1]: Traceback (most recent call last):
(RayTrainWorker pid=23062) [rank1]:   File "python/ray/_raylet.pyx", line 2290, in ray._raylet.task_execution_handler
(RayTrainWorker pid=23062) [rank1]:   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/utils.py", line 178, in push_error_to_driver
(RayTrainWorker pid=23062) [rank1]:     worker.core_worker.push_error(job_id, error_type, message, time.time())
(RayTrainWorker pid=23062) [rank1]:     ^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=23062) [rank1]: AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=23062) Exception ignored in: 'ray._raylet.task_execution_handler'
(RayTrainWorker pid=23062) Traceback (most recent call last):
(RayTrainWorker pid=23062)   File "python/ray/_raylet.pyx", line 2290, in ray._raylet.task_execution_handler
(RayTrainWorker pid=23062)   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/utils.py", line 178, in push_error_to_driver
(RayTrainWorker pid=23062)     worker.core_worker.push_error(job_id, error_type, message, time.time())
(RayTrainWorker pid=23062)     ^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=23062) AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=23062) [2024-09-30 14:32:06,779 C 23062 2807883] task_receiver.cc:213:  Check failed: objects_valid

which means that the `Worker` does not have `core_worker` set. When does this happen? I see that `core_worker` is set by the `connect` function in ray/python/ray/_private/worker.py (at commit 073d143c62e24f931812c6f27243974506a7049c in ray-project/ray), so why am I getting this error? Does it mean that this function is not called somewhere?

Thx

This seems like a bug, and it usually happens during shutdown. Which version of Ray are you using, and is there a way to reproduce this?

The latest one. Here is the issue on GitHub: Ray core: `AttributeError: 'Worker' object has no attribute 'core_worker'` (Issue #47759, ray-project/ray). Unfortunately, I cannot release the code publicly. Is there something I can do to help you debug the problem? Or how can I debug it myself?

Unfortunately, this is a pretty tricky issue. The problem is that the program is exiting and the destruction of the worker is not properly coordinated, so this error can occur during shutdown.
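To illustrate the kind of ordering problem described here, below is a toy sketch in plain Python. It does not use Ray: `ToyWorker`, `push_error`, and `safe_push_error` are hypothetical names, and the assumption that teardown removes the attribute is only a simplified model of what may be happening, not a description of Ray's internals.

```python
class ToyWorker:
    """Toy stand-in for a worker object; NOT Ray's actual Worker class."""

    def connect(self):
        # The attribute only exists once connect() has run.
        self.core_worker = object()

    def disconnect(self):
        # Assumption for this sketch: teardown removes the attribute again.
        if hasattr(self, "core_worker"):
            del self.core_worker


def push_error(worker, message):
    # Same shape as the failing call in the traceback: it assumes the
    # attribute is present and raises AttributeError if it is not.
    return (worker.core_worker, message)


def safe_push_error(worker, message):
    # Defensive variant: tolerate a worker that is not connected.
    core_worker = getattr(worker, "core_worker", None)
    if core_worker is None:
        return None
    return (core_worker, message)


worker = ToyWorker()
worker.connect()
worker.disconnect()  # teardown happens first ...

try:
    push_error(worker, "late error")  # ... then late code touches the attribute
    outcome = "no error"
except AttributeError as exc:
    outcome = str(exc)

print(outcome)
```

In this model the error is only a symptom: the attribute access is fine, but it happens after teardown. That is why the order of events at shutdown matters more than the line that raises.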

Two things that would be really helpful:

  1. Find the first version that doesn’t have this issue (to check whether it is a regression).
  2. Create a minimal repro that doesn’t use your code, and create an issue with it.
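The first step can be organized as a small search over versions. The sketch below is only a skeleton under stated assumptions: `newest_version_without_issue` is a hypothetical helper, the version labels are placeholders rather than real Ray releases, and `has_issue` is a stub. In practice it would install the given version into a fresh environment and run the minimal repro.

```python
from typing import Callable, Optional, Sequence


def newest_version_without_issue(
    versions_newest_first: Sequence[str],
    has_issue: Callable[[str], bool],
) -> Optional[str]:
    """Walk back from the newest version; return the first one that is clean."""
    for version in versions_newest_first:
        if not has_issue(version):
            return version
    # Every tested version shows the issue, so it may not be a regression.
    return None


# Placeholder labels, NOT real release numbers.
candidates = ["v5", "v4", "v3", "v2", "v1"]


def stub_has_issue(version: str) -> bool:
    # Stub for illustration only: pretend the two newest versions fail.
    return version in {"v5", "v4"}


result = newest_version_without_issue(candidates, stub_has_issue)
print(result)
```

A linear walk is used instead of bisection because a crash at shutdown may be intermittent, so it is worth running the repro several times per version before marking that version as clean.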