Distinguishing between two causes for worker death

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hello everyone,

I am still a beginner when it comes to using Ray but I am loving it so far!
I am currently using Ray locally on my machine for parallel processing, with ray.init() and the @ray.remote decorator.
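To give a bit of context, my setup looks roughly like the sketch below (the body of the task is a placeholder; in my real code it calls into an external C/C++ library, which is where the segfaults come from):

```python
import ray

ray.init()  # local, single-machine cluster

@ray.remote
def process_item(item):
    # Placeholder computation. In my actual code this calls an external
    # library that occasionally segfaults, which I cannot change.
    return item * item

refs = [process_item.remote(i) for i in range(100)]
results = ray.get(refs)
```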
Sometimes my worker processes die due to segmentation faults (these happen inside the external library code, so they are outside my control). When that happens, the driver process prints out this message:

(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 348024dc0f90c07d3664b7d03789bacbb94ab71a01000000 Worker ID: 1dee067ccfb59803e4f0e9d017756125e2eb61b02dd30c07ebfcfce0 Node ID: 3cf007159321fe7b43c2c0b4a3f4596d84e8b1b3c4f43cf72a448feb Worker IP address: 193.136.178.219 Worker port: 36273 Worker PID: 3891954 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Here, the worker logs show these lines at the end:

[2024-08-13 11:27:19,837 E 3891954 3891954] logging.cc:440: *** SIGSEGV received at time=1723544839 on cpu 24 ***
[2024-08-13 11:27:19,837 E 3891954 3891954] logging.cc:440: PC: @ 0x7467beec7947 (unknown) (unknown)
[2024-08-13 11:27:19,839 E 3891954 3891954] logging.cc:440: @ 0x746a315c0a89 64 absl::lts_20230802::AbslFailureSignalHandler()
[2024-08-13 11:27:19,839 E 3891954 3891954] logging.cc:440: @ 0x746a32a42520 (unknown) (unknown)

However, some other times I get this message:

(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: fcb973917d779576e6c06fd1c1e25f02f29a7c3201000000 Worker ID: f782d2a587cb592c2388f37973e97e9feb0a9ef05f1ece04ee8b4a3c Node ID: 3cf007159321fe7b43c2c0b4a3f4596d84e8b1b3c4f43cf72a448feb Worker IP address: 193.136.178.219 Worker port: 45545 Worker PID: 3848725 Worker exit type: SYSTEM_ERROR Worker exit detail: The leased worker has unrecoverable failure. Worker is requested to be destroyed when it is returned. RPC Error message: Socket closed; RPC Error details:

Looking at the worker logs, I see these lines at the end:

[2024-08-13 11:05:21,807 E 3848725 3848725] logging.cc:440: *** SIGSEGV received at time=1723543521 on cpu 1 ***
[2024-08-13 11:05:21,807 E 3848725 3848725] logging.cc:440: PC: @ 0x708351c1ed86 (unknown) (unknown)
[2024-08-13 11:05:21,809 E 3848725 3848725] logging.cc:440: @ 0x7085d4dc0a89 64 absl::lts_20230802::AbslFailureSignalHandler()
[2024-08-13 11:05:21,809 E 3848725 3848725] logging.cc:440: @ 0x7085d6242520 (unknown) (unknown)

The cause of worker death seems to be the same signal (SIGSEGV) in both cases, but the driver-side error messages are different.
I do not really understand the difference between the two scenarios and was hoping someone could enlighten me, so I can make sure my workers are dying for the expected reason (the segfault) and not something else.
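In case it matters, on the driver side I currently just catch the failure around ray.get. If I understand the docs correctly, a worker killed by SIGSEGV surfaces as ray.exceptions.WorkerCrashedError, and both scenarios above look identical from that point of view (sketch only, names simplified):

```python
import ray
from ray.exceptions import WorkerCrashedError

@ray.remote
def process_item(item):
    # Placeholder for the external library call that can segfault.
    return item * item

ray.init()
try:
    result = ray.get(process_item.remote(42))
except WorkerCrashedError as err:
    # Both crash scenarios above appear to raise the same exception here,
    # so I cannot distinguish them at this level.
    print(f"Worker crashed: {err}")
```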

Thank you for the assistance!