How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I got the following errors for a few workers on the same host:
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,711 E 19383 19841] core_worker.cc:3453: Mismatched ActorID: ignoring KillActor for previous actor e1ff5d4bb50986d70bb3b67f09000000, current actor ID: NIL_ID
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,712 E 19381 19867] core_worker.cc:3453: Mismatched ActorID: ignoring KillActor for previous actor a188b27dc89171301117eb5f09000000, current actor ID: NIL_ID
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,716 E 19382 19898] core_worker.cc:3453: Mismatched ActorID: ignoring KillActor for previous actor 8ee6bbf1adf174446731555009000000, current actor ID: NIL_ID
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,718 E 19380 19891] core_worker.cc:3453: Mismatched ActorID: ignoring KillActor for previous actor f50b7be1c2a7a8727660714509000000, current actor ID: NIL_ID
(raylet, ip=xx.xx.xx.xx) *** SIGTERM received at time=1685427760 on cpu 124 ***
(raylet, ip=xx.xx.xx.xx) *** SIGTERM received at time=1685427760 on cpu 68 ***
(raylet, ip=xx.xx.xx.xx) *** SIGTERM received at time=1685427760 on cpu 6 ***
(raylet, ip=xx.xx.xx.xx) *** SIGTERM received at time=1685427760 on cpu 82 ***
(raylet, ip=xx.xx.xx.xx) PC: @ 0x7fd25171ea3d (unknown) syscall
(raylet, ip=xx.xx.xx.xx) PC: @ 0x7feea291ea3d (unknown) syscall
(raylet, ip=xx.xx.xx.xx) @ 0x7fd251642520 (unknown) (unknown)
(raylet, ip=xx.xx.xx.xx) @ 0x7feea2842520 (unknown) (unknown)
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19380 19380] logging.cc:361: *** SIGTERM received at time=1685427760 on cpu 68 ***
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19380 19380] logging.cc:361: PC: @ 0x7feea291ea3d (unknown) syscall
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19380 19380] logging.cc:361: @ 0x7feea2842520 (unknown) (unknown)
(raylet, ip=xx.xx.xx.xx) PC: @ 0x7f9f0091ea3d (unknown) syscall
(raylet, ip=xx.xx.xx.xx) PC: @ 0x7ff67331ea3d (unknown) syscall
(raylet, ip=xx.xx.xx.xx) @ 0x7f9f00842520 (unknown) (unknown)
(raylet, ip=xx.xx.xx.xx) @ 0x7ff673242520 (unknown) (unknown)
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19381 19381] logging.cc:361: *** SIGTERM received at time=1685427760 on cpu 82 ***
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19381 19381] logging.cc:361: PC: @ 0x7f9f0091ea3d (unknown) syscall
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19382 19382] logging.cc:361: *** SIGTERM received at time=1685427760 on cpu 6 ***
(raylet, ip=xx.xx.xx.xx) @ 0x441f0f000000 (unknown) (unknown)
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19382 19382] logging.cc:361: PC: @ 0x7ff67331ea3d (unknown) syscall
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19381 19381] logging.cc:361: @ 0x7f9f00842520 (unknown) (unknown)
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19383 19383] logging.cc:361: *** SIGTERM received at time=1685427760 on cpu 124 ***
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19383 19383] logging.cc:361: PC: @ 0x7fd25171ea3d (unknown) syscall
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19382 19382] logging.cc:361: @ 0x7ff673242520 (unknown) (unknown)
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19383 19383] logging.cc:361: @ 0x7fd251642520 (unknown) (unknown)
(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,728 E 19383 19383] logging.cc:361: @ 0x441f0f000000 (unknown) (unknown)
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffe1ff5d4bb50986d70bb3b67f09000000 Worker ID: 4bb2b590c522bd2f3fac7410ba4cdbb7d246309dea2ed5115490af05 Node ID: 64c04170b02e68878fc22bbb4c4473181b5dc9cc66893f3063e8bb76 Worker IP address: xx.xx.xx.xx Worker port: 10154 Worker PID: 19383 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff8ee6bbf1adf174446731555009000000 Worker ID: 39c2060757162ac95d96d09b64320c13d126569c3c73a6c358cfca45 Node ID: 64c04170b02e68878fc22bbb4c4473181b5dc9cc66893f3063e8bb76 Worker IP address: xx.xx.xx.xx Worker port: 10158 Worker PID: 19382 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: fffffffffffffffff50b7be1c2a7a8727660714509000000 Worker ID: 11a2f5e6dd71b7f50c53175d756f9e90032493c8c4eecc0f2ef07de8 Node ID: 64c04170b02e68878fc22bbb4c4473181b5dc9cc66893f3063e8bb76 Worker IP address: xx.xx.xx.xx Worker port: 10156 Worker PID: 19380 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffa188b27dc89171301117eb5f09000000 Worker ID: d74aa73662fa0ce06934e92af026b383c964dbbd0c5da2febb980759 Node ID: 64c04170b02e68878fc22bbb4c4473181b5dc9cc66893f3063e8bb76 Worker IP address: xx.xx.xx.xx Worker port: 10155 Worker PID: 19381 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
I'm not sure where to start looking into this. The suggestion about being OOM-killed doesn't make sense to me, because the amount of data ingested by the worker host is far less than the available RAM (on the order of 1 GiB on a virtual host with 1 TiB of memory).
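
One way to cross-check the OOM hypothesis against the output above is to match each dead worker's PID with the signals that the same PID logged. The Linux OOM killer sends SIGKILL, which a process cannot catch or log, so a worker that logged SIGTERM just before exiting was more likely asked to terminate by something else. Below is a minimal sketch of that cross-reference; the `summarize` helper, the regular expressions, and the `SAMPLE` lines (shortened copies of the lines above) are illustrative assumptions, not part of Ray.

```python
import re
from collections import defaultdict

# Assumed layout of the C++ log lines above:
# "[date time E <pid> <tid>] file.cc:line: message"
GLOG = re.compile(r"\[\S+ \S+ [A-Z] (?P<pid>\d+) (?P<tid>\d+)\] (?P<src>\S+): (?P<msg>.*)")
SIGNAL = re.compile(r"\*\*\* (?P<sig>SIG[A-Z]+) received")
DEAD = re.compile(r"Worker PID: (?P<pid>\d+) Worker exit type: (?P<exit_type>\w+)")

# Shortened copies of lines from the post, used only to exercise the helper.
SAMPLE = [
    "(raylet, ip=xx.xx.xx.xx) [2023-05-30 06:22:40,727 E 19383 19383] "
    "logging.cc:361: *** SIGTERM received at time=1685427760 on cpu 124 ***",
    "A worker died or was killed while executing a task by an unexpected "
    "system error. Worker port: 10154 Worker PID: 19383 "
    "Worker exit type: SYSTEM_ERROR Worker exit detail: ...",
]


def summarize(lines):
    """Map each dead worker PID to (exit type, signals that PID logged)."""
    signals = defaultdict(set)
    exits = {}
    for line in lines:
        log = GLOG.search(line)
        if log:
            sig = SIGNAL.search(log.group("msg"))
            if sig:
                signals[int(log.group("pid"))].add(sig.group("sig"))
        dead = DEAD.search(line)
        if dead:
            exits[int(dead.group("pid"))] = dead.group("exit_type")
    return {pid: (kind, sorted(signals.get(pid, ()))) for pid, kind in exits.items()}


if __name__ == "__main__":
    for pid, (kind, sigs) in summarize(SAMPLE).items():
        print(pid, kind, sigs)
```

This only shows which signal each worker reported; it does not identify who sent the SIGTERM, so the sender would still need to be found in the raylet or driver logs.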