How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
Has anyone ever encountered such a situation? Here are some relevant logs showing that the supposedly dead worker (PID 253) is still running after the raylet reports it dead and spins up a new worker to retry the task. I also verified externally that the "dead" worker was still running, because I could see requests coming from it in another service it was calling.
(_ray_query pid=253, ip=10.216.195.5) 2024-02-28 11:43:29,024|INFO|.... app logging
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 03ed4e0da3e7d01091cab6ee06591ae5a492dfee35000000 Worker ID: 553663a8f8d04d87d6582249470b6c814e18ee1d8b8a5ff020c69393 Node ID: cdd464a4d5397e4f15f082f65fbeb0e42f4472cd03ad71e30e9554d0 Worker IP address: 10.216.195.5 Worker port: 10002 Worker PID: 253 Worker exit type: SYSTEM_ERROR Worker exit detail: The leased worker has unrecoverable failure. Worker is requested to be destroyed when it is returned. RPC Error message: Socket closed; RPC Error details:
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffb3ec3185ceeb075b483f763835000000 Worker ID: 2d9b8f403a7be2d68d5372a5bbbd4146bf66ab440f74153c2c40cbfe Node ID: cdd464a4d5397e4f15f082f65fbeb0e42f4472cd03ad71e30e9554d0 Worker IP address: 10.216.195.5 Worker port: 10003 Worker PID: 367 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly by a signal. SystemExit is raised (sys.exit is called). Exit code: 1. The process receives a SIGTERM.
(_ray_query pid=963, ip=10.216.195.5) 2024-02-28 11:44:05,771|INFO|.... app logging
(_ray_query pid=963, ip=10.216.195.5) 2024-02-28 11:44:06,108|INFO|.... app logging
(_ray_query pid=253, ip=10.216.195.5) 2024-02-28 11:44:13,159|INFO|.... app logging
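On top of the application log lines above (note PID 253 still logging after the raylet's death report), here is a minimal sketch of how one could also double-check on the node itself that the process is alive. This is purely illustrative and assumes shell access to the worker node with psutil installed; it is not how I originally verified it:

```python
import psutil

# PID of the worker the raylet reported as dead (from the logs above).
pid = 253

# If the raylet were right, this PID should be gone; a live status here
# (e.g. "running" or "sleeping") means the process is still around.
if psutil.pid_exists(pid):
    p = psutil.Process(pid)
    print(pid, p.status(), " ".join(p.cmdline()))
else:
    print(pid, "not running")
```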
Also, this worker (PID 253) kept running for about an hour after I stopped all jobs on the cluster. Eventually, the worker node died on its own.
Update: This seems to happen consistently: the raylet reports the worker as dead once a single long-running task has been executing for about 1 hour 10 minutes. I'm guessing we're hitting some timeout?
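For reference, here is a rough repro sketch of the kind of workload that seems to trigger this. It is only a sketch under my assumptions: my real task does actual work rather than sleeping, and the ~75 minute duration just stands in for the ~1h10m mark where the raylet reports the worker dead in my runs, so I can't promise a bare sleep reproduces it:

```python
import time
import ray

ray.init()

# Stand-in for the real long-running task; 75 minutes is just past the
# ~1h10m point where the raylet reports the worker dead in my cluster.
@ray.remote(max_retries=0)  # disable retries so the failure surfaces instead of being retried
def long_task():
    time.sleep(75 * 60)
    return "done"

# In my setup the raylet reports the worker dead (and, with retries enabled,
# spins up a new worker to rerun the task) well before this call returns.
print(ray.get(long_task.remote()))
```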