Ray worker reported dead from an unrecoverable error, but it actually keeps running

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Has anyone ever encountered such a situation? Here are some relevant logs showing that the supposedly dead worker (PID 253) is still running after the raylet reports it as dead and has spun up a new worker to retry the task. I also verified externally that the dead worker was still running, because I could see requests coming from it in another service it was calling.

(_ray_query pid=253, ip=10.216.195.5) 2024-02-28 11:43:29,024|INFO|.... app logging

(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 03ed4e0da3e7d01091cab6ee06591ae5a492dfee35000000 Worker ID: 553663a8f8d04d87d6582249470b6c814e18ee1d8b8a5ff020c69393 Node ID: cdd464a4d5397e4f15f082f65fbeb0e42f4472cd03ad71e30e9554d0 Worker IP address: 10.216.195.5 Worker port: 10002 Worker PID: 253 Worker exit type: SYSTEM_ERROR Worker exit detail: The leased worker has unrecoverable failure. Worker is requested to be destroyed when it is returned. RPC Error message: Socket closed; RPC Error details: 

(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffb3ec3185ceeb075b483f763835000000 Worker ID: 2d9b8f403a7be2d68d5372a5bbbd4146bf66ab440f74153c2c40cbfe Node ID: cdd464a4d5397e4f15f082f65fbeb0e42f4472cd03ad71e30e9554d0 Worker IP address: 10.216.195.5 Worker port: 10003 Worker PID: 367 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly by a signal. SystemExit is raised (sys.exit is called). Exit code: 1. The process receives a SIGTERM.

(_ray_query pid=963, ip=10.216.195.5) 2024-02-28 11:44:05,771|INFO|.... app logging

(_ray_query pid=963, ip=10.216.195.5) 2024-02-28 11:44:06,108|INFO|.... app logging

(_ray_query pid=253, ip=10.216.195.5) 2024-02-28 11:44:13,159|INFO|.... app logging

Also, this worker (PID 253) kept running for about an hour after I stopped all jobs on the cluster. Eventually, the worker node died on its own.
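For reference, a minimal sketch of how the "still alive" check can be done from a shell on the worker node itself (this assumes you can get onto the node; os.kill(pid, 0) only probes for process existence and sends no signal):

import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists on this node."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


print(pid_alive(253))  # the PID the raylet reported as dead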

Update: This happens consistently: the raylet reports the worker as dead after a single long-running task has been executing for about 1 hour 10 minutes. I'm guessing we're hitting some timeout?

@yic could you take a look?

Hi @Jimmy_Cao, it seems like a bug. Would you mind sharing your script for this?

If not, can you share the raylet logs and the dead worker's logs? They should be in /tmp/ray/session_latest/ on your local node.
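Something like this rough sketch should list the relevant files (assuming the default temp dir; worker log filenames usually include the worker PID, 253 in your case):

import glob

log_dir = "/tmp/ray/session_latest/logs"  # default location; adjust if you use a custom temp dir
for pattern in ("raylet.*", "*worker*253*"):  # raylet logs plus logs for the dead worker PID
    for path in sorted(glob.glob(f"{log_dir}/{pattern}")):
        print(path)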

Thanks! I can’t share the exact script, but let me see if I can put together a minimal repro. I don’t have the logs anymore, as I had to restart my cluster.

This reproduces the bug; I’m not sure if the Actor is needed:

import time
from collections import defaultdict
from typing import DefaultDict

import ray


@ray.remote(num_cpus=1)
class StatsActor:
    def __init__(self) -> None:
        self._stats: DefaultDict[str, int] = defaultdict(int)

    def increment(self, stat_key: str) -> None:
        self._stats[stat_key] += 1


@ray.remote
def long_running_task():
    actor_ref = StatsActor.remote()
    # Run for ~2 hours; the worker is reported dead around the 1h 10min mark.
    for idx in range(2 * 60 * 60):
        if idx % 100 == 0:
            print("im alive!")
        # Do some work at ~30 qps and increment a counter on the actor
        # (fire-and-forget; the returned ObjectRefs are never retrieved).
        for _ in range(30):
            time.sleep(1 / 30)
            actor_ref.increment.remote("somekey")


ray.init()
# Timeout is intentionally longer than the task's run time.
ray.get(long_running_task.remote(), timeout=3 * 60 * 60)