ray.exceptions.WorkerCrashedError

Getting this error when running a job on Ray on k8s. The job runs for ~7 hours, gets close to completion, and then crashes. Any ideas what’s going on?

Also, can anyone tell me where the python-core-worker logs are stored?

ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.

Hi, sorry to hear that. You can check /tmp/ray/ for the log files, in particular /tmp/ray/session_latest/logs/ for the logs from the most recent session. More details here: Logging — Ray v2.0.0.dev0
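If it helps, here is a minimal sketch for listing the core worker logs on a node. It assumes the default layout, where the most recent session is at /tmp/ray/session_latest and its logs sit in a logs/ subdirectory; the helper name find_core_worker_logs is made up for this example and is not part of Ray.

```python
import glob
import os


def find_core_worker_logs(session_dir="/tmp/ray/session_latest"):
    """Return python-core-worker log paths under a Ray session dir, newest first.

    The default session_dir is an assumption about the standard Ray layout;
    pass a different path if your temp dir was changed.
    """
    pattern = os.path.join(session_dir, "logs", "python-core-worker-*.log")
    # Sort by modification time so the most recently written log comes first.
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)


if __name__ == "__main__":
    for path in find_core_worker_logs():
        print(path)
```

Note that these files live on whichever node ran the worker, so this has to be run on that node (or pod), not only where the driver script runs.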

We are running Ray on k8s. We can’t access the log files on the pod because the pod fails, so we get: error: cannot exec into a container in a completed pod; current phase is Failed

We do see the logs on stdout using kubectl logs, though. Here’s the full traceback:

Traceback (most recent call last):
  File "main.py", line 193, in <module>
    results = ray.get(futures)
  File "/usr/local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 46, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/api.py", line 35, in get
    return self.worker.get(vals, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/worker.py", line 196, in get
    res = self._get(obj_ref, op_timeout)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/worker.py", line 219, in _get
    raise err
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
A worker died or was killed while executing task c4280fb54ad6b274ffffffffffffffffffffffff05000000.

Any ideas how to troubleshoot?

I see… @Dmitri any insights here on debugging with Kubernetes?

Logging is probably the most critical issue we haven’t addressed yet in adapting Ray for K8s. Posting a user guide for capturing logs is tracked here
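Until that guide exists, one possible workaround is to copy the worker logs somewhere that outlives the pod, for example a mounted persistent volume. The sketch below is only an illustration under assumptions: the destination /mnt/ray-logs is a hypothetical mount point, preserve_worker_logs is a made-up helper (not a Ray API), and the log directory is assumed to be the default /tmp/ray/session_latest/logs.

```python
import glob
import os
import shutil


def preserve_worker_logs(dest_dir, log_dir="/tmp/ray/session_latest/logs"):
    """Copy python-core-worker logs from log_dir into dest_dir.

    Returns the sorted list of destination paths that were written.
    Both paths are assumptions; adjust them to your pod spec.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for src in glob.glob(os.path.join(log_dir, "python-core-worker-*.log")):
        # copy2 keeps timestamps, which helps when correlating with the crash.
        copied.append(shutil.copy2(src, dest_dir))
    return sorted(copied)


if __name__ == "__main__":
    # Hypothetical mount point for a persistent volume.
    for path in preserve_worker_logs("/mnt/ray-logs"):
        print("saved", path)
```

This would need to run on the pod where the worker actually died, and it cannot help if the whole container is killed before the copy happens, so running it periodically (or from a sidecar) is probably safer than running it once at exit.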

@tgaddair could have some insight on debugging failures of long-running Ray jobs on K8s
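In the meantime, for long-running jobs it may help to resolve the futures one at a time, so that a single crashed worker does not throw away results that already finished. This is just a sketch, not an official pattern: gather_tolerating_crashes is a hypothetical helper, and the getter and exception types are passed in so it does not assume anything about your Ray version.

```python
def gather_tolerating_crashes(futures, get, crash_errors=(Exception,)):
    """Resolve each future separately and record failures instead of raising.

    futures: list of object refs (or any handles that `get` understands).
    get: callable resolving one future, e.g. ray.get.
    crash_errors: tuple of exception types to tolerate.

    Returns (results, failed), where results has None at failed positions
    and failed is a list of (index, exception) pairs.
    """
    results, failed = [], []
    for index, future in enumerate(futures):
        try:
            results.append(get(future))
        except crash_errors as err:
            results.append(None)
            failed.append((index, err))
    return results, failed


# Possible usage with Ray (assumes `futures` comes from your own main.py):
#   import ray
#   results, failed = gather_tolerating_crashes(
#       futures, ray.get, (ray.exceptions.WorkerCrashedError,)
#   )
#   for index, err in failed:
#       print(f"task {index} crashed: {err}")
```

The indices in failed tell you which tasks to resubmit or inspect, which at least narrows down where to look once the logs are available.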