ray.exceptions.WorkerCrashedError

Getting this error when running a job on Ray on k8s. The job runs for ~7 hours, gets close to completion, and then decides to puke out. Any ideas what’s going on?

Also, can anyone tell me where the python-core-worker logs are stored?

ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.

Hi, sorry to hear that 🙁 You can check /tmp/ray/ for the log files, in particular /tmp/ray/session_latest/ for the logs from the most recent session. More details here: Logging — Ray v2.0.0.dev0
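
If you can still open a shell on the node or pod, a quick way to spot the crash reason is to print the tail of each core-worker log. A minimal sketch, assuming the default /tmp/ray location:

import glob
import os

# Print the last lines of every python-core-worker log in the current session.
log_dir = "/tmp/ray/session_latest/logs"
for path in sorted(glob.glob(os.path.join(log_dir, "python-core-worker-*.log"))):
    with open(path) as f:
        tail = f.readlines()[-20:]
    print(f"=== {path} ===")
    print("".join(tail), end="")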

We are running Ray on k8s. We can’t access the log files on the pod because it has already failed, so kubectl gives us: error: cannot exec into a container in a completed pod; current phase is Failed

We do see the logs sent to stdout via kubectl logs though; here’s the full traceback:

Traceback (most recent call last):
  File "main.py", line 193, in <module>
    results = ray.get(futures)
  File "/usr/local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 46, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/api.py", line 35, in get
    return self.worker.get(vals, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/worker.py", line 196, in get
    res = self._get(obj_ref, op_timeout)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/worker.py", line 219, in _get
    raise err
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
A worker died or was killed while executing task c4280fb54ad6b274ffffffffffffffffffffffff05000000.

Any ideas how to troubleshoot?

I see… @Dmitri any insights here on debugging with Kubernetes?
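
In the meantime, one thing that helps with long jobs like this is making the tasks retryable and collecting results as they finish instead of in one big ray.get at the end, so a single crashed worker doesn’t throw away hours of work. A rough sketch (process_chunk and the chunking are placeholders for whatever main.py actually does):

import ray
from ray.exceptions import WorkerCrashedError

ray.init()  # or however you normally connect, e.g. via the Ray client

# Placeholder task standing in for the real workload.
@ray.remote(max_retries=3)  # re-run the task if its worker process dies
def process_chunk(chunk):
    return sum(chunk)

chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
futures = [process_chunk.remote(c) for c in chunks]

# Fetch results incrementally so completed work isn't lost if one task
# keeps crashing even after its retries are exhausted.
results = []
remaining = futures
while remaining:
    done, remaining = ray.wait(remaining, num_returns=1)
    try:
        results.extend(ray.get(done))
    except WorkerCrashedError:
        print(f"Task {done[0]} still failed after retries; skipping it")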

Logging is probably the most critical issue we haven’t addressed yet in adapting Ray for K8s. Posting a user guide for capturing logs is tracked here

@tgaddair could have some insight on debugging failures of long-running Ray jobs on K8s

Hey @Dmitri , we use Promtail + Loki to collect all logs from Ray workers so they can be used after a worker pod crashes.
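
For anyone curious what that looks like conceptually, here’s a bare-bones sketch of the idea rather than our actual setup: a sidecar-style loop that tails the Ray log directory and pushes new lines to Loki’s HTTP push API. The Loki URL, labels, and polling interval are made up for illustration.

import glob
import time

import requests  # assumes requests is available in the sidecar image

LOKI_URL = "http://loki.monitoring.svc:3100/loki/api/v1/push"  # placeholder
LOG_GLOB = "/tmp/ray/session_latest/logs/*.log"

def push_lines(filename, lines):
    # Ship a batch of log lines to Loki as one stream labeled with the file name.
    ts = str(time.time_ns())
    payload = {
        "streams": [{
            "stream": {"job": "ray-worker", "filename": filename},
            "values": [[ts, line.rstrip("\n")] for line in lines],
        }]
    }
    requests.post(LOKI_URL, json=payload, timeout=5)

# Remember how far we've read in each file and only ship the new lines.
offsets = {}
while True:
    for path in glob.glob(LOG_GLOB):
        with open(path) as f:
            f.seek(offsets.get(path, 0))
            new_lines = f.readlines()
            offsets[path] = f.tell()
        if new_lines:
            push_lines(path, new_lines)
    time.sleep(5)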

Ah, yeah, we do more or less the same to collect logs internally at Anyscale.
For anyone else who lands here:
The key thing to know is that Ray logs live in /tmp/ray/session_latest/logs in the Ray container.
Log processing tools like Promtail/Loki can be used to scrape and export these logs.
The docs for the upcoming Ray 2.0.0 will have some guidance on this.

Some useful docs here

Thanks @davidxia! Precisely the docs I was referring to.