ray.exceptions.WorkerCrashedError

Getting this error when running a job on Ray on k8s. The job runs for ~7 hours, gets close to completion, and then decides to puke out. Any ideas what’s going on?

Also, can anyone tell me where the python-core-worker logs are stored?

ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.

Hi, sorry to hear that 🙁 You can check /tmp/ray/ for the log files, in particular /tmp/ray/session_latest/ for the logs from the most recent session. More details here: Logging — Ray v2.0.0.dev0
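
If you can still open a shell on the node or pod, a quick way to spot the crash reason is to print the tail of each core-worker log. A minimal sketch, assuming the default /tmp/ray location:

import glob
import os

# Print the last lines of every python-core-worker log in the current session.
log_dir = "/tmp/ray/session_latest/logs"
for path in sorted(glob.glob(os.path.join(log_dir, "python-core-worker-*.log"))):
    with open(path) as f:
        tail = f.readlines()[-20:]
    print(f"=== {path} ===")
    print("".join(tail), end="")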

We are running Ray on k8s. We can’t access the log files on the pod because it has already failed, so kubectl gives us: error: cannot exec into a container in a completed pod; current phase is Failed

We do see the logs sent to stdout via kubectl logs though; here’s the full traceback:

Traceback (most recent call last):
  File "main.py", line 193, in <module>
    results = ray.get(futures)
  File "/usr/local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 46, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/api.py", line 35, in get
    return self.worker.get(vals, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/worker.py", line 196, in get
    res = self._get(obj_ref, op_timeout)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/worker.py", line 219, in _get
    raise err
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
A worker died or was killed while executing task c4280fb54ad6b274ffffffffffffffffffffffff05000000.

Any ideas how to troubleshoot?

I see… @Dmitri any insights here on debugging with Kubernetes?
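
In the meantime, one thing that helps with long jobs like this is making the tasks retryable and collecting results as they finish instead of in one big ray.get at the end, so a single crashed worker doesn’t throw away hours of work. A rough sketch (process_chunk and the chunking are placeholders for whatever main.py actually does):

import ray
from ray.exceptions import WorkerCrashedError

ray.init()  # or however you normally connect, e.g. via the Ray client

# Placeholder task standing in for the real workload.
@ray.remote(max_retries=3)  # re-run the task if its worker process dies
def process_chunk(chunk):
    return sum(chunk)

chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
futures = [process_chunk.remote(c) for c in chunks]

# Fetch results incrementally so completed work isn't lost if one task
# keeps crashing even after its retries are exhausted.
results = []
remaining = futures
while remaining:
    done, remaining = ray.wait(remaining, num_returns=1)
    try:
        results.extend(ray.get(done))
    except WorkerCrashedError:
        print(f"Task {done[0]} still failed after retries; skipping it")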

Logging is probably the most critical issue we haven’t addressed yet in adapting Ray for K8s. Posting a user guide for capturing logs is tracked here

@tgaddair could have some insight on debugging failures of long-running Ray jobs on K8s

Hey @Dmitri , we use Promtail + Loki to collect all logs from Ray workers so they can be used after a worker pod crashes.
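
For anyone curious what that looks like conceptually, here’s a bare-bones sketch of the idea rather than our actual setup: a sidecar-style loop that tails the Ray log directory and pushes new lines to Loki’s HTTP push API. The Loki URL, labels, and polling interval are made up for illustration.

import glob
import time

import requests  # assumes requests is available in the sidecar image

LOKI_URL = "http://loki.monitoring.svc:3100/loki/api/v1/push"  # placeholder
LOG_GLOB = "/tmp/ray/session_latest/logs/*.log"

def push_lines(filename, lines):
    # Ship a batch of log lines to Loki as one stream labeled with the file name.
    ts = str(time.time_ns())
    payload = {
        "streams": [{
            "stream": {"job": "ray-worker", "filename": filename},
            "values": [[ts, line.rstrip("\n")] for line in lines],
        }]
    }
    requests.post(LOKI_URL, json=payload, timeout=5)

# Remember how far we've read in each file and only ship the new lines.
offsets = {}
while True:
    for path in glob.glob(LOG_GLOB):
        with open(path) as f:
            f.seek(offsets.get(path, 0))
            new_lines = f.readlines()
            offsets[path] = f.tell()
        if new_lines:
            push_lines(path, new_lines)
    time.sleep(5)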

Ah, yeah, we do more or less the same to collect logs internally at Anyscale.
For anyone else who lands here:
The key thing to know is that Ray logs live in /tmp/ray/session_latest/logs in the Ray container.
Log processing tools like Promtail/Loki can be used to scrape and export these logs.
The docs for the upcoming Ray 2.0.0 will have some guidance on this.

Some useful docs here

Thanks @davidxia! Precisely the docs I was referring to.