I'm getting this error when running a job on Ray on k8s. The job runs for ~7 hours, gets close to completion, and then crashes. Any ideas what's going on?
Also, can anyone tell me where the python-core-worker logs are stored?
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
Hi, sorry to hear that. You can check /tmp/ray/ for the log files, in particular /tmp/ray/session_latest/ for the logs from the most recent session. More details are in the "Logging" page of the Ray v2.0.0.dev0 docs.
We are running Ray with k8s. We can't access the log files on the pod because the pod fails, so we get: error: cannot exec into a container in a completed pod; current phase is Failed
We do see the logs on stdout using kubectl logs, though. Here's the full traceback:
Traceback (most recent call last):
  File "main.py", line 193, in <module>
    results = ray.get(futures)
  File "/usr/local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 46, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/api.py", line 35, in get
    return self.worker.get(vals, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/worker.py", line 196, in get
    res = self._get(obj_ref, op_timeout)
  File "/usr/local/lib/python3.7/site-packages/ray/util/client/worker.py", line 219, in _get
    raise err
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
A worker died or was killed while executing task c4280fb54ad6b274ffffffffffffffffffffffff05000000.
Ah, yeah, we do more or less the same to collect logs internally at Anyscale.
For anyone else who lands here:
The key thing to know is that Ray logs live in /tmp/ray/session_latest/logs in the Ray container.
Log processing tools like Promtail/Loki can be used to scrape and export these logs.
The docs for the upcoming Ray 2.0.0 will have some guidance on this.
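If setting up a log scraper is more than you need, a rough stopgap is to echo the tail of the worker logs to stdout so that kubectl logs can still show them after the pod has failed. The sketch below uses only the Python standard library and is not a Ray API. The helper names tail_worker_logs and print_worker_logs are made up for illustration, and it assumes the script runs inside the Ray container as part of the container's main process (for example from the entrypoint), which is wiring you would need to add yourself.

```python
import glob
import os
from collections import deque

# Log directory mentioned in this thread; adjust if your Ray temp dir differs.
RAY_LOG_DIR = "/tmp/ray/session_latest/logs"


def tail_worker_logs(log_dir=RAY_LOG_DIR,
                     pattern="python-core-worker-*.log",
                     max_lines=50):
    """Return {file name: last max_lines lines} for each matching log file.

    Hypothetical helper for illustration; not part of Ray.
    """
    tails = {}
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        # errors="replace" keeps going if a log contains undecodable bytes.
        with open(path, "r", errors="replace") as handle:
            # deque with maxlen keeps only the final max_lines lines.
            tails[os.path.basename(path)] = list(deque(handle, maxlen=max_lines))
    return tails


def print_worker_logs(log_dir=RAY_LOG_DIR, max_lines=50):
    """Echo the log tails to stdout, one labelled section per file."""
    for name, lines in tail_worker_logs(log_dir, max_lines=max_lines).items():
        print(f"===== {name} =====")
        print("".join(lines), end="")


if __name__ == "__main__":
    print_worker_logs()
```

Note that the worker logs live on whichever Ray node ran the crashed worker, so this only helps if it runs in that node's container, not on the machine running the Ray client driver.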