Job logs deleted once worker pods are scaled down

Hi!

I’ve set up an autoscaling Ray cluster on Kubernetes. Once a worker pod is scaled down, all logs of the job it executed disappear and fail to load in the dashboard. While the worker is still active, the logs show up in the dashboard without issue. How can I make these logs persist in the dashboard even after the worker pods are scaled down?

I’ve looked into the dashboard.log file. When I open the job in the dashboard, the log file shows the following error:

2024-08-12 09:26:03,139 ERROR state_head.py:442 -- Agent for node id: 55d55be6c8aac4448383b3261a479aa01a10d6319658f67c1cf43ad1 doesn't exist.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/modules/state/state_head.py", line 430, in get_logs
    async for logs_in_bytes in self._log_api.stream_logs(options):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/modules/log/log_manager.py", line 113, in stream_logs
    stream = await self.client.stream_log(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/state/state_manager.py", line 92, in api_with_network_error_handler
    return await func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/state/state_manager.py", line 466, in stream_log
    raise ValueError(f"Agent for node id: {node_id} doesn't exist.")
ValueError: Agent for node id: 55d55be6c8aac4448383b3261a479aa01a10d6319658f67c1cf43ad1 doesn't exist.

Edit: related issue - How to retrieve a dead node logs - Monitoring & Debugging - Ray. Is this not supported / planned? :frowning:

I’d really appreciate some help with this issue!

We have plans to handle this scenario via an export API that Ray users can explicitly call. See here: [REP] Ray Export API. by MissiontoMars · Pull Request #55 · ray-project/enhancements · GitHub
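
Until then, one possible interim workaround is to snapshot the job logs off the head node into durable storage before the worker pods disappear, e.g. via the Ray Jobs API. Below is a minimal sketch; the dashboard address `http://127.0.0.1:8265` and the output directory `/persistent/ray-job-logs` are placeholders you would replace with your own setup, and this only captures the driver/job output collected on the head node, not the raw per-worker log files that the dashboard fails to fetch from the scaled-down node.

```python
from pathlib import Path

from ray.job_submission import JobSubmissionClient

# Assumptions: the Ray dashboard is reachable at this address, and
# /persistent/ray-job-logs is backed by durable storage (e.g. a PVC).
DASHBOARD_URL = "http://127.0.0.1:8265"
OUTPUT_DIR = Path("/persistent/ray-job-logs")

client = JobSubmissionClient(DASHBOARD_URL)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Snapshot the driver logs of every job the head node knows about.
for job in client.list_jobs():
    if job.submission_id is None:
        # Jobs not submitted through the Jobs API have no submission id.
        continue
    logs = client.get_job_logs(job.submission_id)
    (OUTPUT_DIR / f"{job.submission_id}.log").write_text(logs)
```

Running something like this on a schedule (or right after job completion) keeps the driver-side logs around; the per-worker log files under `/tmp/ray/session_*/logs` would still need to be shipped off the worker pods (for example with a logging sidecar) before the pods are removed.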
