Job logs deleted once worker pods are scaled down

Hi!

I’ve set up an autoscaling Ray cluster on Kubernetes. Once a worker pod is scaled down, all logs of the job it executed disappear and fail to load in the dashboard. While the worker is still active, the logs show up in the dashboard without issue. How can I make these logs persist in the dashboard even after the worker pods are scaled down?

I’ve looked into the dashboard.log file. When I open the job in the dashboard, the log file shows the following error:

2024-08-12 09:26:03,139 ERROR state_head.py:442 -- Agent for node id: 55d55be6c8aac4448383b3261a479aa01a10d6319658f67c1cf43ad1 doesn't exist.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/modules/state/state_head.py", line 430, in get_logs
    async for logs_in_bytes in self._log_api.stream_logs(options):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/modules/log/log_manager.py", line 113, in stream_logs
    stream = await self.client.stream_log(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/state/state_manager.py", line 92, in api_with_network_error_handler
    return await func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/state/state_manager.py", line 466, in stream_log
    raise ValueError(f"Agent for node id: {node_id} doesn't exist.")
ValueError: Agent for node id: 55d55be6c8aac4448383b3261a479aa01a10d6319658f67c1cf43ad1 doesn't exist.

Edit: related issue - How to retrieve a dead node logs - Monitoring & Debugging - Ray. Is this not supported / planned? :frowning:

I’d really appreciate some help with this issue!

We have plans to handle this scenario via an export API that Ray users can explicitly call. See here: [REP] Ray Export API. by MissiontoMars · Pull Request #55 · ray-project/enhancements · GitHub
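
Until then, one possible interim workaround is to snapshot the job logs off the head node into durable storage before the worker pods disappear, e.g. via the Ray Jobs API. Below is a minimal sketch; the dashboard address `http://127.0.0.1:8265` and the output directory `/persistent/ray-job-logs` are placeholders you would replace with your own setup, and this only captures the driver/job output collected on the head node, not the raw per-worker log files that the dashboard fails to fetch from the scaled-down node.

```python
from pathlib import Path

from ray.job_submission import JobSubmissionClient

# Assumptions: the Ray dashboard is reachable at this address, and
# /persistent/ray-job-logs is backed by durable storage (e.g. a PVC).
DASHBOARD_URL = "http://127.0.0.1:8265"
OUTPUT_DIR = Path("/persistent/ray-job-logs")

client = JobSubmissionClient(DASHBOARD_URL)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Snapshot the driver logs of every job the head node knows about.
for job in client.list_jobs():
    if job.submission_id is None:
        # Jobs not submitted through the Jobs API have no submission id.
        continue
    logs = client.get_job_logs(job.submission_id)
    (OUTPUT_DIR / f"{job.submission_id}.log").write_text(logs)
```

Running something like this on a schedule (or right after job completion) keeps the driver-side logs around; the per-worker log files under `/tmp/ray/session_*/logs` would still need to be shipped off the worker pods (for example with a logging sidecar) before the pods are removed.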
