1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: v2.44.0
- Python version: 3.11
- OS: Ubuntu 22.04 via KubeRay deployed in GKE cluster
- Cloud/Infrastructure: GCP + GKE
- Other libs/tools (if relevant):
3. Repro steps / sample code: (optional, but helps a lot!)
Head node configured with

    headGroupSpec:
      rayStartParams:
        num-cpus: '0'

to prevent any tasks from being scheduled on it.
Workers are managed by autoscaler v2, with the default number of worker pods set to 0. All worker pods run on GKE Spot instances.
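For context, a minimal sketch of a RayCluster manifest matching this description. The cluster name, group name, `maxReplicas` value, and image tag are placeholders rather than the actual values from this deployment, and the `RAY_enable_autoscaler_v2` environment variable is an assumption about how autoscaler v2 is switched on for this Ray/KubeRay combination:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster            # placeholder name
spec:
  enableInTreeAutoscaling: true    # let KubeRay run the Ray autoscaler
  headGroupSpec:
    rayStartParams:
      num-cpus: '0'                # keep tasks off the head node
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: ray-head
            image: rayproject/ray:2.44.0
            env:
              - name: RAY_enable_autoscaler_v2   # assumed way of enabling autoscaler v2
                value: "1"
  workerGroupSpecs:
    - groupName: spot-workers      # placeholder name
      replicas: 0
      minReplicas: 0               # scale to zero when idle
      maxReplicas: 10              # placeholder upper bound
      rayStartParams: {}
      template:
        spec:
          restartPolicy: Never
          nodeSelector:
            cloud.google.com/gke-spot: "true"    # schedule workers on GKE Spot nodes
          containers:
            - name: ray-worker
              image: rayproject/ray:2.44.0
```

With `minReplicas: 0`, every worker pod, including the one that hosted the job entrypoint, is eligible for removal once it is idle.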
When a new job is scheduled, the Ray cluster creates a new worker pod and runs the job entrypoint there. This pod becomes the driver for the job and, by default, collects logs from all other pods that execute tasks associated with the job.
However, with the autoscaler enabled, this first worker pod is deleted as soon as it is no longer in use, and all logs from the job are deleted along with it. The dashboard then reports that the logs can no longer be loaded, which is expected, because the corresponding Ray node is also down.
There is a documentation page that suggests using Fluent Bit and similar tools to collect the logs from the running pods and store them in a persistent place such as GCP Logging or AWS CloudWatch, but the same page also shows how to view these logs in an external viewer (Loki) and not in the dashboard.
4. What happened vs. what you expected:
- Expected: I want to be able to see job logs after the driver pod is terminated
- Actual: I can see logs while the job is running, but they disappear once the driver pod is removed
I do understand that the current implementation will not allow for this to happen (unless the log manager is rewritten to be pluggable?), but I want to spark some discussion about this topic, as it makes the user experience unpleasant.
Can we expect Ray to support external log storage in the future?