Ray head node stops responding

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

Hello,

I have a Ray cluster (v2.9 + Python 3.8) running on AKS that, over time (usually after a few weeks), stops responding to job submissions.
In that state:

  • The head node logs stop updating.
  • Running ray status within the cluster keeps returning the same, stale output.
  • The autoscaler stops working; no scale-up/down actions are performed.
  • The cluster "accepts" jobs, but they stay stuck in the "Pending" state forever. The job status returns Job has not started yet. It may be waiting for resources (CPUs, GPUs, memory, custom resources) to become available. It may be waiting for the runtime environment to be set up. However, those jobs generate no logs.
  • The operator pod seems to keep working fine (its logs keep updating, etc.). A quick way to confirm the hang from a fresh process is sketched right after this list.
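
This is roughly the check I run to confirm the hang (a minimal sketch using the Ray state API, run from inside the head pod so the local cluster resolves automatically; nothing here is specific to my setup):

```python
# Minimal sketch: probe the cluster from a fresh process inside the head pod.
# If the GCS/API server is hung, these calls time out instead of answering;
# if they answer, they show the head node's (possibly stale) view.
from ray.util.state import list_jobs, list_nodes

for node in list_nodes(timeout=10):
    print(node.node_id, node.state)

for job in list_jobs(timeout=10):
    print(job.submission_id, job.status)
```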

Restarting the head node and re-submitting the job works as a workaround but is quite disruptive.

Some more info:

Memory seems OK
--- Aggregate object store stats across all nodes ---
Plasma memory usage 0 MiB, 1 objects, 0.0% full, 0.0% needed
Objects consumed by Ray tasks: 0 MiB.

Timeline export/import fails in Chrome with:
Error: Couldn't create an importer for the provided eventData....

Any suggestions on how to debug the issue would be much appreciated. Thank you!

Can you show a snapshot of the Ray Dashboard for this long-running Ray cluster? Also curious as to why you need it to be long-running (instead of a fresh cluster per job).

Regarding the UI screenshots: I will have to wait for the cluster to get into that state again before I can provide those, so I will report back later. In the meantime, is there any other information I could collect when it happens?

Regarding the long-running approach: I have a small (CPU-only) head node that accepts jobs from multiple streams/sources and scales GPU workers up/down (roughly the pattern sketched below). In terms of infra management, this was easier than asking everyone to maintain their own ("on-demand") cluster instances. Apart from the issue described above, it works out well in terms of resources (cost/time). That said, do let me know if there are best-practice references I missed!
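
For context, the submission side looks roughly like this (a simplified sketch; the address, entrypoint, and GPU count are placeholders, not the exact production code):

```python
# Simplified sketch of the shared-cluster pattern: multiple streams/sources
# submit jobs to the same long-running head node via the Ray Jobs SDK.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://<head-node>:8265")  # placeholder address

submission_id = client.submit_job(
    entrypoint="python train.py",           # placeholder entrypoint
    runtime_env={"working_dir": "./job"},   # per-job code/dependencies
    entrypoint_num_gpus=1,                  # GPU demand that drives scale-up
)
print("submitted:", submission_id)
```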

Got the cluster stuck again; here are some screenshots:

Cluster tab:

Jobs tab:


The last two jobs are in the “stuck” state:

  • The one in the "Running" state was an "interactive" job started via ray.init(...)
  • The one in the "Pending" state was submitted via ray job submit ...
  • Neither is executing any code or triggering any autoscaler actions. Both submission paths are sketched below.
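
For reference, the two paths look roughly like this (addresses are placeholders; the first uses Ray Client on its default port 10001, the second goes through the Jobs API):

```python
# Path 1: "interactive" job over Ray Client, i.e. ray.init() against the
# head node's client server (default port 10001; address is a placeholder).
import ray

ray.init("ray://<head-node>:10001")

@ray.remote(num_gpus=1)
def work():
    return "done"

# In the stuck state this call blocks forever and no GPU worker is scaled up.
print(ray.get(work.remote()))

# Path 2 was the CLI, roughly:
#   ray job submit --address http://<head-node>:8265 -- python script.py
```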

Stuck job:

Running ray job status ... on a stuck job returns: Job has not started yet. It may be waiting for resources (CPUs, GPUs, memory, custom resources) to become available. It may be waiting for the runtime environment to be set up.
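
The same message shows up via the Jobs SDK; a minimal sketch, with a placeholder address and submission ID:

```python
# Rough SDK equivalent of `ray job status`; in the stuck state the job
# sits in PENDING with the "has not started yet" message indefinitely.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://<head-node>:8265")  # placeholder
info = client.get_job_info("raysubmit_XXXXXXXX")         # placeholder ID
print(info.status, "-", info.message)
```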

However, the head node logs on Kubernetes stopped updating about 20 hours ago, with the last entries being:

autoscaler Resources
autoscaler ---------------------------------------------------------------
autoscaler Usage:
autoscaler  0.0/5.0 CPU
autoscaler  0B/16.76GiB memory
autoscaler  0B/4.86GiB object_store_memory
autoscaler
autoscaler Demands:
autoscaler  (no resource demands)
autoscaler 2024-09-26 13:20:24,394    INFO autoscaler.py:469 -- The autoscaler took 0.114 seconds to complete the update iteration.
autoscaler 2024-09-26 13:20:29,524    INFO node_provider.py:240 -- Listing pods for RayCluster raycluster-kuberay in namespace ray at pods resource version >= 344523484.
autoscaler 2024-09-26 13:20:29,567    INFO node_provider.py:258 -- Fetched pod data at resource version 344638677.
autoscaler 2024-09-26 13:20:29,567    INFO autoscaler.py:146 -- The autoscaler took 0.113 seconds to fetch the list of non-terminated nodes.
autoscaler 2024-09-26 13:20:29,568    INFO autoscaler.py:426 --
autoscaler ======== Autoscaler status: 2024-09-26 13:20:29.568064 ========
autoscaler Node status
autoscaler ---------------------------------------------------------------
autoscaler Active:
autoscaler  1 head-group
autoscaler  1 CPUx1.2GBRAM
autoscaler Pending:
autoscaler  (no pending nodes)
autoscaler Recent failures:
autoscaler  (no failures)
autoscaler
autoscaler Resources
autoscaler ---------------------------------------------------------------
autoscaler Usage:
autoscaler  0.0/5.0 CPU
autoscaler  0B/16.76GiB memory
autoscaler  0B/4.86GiB object_store_memory
autoscaler
autoscaler Demands:
autoscaler  (no resource demands)
autoscaler 2024-09-26 13:20:29,568    INFO autoscaler.py:469 -- The autoscaler took 0.114 seconds to complete the update iteration.

Running ray status in this state returns the same status shown in the Kubernetes logs above (i.e., output that is ~20-odd hours old).
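
One more data point on what is and isn't responsive in this state: the dashboard's HTTP API still seems to answer (jobs are accepted and report a status); it's the scheduling behind it that's wedged. A minimal sketch of the probe (placeholder address; this uses the documented Jobs REST endpoint /api/jobs/):

```python
# Sketch: hit the dashboard's Jobs REST API directly to separate
# "HTTP API server down" from "API up but scheduling wedged".
import requests

resp = requests.get("http://<head-node>:8265/api/jobs/", timeout=10)
print(resp.status_code)
for job in resp.json():
    print(job.get("submission_id"), job.get("status"))
```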

FWIW, I updated to Ray 2.36 / Python 3.10 / KubeRay 1.2.2 and the issue remains.