Dead head nodes selected in scheduling

javiermtorres · February 5, 2025, 8:01am

How severely does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty in completing my task, but I can work around it.

Hi,

I am currently using Ray on a laptop, and I would like to keep persistent data about jobs, so I set up a Redis server to store GCS data. However, I started seeing messages like this when running jobs:

Job supervisor actor could not be scheduled: The actor is not schedulable: The node specified via NodeAffinitySchedulingStrategy doesn't exist any more or is infeasible, and soft=False was specified.

Inspecting the cluster data in the dashboard, I see that previous executions leave head node records in the Cluster tab. It seems that a dead head node is being selected because of the scheduling strategy used (as set in ray/python/ray/dashboard/modules/job/job_manager.py at 75c1469cf03e9e4c32b3f8681223170547b1e397 · ray-project/ray · GitHub).
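
To make the suspected failure mode concrete, here is a minimal toy model in plain Python. It is not Ray's actual scheduler code: the `Node` class, the `pick_node` function, and the node IDs are all hypothetical. The only behavior taken from the error message is that a hard affinity (soft=False) to a node that no longer exists cannot be scheduled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: str
    alive: bool


def pick_node(nodes: List[Node], target_id: str, soft: bool) -> Optional[str]:
    """Toy model of node-affinity scheduling (assumption, not Ray's real logic)."""
    by_id = {n.node_id: n for n in nodes}
    target = by_id.get(target_id)
    if target is not None and target.alive:
        # The requested node is alive, so the affinity is satisfied.
        return target.node_id
    if soft:
        # Soft affinity: fall back to any live node.
        for n in nodes:
            if n.alive:
                return n.node_id
    # Hard affinity to a dead or unknown node, or no live node at all.
    return None


# Hypothetical cluster state: a stale head record persisted from a previous
# run, plus the head node of the current run.
nodes = [Node("old-head", alive=False), Node("new-head", alive=True)]

hard_result = pick_node(nodes, "old-head", soft=False)
soft_result = pick_node(nodes, "old-head", soft=True)
print(hard_result, soft_result)
```

In this model the hard-affinity request to the stale record is unschedulable, while the same request with soft affinity lands on the live head node.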

Is this plausible? If so, is setting RAY_JOB_ALLOW_DRIVER_ON_WORKER_NODES_ENV_VAR to 1 a correct way to address this?
Is there any way of setting a node’s identity, to signal that the current head node is the same node as the one that ran previously?

Thanks a lot in advance,

Javier

jjyao · February 16, 2025, 5:33am

Hi @javiermtorres, currently GCS FT only supports Ray Serve and KubeRay RayServices. It doesn’t support Jobs yet: GCS Fault Tolerance — Ray 2.42.1
