Dead head nodes selected in scheduling

How severely does this issue affect your experience of using Ray?

  • Medium: It makes completing my task significantly more difficult, but I can work around it.

Hi,

I am currently using Ray on a laptop, and I would like to keep persistent data about jobs, so I set up a Redis server to store GCS data. However, I started seeing messages like this when running jobs:

Job supervisor actor could not be scheduled: The actor is not schedulable: The node specified via NodeAffinitySchedulingStrategy doesn't exist any more or is infeasible, and soft=False was specified.

Inspecting the cluster data in the dashboard, I can see that previous executions leave head node records in the Cluster tab. It looks as if a dead head node is being selected because of the scheduling strategy used for the job supervisor actor (as picked in ray/python/ray/dashboard/modules/job/job_manager.py at commit 75c1469cf03e9e4c32b3f8681223170547b1e397 of ray-project/ray on GitHub).
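
To make the hypothesis concrete, here is a toy model in plain Python of what I suspect is happening. It is not Ray's scheduler code: NodeRecord, Unschedulable, place_with_affinity and the node IDs are all made up for illustration. It only assumes that the persisted node table still holds a dead head record from an earlier run, and that a hard (soft=False) affinity to that record cannot be satisfied.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeRecord:
    """Hypothetical stand-in for one node entry kept in the persisted GCS data."""
    node_id: str
    is_head: bool
    alive: bool


class Unschedulable(Exception):
    """Raised when a hard affinity target cannot be used."""


def place_with_affinity(nodes: List[NodeRecord], target_id: str, soft: bool) -> NodeRecord:
    """Toy node-affinity placement: use target_id if alive, else maybe fall back."""
    alive = [n for n in nodes if n.alive]
    for node in alive:
        if node.node_id == target_id:
            return node
    if soft and alive:
        # Soft affinity: any other live node is acceptable.
        return alive[0]
    raise Unschedulable(f"node {target_id} is dead or unknown and soft=False")


# Made-up node table: the head record from a previous run is still listed as dead.
nodes = [
    NodeRecord("head-run-1", is_head=True, alive=False),
    NodeRecord("head-run-2", is_head=True, alive=True),
]

# Picking "the first head record" without checking liveness returns the stale one.
stale_id = next(n.node_id for n in nodes if n.is_head)
# Filtering on liveness returns the head of the current run.
live_id = next(n.node_id for n in nodes if n.is_head and n.alive)

try:
    place_with_affinity(nodes, stale_id, soft=False)
    hard_result = "scheduled"
except Unschedulable:
    hard_result = "unschedulable"

print(stale_id, "->", hard_result)
print(live_id, "->", place_with_affinity(nodes, live_id, soft=False).node_id)
```

If this model matches what Ray does, the error would come from the head node lookup returning a stale record rather than from the affinity strategy itself.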

  1. Is this plausible? If so, is setting RAY_JOB_ALLOW_DRIVER_ON_WORKER_NODES_ENV_VAR to 1 a correct way to address this?
  2. Is there any way to set a node’s identity, so that the current head node is recognized as the same node that ran previously?

Thanks a lot in advance,

Javier