How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
Hi, I am trying to set up a Ray cluster, but it seems to hang at some point in the process. The `ray up cluster.yaml` command seems to run perfectly fine, but then one of two things happens:
- `ray status` returns "No cluster status. It may take a few seconds for the Ray internal services to start up." and the cluster only launches the head node.
- The status command returns:
Node status
---------------------------------------------------------------
Active:
1 local.cluster.node
Pending:
local.cluster.node, 7 launching
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/16.0 CPU
0.0/1.0 GPU
0B/36.23GiB memory
0B/18.12GiB object_store_memory
The dashboard seems to work fine in both cases, and I could not find anything in the logs.
Also, whatever happens, the `ray monitor` command hangs, and `ray down` hangs too until I run `ray stop --force`.
My cluster YAML file is basically the example one without Docker. All my machines run Ubuntu with a conda base environment, Python 3.12.8, and Ray 2.42.
I have sometimes been able to launch a cluster with only 2 workers instead of 10 but that was not reproducible.
Edit:
I have tested launching it in two steps:
- `ray up cluster.yaml` → The status shows "launching" forever (the nodes do not get launched).
- Manually launching the nodes using `ray start` directly on the machines. This works: I can see the new nodes on the dashboard, but `ray status` still shows the "launching" state and does not show the new nodes, whether they were in the initial config file or not.
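
One way to cross-check which nodes have actually registered with the head, independently of what `ray status` reports, is to query the cluster from Python with `ray.nodes()`. The snippet below is only a minimal sketch: the helper names `summarize_nodes` and `fetch_and_summarize` are hypothetical, and it assumes it is run on the head node with the cluster already started.

```python
def summarize_nodes(nodes):
    """Split ray.nodes()-style dicts into sorted alive and dead address lists."""
    alive = sorted(n["NodeManagerAddress"] for n in nodes if n.get("Alive"))
    dead = sorted(n["NodeManagerAddress"] for n in nodes if not n.get("Alive"))
    return alive, dead


def fetch_and_summarize():
    """Attach to the running cluster and print the nodes the head knows about."""
    import ray  # imported lazily so summarize_nodes works without Ray installed

    # address="auto" connects to an existing cluster instead of starting a new one.
    ray.init(address="auto")
    alive, dead = summarize_nodes(ray.nodes())
    print(f"{len(alive)} alive: {alive}")
    print(f"{len(dead)} dead: {dead}")
    return alive, dead
```

Calling `fetch_and_summarize()` on the head node after the manual `ray start` step would show whether the manually started workers are registered with the head even while `ray status` still lists them as launching.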