I put ray start --head and ray status in a bash script and launched it with srun, so the two commands ran on the same node. And in the end, I could even run ray stop to stop the running process.
(Btw, sometimes the dashboard works, but sometimes it doesn’t.)
Hi @zyc-bit, thanks for your question! Could you provide the relevant section of the bash script? It is interesting that ray stop works but ray status doesn’t work.
As for the dashboard only working sometimes, could you provide more details? Can you retry and get the dashboard, or do you have to redeploy?
Hi @cade, thank you very much for your reply.
Let me describe in detail how the above situation happened. (I am working with a project called Alpa, which requires Ray.)
I installed Ray with pip install ray. On the cluster I am using, there is one management node and many compute nodes. The management node has no GPU, while the compute nodes all have GPUs, so I use the compute nodes every time. Jobs are sent to the compute nodes via Slurm’s srun command.
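For context, the launch looks roughly like the following (the partition name and resource flags here are placeholders, not my exact values):

# Hypothetical srun invocation; replace the partition and GPU count with real values.
srun -p gpu-partition --nodes=1 --gres=gpu:1 bash test_install.sh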
The srun command sends my test_install.sh script to a compute node to be executed. test_install.sh reads as follows:
ray start --head
ray status
echo "now running python script"
XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/platform/dep/cuda11.2-cudnn8.1.1" python /mnt/cache/zhangyuchang/alpa-project/alpa/tests/test_install.py
ray stop
The ray start --head command in the first line executes normally, but the ray status command in the second line fails and returns an error message like the one I posted above. The Python script in the fourth line also reports an error because it cannot find a running Ray instance. Yet the final ray stop reports that it stopped successfully.
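One thing I have not tried yet, and am only guessing might matter, is pointing ray status at the head’s address explicitly instead of letting it auto-detect:

# Guess: pass the head's address explicitly; use whatever address ray start printed.
ray status --address=127.0.0.1:6379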
As for the dashboard issue, as of now I’m not quite sure what it is or when I would use it myself. I’ve just noticed that it often fails to start, but occasionally it succeeds. Maybe this has something to do with the cluster’s network? The dashboard error is shown below:
Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
2022-05-25 09:53:03,159 ERROR services.py:1474 -- Failed to start the dashboard: Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-25_09-52-35_418229_12783/logs/dashboard.log:
2022-05-25 09:52:54,249 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
2022-05-25 09:53:03,159 ERROR services.py:1475 -- Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-25_09-52-35_418229_12783/logs/dashboard.log:
2022-05-25 09:52:54,249 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
Traceback (most recent call last):
File "/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/_private/services.py", line 1451, in start_dashboard
raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-25_09-52-35_418229_12783/logs/dashboard.log:
2022-05-25 09:52:54,249 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
And sometimes, it works.
That’s the whole story of how this problem happened.
Hi @cade,
Strictly speaking, I didn’t solve this problem completely; I still run into it occasionally. When that happens, I close the terminal and start over again, and that usually resolves it. I suspect this may be caused by my use of Slurm.
Well… it happened again just now. I ran ray stop, then closed the terminal and started over again, and it worked.
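For completeness, my recovery steps are roughly the following (whether the --force variant is actually needed is just my guess):

# Stop the leftover Ray processes on the node before starting over.
ray stop
# ray stop --force   # my assumption: use this if a plain stop leaves processes behind

Then I close the terminal, get a new srun allocation, and run the script again.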
On our Slurm cluster, we have a management node and compute nodes; the management node has no GPU, and the compute nodes have GPUs. On the management node I can start the dashboard, but on the compute nodes I cannot.
But since only the compute nodes have GPUs, I need to use a compute node as the Ray head node as well as a Ray worker node, and in that case I can’t start the dashboard.
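Since I don’t actually use the dashboard, one workaround I am considering (not yet verified on this cluster) is starting the head without it:

# Start the head node without the dashboard at all.
ray start --head --include-dashboard=false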
Does the node on which you are running the Slurm script (ray start --head and ray status) have multiple network interfaces? I am guessing that could cause this issue. (This is being worked on here.)
To verify that this is the case, can you reproduce the issue (so ray status crashes), then check each IP address the node has to see if the dashboard is available at port :8265? If you want help working on this, contact me on the Ray Slack @cadedaniel and we can work out a time to walk through it. The hypothesis is that the Ray dashboard server is bound to an IP that isn’t the one the Ray client expects.
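One minimal way to check this, assuming curl is available on the compute node:

# Probe the dashboard port on every IPv4 address the node reports.
for ip in $(hostname -I); do
  echo "checking http://$ip:8265"
  curl -s -o /dev/null -w "%{http_code}\n" --max-time 3 "http://$ip:8265" || echo "unreachable"
done

If none of the addresses respond, the dashboard likely never came up; if one of them does, it is bound to an address the client isn’t looking at.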