"ray start --head" succeeds but "ray status" cannot find any running Ray instance

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I ran ray start --head via Slurm; Slurm gave me a node and the output showed that Ray had started, as the picture below shows:

But then I ran ray status, and it showed me the error below:

I wrote ray start --head and ray status in a bash script and launched it with srun, so the two commands were run on the same node. And in the end, I could even run ray stop to stop the Ray process.

(Btw, sometimes the dashboard works, but sometimes it doesn’t.)

Can anyone help me?

Hi @zyc-bit, thanks for your question! Could you provide the relevant section of the bash script? It is interesting that ray stop works but ray status doesn’t work.

As for the dashboard only working sometimes, could you provide more details? Can you retry and get the dashboard, or do you have to redeploy?

Hi @cade, thank you very much for your reply.
Let me describe in detail how the above situation happened. (I am working on a project called Alpa, which requires Ray.)
I installed Ray with pip install ray. On the cluster I am using, there is one management node and many compute nodes. The management node has no GPU, while the compute nodes all have GPUs, so I use the compute nodes every time. The compute nodes are accessed via Slurm’s srun command.

TF_CPP_MIN_LOG_LEVEL=0 XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/platform/dep/cuda11.2-cudnn8.1.1" srun -p caif_dev --gres=gpu:1 -n1 bash test_install.sh

The srun command sends my test_install.sh script to the compute node to be executed. test_install.sh reads as follows:

ray start --head
ray status
echo "now running python script"
XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/platform/dep/cuda11.2-cudnn8.1.1" python /mnt/cache/zhangyuchang/alpa-project/alpa/tests/test_install.py
ray stop

The ray start --head command on the first line executes normally, but the ray status command on the second line fails and returns an error message like the one I posted above. The Python script on the fourth line also reports an error because there is no Ray instance. But finally, ray stop reports that it stopped successfully.
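(For reference, here is a minimal sketch of a variant of the script I could try, assuming the problem is either a race between ray start --head returning and the GCS becoming reachable, or the node’s network interfaces confusing address resolution. The RAY_HEAD_IP variable and the retry loop are my own additions and are not part of the original script; --node-ip-address, --port, and ray status --address are standard ray CLI options.)

#!/bin/bash
# Hypothetical variant of test_install.sh: pin the head node IP and retry `ray status`.
# Assumption: `hostname -i` returns an address that Ray can actually bind to on this cluster.
RAY_HEAD_IP=$(hostname -i | awk '{print $1}')

ray start --head --node-ip-address="$RAY_HEAD_IP" --port=6379

# Retry `ray status` a few times in case the GCS is not reachable immediately after `ray start` returns.
for i in 1 2 3 4 5; do
    if ray status --address="$RAY_HEAD_IP:6379"; then
        break
    fi
    echo "ray status attempt $i failed, retrying..."
    sleep 5
done

echo "now running python script"
XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/platform/dep/cuda11.2-cudnn8.1.1" python /mnt/cache/zhangyuchang/alpa-project/alpa/tests/test_install.py
ray stop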

As for the dashboard issue, as of now I’m not quite sure what it is or when I would use it myself. I’ve just noticed that a lot of the time it fails to start, but sometimes it succeeds. Maybe this has something to do with the cluster’s network? The dashboard errors are shown below:

Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
2022-05-25 09:53:03,159 ERROR services.py:1474 -- Failed to start the dashboard: Failed to start the dashboard
 The last 10 lines of /tmp/ray/session_2022-05-25_09-52-35_418229_12783/logs/dashboard.log:
2022-05-25 09:52:54,249 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule

2022-05-25 09:53:03,159 ERROR services.py:1475 -- Failed to start the dashboard
 The last 10 lines of /tmp/ray/session_2022-05-25_09-52-35_418229_12783/logs/dashboard.log:
2022-05-25 09:52:54,249 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
Traceback (most recent call last):
  File "/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/_private/services.py", line 1451, in start_dashboard
    raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard
 The last 10 lines of /tmp/ray/session_2022-05-25_09-52-35_418229_12783/logs/dashboard.log:
2022-05-25 09:52:54,249 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule

And sometimes it works:

That’s the whole story of how this problem happened.

Hi @cade, do you know the error below?

Hi @zyc-bit, I’m sorry I dropped this! Were you able to get past the issue? Is it related to your other more recent post Ray dashboard can not start?

Or maybe Raylet errors some worker have not registered within the timeout?

Hi cade,
Strictly speaking, I haven’t solved this problem completely. Occasionally I still run into it. When I do, I close the terminal and start over again, and that usually solves it. I think this may be caused by my use of Slurm.

Well… it happened again just now. I ran ray stop, then closed the terminal and started over again, and it worked.

The dashboard problem is still not solved.

On our Slurm cluster, we have a management node and compute nodes; the management node has no GPU, while the compute nodes have GPUs. On the management node I can start the dashboard, but on the compute nodes I cannot.

But since only the compute nodes have GPUs, I need to use a compute node as the Ray head node as well as a Ray worker node, and in this case I can’t start the dashboard.

This issue still blocks me.

Does the node on which you are running the Slurm script (ray start --head and ray status) have multiple network interfaces? I am guessing that could cause this issue. (This is being worked on here.)

To verify that this is the case, can you reproduce the issue (so that ray status crashes), then check each IP address the node has to see if the dashboard is available at port :8265? If you want help working on this, contact me on the Ray Slack (@cadedaniel) and we can work out a time to walk through it. The hypothesis is that the Ray dashboard server is bound to an IP that isn’t the one the Ray client expects.
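For example, a rough sketch of that check, assuming hostname -I lists the node’s addresses and curl is available on the compute node (both assumptions on my part):

# Hypothetical check: probe the dashboard port on every IP the node reports.
for ip in $(hostname -I); do
    echo "Checking http://$ip:8265 ..."
    # Prints the HTTP status code if something answers on that interface,
    # or falls through to the echo if the connection fails entirely (-m 3 caps each attempt at 3 seconds).
    curl -s -o /dev/null -w "%{http_code}\n" -m 3 "http://$ip:8265" || echo "no response on $ip"
done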

Yep, let’s work on that in the separate question.

Thank you, cade, for helping me.

Yes, the node does have multiple network interfaces.

I will try to follow your hints, and if I reproduce the issue, I will ping you on the Ray Slack.

@zyc-bit did it work? And can you mark this as solved if it did?


Yeah, thanks for reminding me.