I put ray start --head and ray status in a bash script and launched it with srun, so the two commands ran on the same node. And in the end, I could even run ray stop to stop the running process.
(Btw, sometimes the dashboard works, but sometimes it doesn’t.)
Hi @zyc-bit, thanks for your question! Could you provide the relevant section of the bash script? It is interesting that ray stop works but ray status doesn’t work.
As for the dashboard only working sometimes, could you provide more details? Can you retry and get the dashboard, or do you have to redeploy?
Hi @cade, thank you very much for your reply.
Let me describe in detail how the above situation happened. (I am working with a project called Alpa, which requires Ray.)
I installed Ray with pip install ray. On the cluster I am using, there is one management node and many compute nodes. The management node has no GPU, while the compute nodes all have GPUs, so I use the compute nodes every time. Jobs are sent to the compute nodes via Slurm’s srun command.
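For context, the launch looks roughly like the following (the partition name and resource flags here are placeholders, not my exact values):

# Hypothetical srun invocation; replace the partition and GPU count with real values.
srun -p gpu-partition --nodes=1 --gres=gpu:1 bash test_install.sh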
The srun command sends my test_install.sh script to a compute node to be executed. test_install.sh reads as follows:
ray start --head
ray status
echo "now running python script"
XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/platform/dep/cuda11.2-cudnn8.1.1" python /mnt/cache/zhangyuchang/alpa-project/alpa/tests/test_install.py
ray stop
The ray start --head command in the first line executes normally, but the ray status command in the second line fails and returns an error message like the one I posted above. The Python script in the fourth line also reports an error because it cannot find a running Ray instance. Yet the final ray stop reports that it stopped successfully.
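One thing I have not tried yet, and am only guessing might matter, is pointing ray status at the head’s address explicitly instead of letting it auto-detect:

# Guess: pass the head's address explicitly; use whatever address ray start printed.
ray status --address=127.0.0.1:6379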
As for the dashboard issue, as of now I’m not quite sure what it is or when I would use it myself. I’ve just noticed that it often fails to start, but occasionally it succeeds. Maybe this has something to do with the cluster’s network? The dashboard error is shown below:
Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
2022-05-25 09:53:03,159 ERROR services.py:1474 -- Failed to start the dashboard: Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-25_09-52-35_418229_12783/logs/dashboard.log:
2022-05-25 09:52:54,249 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
2022-05-25 09:53:03,159 ERROR services.py:1475 -- Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-25_09-52-35_418229_12783/logs/dashboard.log:
2022-05-25 09:52:54,249 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
Traceback (most recent call last):
File "/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/_private/services.py", line 1451, in start_dashboard
raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-25_09-52-35_418229_12783/logs/dashboard.log:
2022-05-25 09:52:54,249 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
And sometimes, it works.
That’s the whole story of how this problem happened.
Hi @cade,
Strictly speaking, I didn’t solve this problem completely; I still run into it occasionally. When that happens, I close the terminal and start over again, and that usually resolves it. I suspect this may be caused by my use of Slurm.
Well… it happened again just now. I ran ray stop, then closed the terminal and started over again, and it worked.
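For completeness, my recovery steps are roughly the following (whether the --force variant is actually needed is just my guess):

# Stop the leftover Ray processes on the node before starting over.
ray stop
# ray stop --force   # my assumption: use this if a plain stop leaves processes behind

Then I close the terminal, get a new srun allocation, and run the script again.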
On our Slurm cluster, we have a management node and compute nodes; the management node has no GPU, and the compute nodes have GPUs. On the management node I can start the dashboard, but on the compute nodes I cannot.
But since only the compute nodes have GPUs, I need to use a compute node as the Ray head node as well as a Ray worker node, and in that case I can’t start the dashboard.
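Since I don’t actually use the dashboard, one workaround I am considering (not yet verified on this cluster) is starting the head without it:

# Start the head node without the dashboard at all.
ray start --head --include-dashboard=false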
Does the node on which you are running the Slurm script (ray start --head and ray status) have multiple network interfaces? I am guessing that could cause this issue. (This is being worked on here.)
To verify that this is the case, can you reproduce the issue (so ray status crashes), then check each IP address the node has to see if the dashboard is available at port :8265? If you want help working on this, contact me on the Ray Slack @cadedaniel and we can work out a time to walk through it. The hypothesis is that the Ray dashboard server is bound to an IP that isn’t the one the Ray client expects.
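One minimal way to check this, assuming curl is available on the compute node:

# Probe the dashboard port on every IPv4 address the node reports.
for ip in $(hostname -I); do
  echo "checking http://$ip:8265"
  curl -s -o /dev/null -w "%{http_code}\n" --max-time 3 "http://$ip:8265" || echo "unreachable"
done

If none of the addresses respond, the dashboard likely never came up; if one of them does, it is bound to an address the client isn’t looking at.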