Node started with ssh is lost in a minute

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I start the cluster head via ray start --head on machine 10.24.10.24, and start a node on 10.24.10.25 via ssh (from 10.24.10.24):

ssh -t 10.24.10.25 "ray start --address=10.24.10.24:6379"

The command works and ray status shows two active nodes. But after about a minute the second node disappears, and ray status shows only 1 active node.

The network is fine. If I log in to 10.24.10.25 and run ray start --address=10.24.10.24:6379 directly, everything works: the node is not lost and the tasks run well.

Is there a problem with executing ray start over ssh?

More info: my servers run Ubuntu 20.04 with Ray 2.10.0.

I solved it with nohup:

ssh server_ip "nohup ray start --address $ip:$port"

Ugly, but it works. I don't know why, though.

Any process you start from an ssh session (or any shell) becomes a child process of the shell process itself. When the shell process exits, the child process gets a hang-up signal (SIGHUP) from the OS. nohup (literally ‘no hang up’) runs the process with that signal ignored, so it keeps running. You can read more about it in just about any article covering nohup, for example here.
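To make the nohup behaviour concrete, here is a minimal sketch that does not involve Ray at all. It uses sleep as a stand-in process and sends SIGHUP by hand with kill instead of closing a real ssh session, so it only illustrates the signal handling, not what sshd does on disconnect. The variable names are made up for the example, and it assumes the script itself is not already running with SIGHUP ignored.

```shell
#!/bin/sh
# Plain background process: the default action for SIGHUP is to terminate it.
sleep 10 &
plain_pid=$!

# Same command under nohup: SIGHUP is set to "ignore" before sleep is exec'd.
nohup sleep 10 >/dev/null 2>&1 &
nohup_pid=$!

sleep 1                                # give nohup time to start up
kill -HUP "$plain_pid" "$nohup_pid"    # simulate the hang-up signal
sleep 1

# wait reports how the plain process ended; a status above 128 means
# it was terminated by a signal.
wait "$plain_pid"
plain_status=$?

# kill -0 sends no signal; it only checks that the process still exists.
if kill -0 "$nohup_pid" 2>/dev/null; then nohup_alive=yes; else nohup_alive=no; fi

echo "plain sleep exit status: $plain_status"
echo "nohup sleep still alive: $nohup_alive"

kill "$nohup_pid" 2>/dev/null          # clean up the surviving process
```

The plain sleep should be reported as killed by a signal, while the one started under nohup should still be alive after the same SIGHUP.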

There’s nothing wrong with starting processes from a remote shell, but they will not restart if the remote machine itself gets restarted or if the process itself dies, for example because it runs out of memory. To what extent that is a problem for you depends on your needs. There are many ways to automatically start a process when the OS starts, again depending on your needs and what OS you are using.

I’m not sure about:

Any process you start from an ssh session (or any shell) becomes a child process to the shell process itself. When the shell process exits, child process gets an OS signal to hang up.

Here is my test:

If I start Ray from a shell, then close the shell and log out, the Ray node is not lost.