Hi there,
I’m trying to start a Ray cluster manually, with my local machine as the head node and a server as a worker node.
I ran this to start the head node on my local machine:
$ ray start --head --port=6379
Local node IP: <LOCAL_IP>
2021-04-22 17:02:04,277 INFO services.py:1264 -- View the Ray dashboard at http://127.0.0.1:8265
--------------------
Ray runtime started.
--------------------
Next steps
To connect to this Ray runtime from another node, run
ray start --address='<LOCAL_IP>:6379' --redis-password='.....'
Then on my server I ran:
$ ray start --address='<HEAD_NODE_REMOTE_IP>:6379' --redis-password='....'
Local node IP: <LOCAL_NODE_IP>
[2021-04-22 17:10:46,570 C 11004 11004] service_based_gcs_client.cc:228: Couldn't reconnect to GCS server. The last attempted GCS server address was <HEAD_NODE_LOCAL_IP>:39103
*** StackTrace Information ***
@ 0x7fad917d0b55 google::GetStackTraceToString()
@ 0x7fad9179f7fe ray::GetCallTrace()
@ 0x7fad917c47c4 ray::SpdLogMessage::Flush()
@ 0x7fad917c493d ray::RayLog::~RayLog()
@ 0x7fad9145f5af ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
@ 0x7fad9145f6c5 ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
@ 0x7fad9145f83b ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
@ 0x7fad9177db04 ray::PeriodicalRunner::DoRunFnPeriodically()
@ 0x7fad9177e4cf ray::PeriodicalRunner::RunFnPeriodically()
@ 0x7fad91460ffe ray::gcs::ServiceBasedGcsClient::Connect()
@ 0x7fad9131e84d ray::gcs::GlobalStateAccessor::Connect()
@ 0x7fad9125882b __pyx_pw_3ray_7_raylet_19GlobalStateAccessor_3connect()
@ 0x50a561 (unknown)
@ 0x50bf44 _PyEval_EvalFrameDefault
@ 0x507cd4 (unknown)
@ 0x509a00 (unknown)
@ 0x50a3fd (unknown)
@ 0x50bf44 _PyEval_EvalFrameDefault
@ 0x5096c8 (unknown)
@ 0x50a3fd (unknown)
@ 0x50bf44 _PyEval_EvalFrameDefault
@ 0x5096c8 (unknown)
@ 0x50a3fd (unknown)
@ 0x50bf44 _PyEval_EvalFrameDefault
@ 0x507cd4 (unknown)
@ 0x509a00 (unknown)
@ 0x50a3fd (unknown)
@ 0x50cd15 _PyEval_EvalFrameDefault
@ 0x507cd4 (unknown)
@ 0x509a00 (unknown)
@ 0x50a3fd (unknown)
@ 0x50cd15 _PyEval_EvalFrameDefault
The only thing I can spot (bearing in mind I’m completely new to using Ray clusters) is that the worker is trying to reach the GCS server at my head node’s local IP, which won’t be connectable from the remote server. Do I need to expose the GCS port 39103 as well? The GCS server is running on my local machine:
$ lsof -i :39103
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
gcs_serve 73754 jippo 21u IPv6 1300772 0t0 TCP *:39103 (LISTEN)
gcs_serve 73754 jippo 26u IPv6 1300100 0t0 TCP minj.local:39103->minj.local:52456 (ESTABLISHED)
gcs_serve 73754 jippo 29u IPv6 1300395 0t0 TCP minj.local:39103->minj.local:52468 (ESTABLISHED)
gcs_serve 73754 jippo 32u IPv6 1298132 0t0 TCP minj.local:39103->minj.local:52488 (ESTABLISHED)
gcs_serve 73754 jippo 78u IPv6 1406630 0t0 TCP minj.local:39103->minj.local:58434 (ESTABLISHED)
/home/jip 73755 jippo 25u IPv6 1298101 0t0 TCP minj.local:52456->minj.local:39103 (ESTABLISHED)
/home/jip 73771 jippo 26u IPv6 1407754 0t0 TCP minj.local:58434->minj.local:39103 (ESTABLISHED)
/home/jip 73771 jippo 31u IPv6 1300816 0t0 TCP minj.local:52468->minj.local:39103 (ESTABLISHED)
raylet 73804 jippo 14u IPv6 1298131 0t0 TCP minj.local:52488->minj.local:39103 (ESTABLISHED)
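To narrow this down, a plain TCP connect test run from the server could show whether ports 6379 and 39103 on the head node are reachable at all. The following is only a minimal sketch, assuming Python is available on the worker. `can_connect` is a hypothetical helper name, not part of Ray, and the host is passed on the command line as whichever head-node address the worker should use.

```python
import socket
import sys


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: the first argument is the head node address as seen
    # from the worker; it defaults to localhost for a quick self-check.
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    # Ports taken from the output above: Redis (6379) and the GCS server (39103).
    for port in (6379, 39103):
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} is {status}")
```

If 6379 is reachable but 39103 is not, that would be consistent with the GCS port not being exposed; if neither is reachable, the problem is more likely general connectivity between the two machines.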
Any help on this would be really appreciated.
Cheers,
Rory
p.s. related to: Collect samples on a remote server train on local - #2 by sven1977