Unable to manually start ray cluster

Hi there,

I’m trying to start a ray cluster manually with my local machine as the head node and a server as a worker node.

I ran this on my local machine to start the head node:

$ ray start --head --port=6379
Local node IP: <LOCAL_IP>
2021-04-22 17:02:04,277 INFO services.py:1264 -- View the Ray dashboard at http://127.0.0.1:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='<LOCAL_IP>:6379' --redis-password='.....'

Then on my server I ran

$ ray start --address='<HEAD_NODE_REMOTE_IP>:6379' --redis-password='....'
Local node IP: <LOCAL_NODE_IP>
[2021-04-22 17:10:46,570 C 11004 11004] service_based_gcs_client.cc:228: Couldn't reconnect to GCS server. The last attempted GCS server address was <HEAD_NODE_LOCAL_IP>:39103
*** StackTrace Information ***
    @     0x7fad917d0b55  google::GetStackTraceToString()
    @     0x7fad9179f7fe  ray::GetCallTrace()
    @     0x7fad917c47c4  ray::SpdLogMessage::Flush()
    @     0x7fad917c493d  ray::RayLog::~RayLog()
    @     0x7fad9145f5af  ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
    @     0x7fad9145f6c5  ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
    @     0x7fad9145f83b  ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
    @     0x7fad9177db04  ray::PeriodicalRunner::DoRunFnPeriodically()
    @     0x7fad9177e4cf  ray::PeriodicalRunner::RunFnPeriodically()
    @     0x7fad91460ffe  ray::gcs::ServiceBasedGcsClient::Connect()
    @     0x7fad9131e84d  ray::gcs::GlobalStateAccessor::Connect()
    @     0x7fad9125882b  __pyx_pw_3ray_7_raylet_19GlobalStateAccessor_3connect()
    @           0x50a561  (unknown)
    @           0x50bf44  _PyEval_EvalFrameDefault
    @           0x507cd4  (unknown)
    @           0x509a00  (unknown)
    @           0x50a3fd  (unknown)
    @           0x50bf44  _PyEval_EvalFrameDefault
    @           0x5096c8  (unknown)
    @           0x50a3fd  (unknown)
    @           0x50bf44  _PyEval_EvalFrameDefault
    @           0x5096c8  (unknown)
    @           0x50a3fd  (unknown)
    @           0x50bf44  _PyEval_EvalFrameDefault
    @           0x507cd4  (unknown)
    @           0x509a00  (unknown)
    @           0x50a3fd  (unknown)
    @           0x50cd15  _PyEval_EvalFrameDefault
    @           0x507cd4  (unknown)
    @           0x509a00  (unknown)
    @           0x50a3fd  (unknown)
    @           0x50cd15  _PyEval_EvalFrameDefault

The only thing I can spot (bearing in mind I’m completely new to using Ray clusters) is that the GCS server address points to my local IP, which won’t be reachable from the remote server. Do I need to expose the GCS port 39103 as well? The GCS server is running on my local machine:

$ lsof -i :39103  
COMMAND     PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
gcs_serve 73754 jippo   21u  IPv6 1300772      0t0  TCP *:39103 (LISTEN)
gcs_serve 73754 jippo   26u  IPv6 1300100      0t0  TCP minj.local:39103->minj.local:52456 (ESTABLISHED)
gcs_serve 73754 jippo   29u  IPv6 1300395      0t0  TCP minj.local:39103->minj.local:52468 (ESTABLISHED)
gcs_serve 73754 jippo   32u  IPv6 1298132      0t0  TCP minj.local:39103->minj.local:52488 (ESTABLISHED)
gcs_serve 73754 jippo   78u  IPv6 1406630      0t0  TCP minj.local:39103->minj.local:58434 (ESTABLISHED)
/home/jip 73755 jippo   25u  IPv6 1298101      0t0  TCP minj.local:52456->minj.local:39103 (ESTABLISHED)
/home/jip 73771 jippo   26u  IPv6 1407754      0t0  TCP minj.local:58434->minj.local:39103 (ESTABLISHED)
/home/jip 73771 jippo   31u  IPv6 1300816      0t0  TCP minj.local:52468->minj.local:39103 (ESTABLISHED)
raylet    73804 jippo   14u  IPv6 1298131      0t0  TCP minj.local:52488->minj.local:39103 (ESTABLISHED)
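
As a rough sanity check on the suspicion above, the address from the error message can be tested to see whether it is a private, loopback, or link-local IP, which a machine on another network generally cannot reach directly. This is only a sketch using Python's standard `ipaddress` module; the helper name `reachable_from_other_networks` and the sample addresses are made up for illustration and are not part of Ray.

```python
import ipaddress


def reachable_from_other_networks(address: str) -> bool:
    """Heuristic: return False for private, loopback, or link-local IPs.

    `address` is a "host:port" string such as the one printed in the
    "last attempted GCS server address" message. Only IPv4 literals are
    handled in this sketch.
    """
    host, _, _port = address.rpartition(":")
    ip = ipaddress.ip_address(host)
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)


# Hypothetical values: substitute the address from your own log output.
print(reachable_from_other_networks("192.168.1.20:39103"))
```

If this returns False for the advertised GCS address, the worker is being told to connect to an IP that only makes sense inside the head node's own network.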

Any help on this would be really appreciated 🙂

Cheers,

Rory

p.s. related to: Collect samples on a remote server train on local - #2 by sven1977

Okay, so I’m able to get it running on a local network. I guess that means the problem is that the right ports aren’t exposed. All I could tell from the docs is that the Redis port needs to be open, which it is. Can anyone point me in the right direction on this?
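
One way to narrow this down is to run a plain TCP check from the worker machine against each port the head node is listening on. The sketch below uses only Python's standard `socket` module; `can_connect` and `check_ports` are hypothetical helper names, and the port list is an assumption based on the two ports that appear in this thread (6379 for Redis, 39103 for the GCS server). Ray may use other ports as well, depending on version and configuration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports: list[int], timeout: float = 3.0) -> dict[int, bool]:
    """Map each port to whether it accepted a TCP connection."""
    return {port: can_connect(host, port, timeout) for port in ports}


# Example usage, run on the worker node (replace the placeholder with the
# head node's externally visible IP):
#   results = check_ports("<HEAD_NODE_REMOTE_IP>", [6379, 39103])
#   for port, ok in results.items():
#       print(port, "open" if ok else "blocked or closed")
```

A port that shows as open on the local network but blocked from the remote server points at a firewall or NAT rule rather than at Ray itself.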

I sympathize with you; networks are challenging. Your question does land one in a big world of networking “what about X” questions. To me, running a Ray cluster that spans networks sounds like more trouble than it’s worth. I would pursue putting all the Ray nodes in the same place.