Couldn't reconnect to GCS server

Hi,

I’m using Ray 1.0.0 installed with pip, redis 3.4.1, and Python 3.8.5 (all in a conda env).

I have an Ubuntu 20.04.1 LTS server on GCS. I’m connected to the server through a VPN (OpenVPN).

I start Ray on the server as the head node, and I start my machine as a local node.

My VPN client redirects all traffic (all ports) to the server IP through the VPN. The server is pingable and I can reach the dashboard.

Locally, I run ray.init(address='auto', _redis_password='XXXXXX') from a Jupyter notebook. The kernel always dies with the message below.

THIS SETUP WAS WORKING LAST FRIDAY.
I have not changed a single line of my config files or startup code, nor have I updated any dependencies.

I always get the following message when I try to run ray.init, or even a simple command such as ray memory (from my local machine):

~$ ray memory
2020-12-21 17:07:11,690 INFO scripts.py:1317 -- Connecting to Ray instance at XX.XX.XX.XX:6379.
2020-12-21 17:07:11,691 INFO worker.py:633 -- Connecting to existing Ray cluster at address: XX.XX.XX.XX:6379
F1221 17:07:19.591818 7837 7837 service_based_gcs_client.cc:207] Couldn't reconnect to GCS server. The last attempted GCS server address was XX.XX.XX.XX:41419
*** Check failure stack trace: ***
@ 0x7fbb1900d6ed google::LogMessage::Fail()
@ 0x7fbb1900e84c google::LogMessage::SendToLog()
@ 0x7fbb1900d3c9 google::LogMessage::Flush()
@ 0x7fbb1900d5e1 google::LogMessage::~LogMessage()
@ 0x7fbb18fc4789 ray::RayLog::~RayLog()
@ 0x7fbb18d081ea ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
@ 0x7fbb18d082ef ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
@ 0x7fbb18d08491 ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
@ 0x7fbb18d0a801 ray::gcs::ServiceBasedGcsClient::Connect()
@ 0x7fbb18c8bed6 ray::CoreWorker::CoreWorker()
@ 0x7fbb18c8fc14 ray::CoreWorkerProcess::CreateWorker()
@ 0x7fbb18c90e82 ray::CoreWorkerProcess::CoreWorkerProcess()
@ 0x7fbb18c9184b ray::CoreWorkerProcess::Initialize()
@ 0x7fbb18bcf448 __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
@ 0x7fbb18bd0ba5 __pyx_tp_new_3ray_7_raylet_CoreWorker()
@ 0x55948e1fb6cd _PyObject_MakeTpCall
@ 0x55948e282f56 _PyEval_EvalFrameDefault
@ 0x55948e248f9f _PyEval_EvalCodeWithName
@ 0x55948e249943 _PyFunction_Vectorcall.localalias.355
@ 0x55948e1be11a _PyEval_EvalFrameDefault.cold.2790
@ 0x55948e248f9f _PyEval_EvalCodeWithName
@ 0x55948e249943 _PyFunction_Vectorcall.localalias.355
@ 0x55948e1be11a _PyEval_EvalFrameDefault.cold.2790
@ 0x55948e248a92 _PyEval_EvalCodeWithName
@ 0x55948e249943 _PyFunction_Vectorcall.localalias.355
@ 0x55948e1fb041 PyVectorcall_Call
@ 0x55948e28099b _PyEval_EvalFrameDefault
@ 0x55948e248a92 _PyEval_EvalCodeWithName
@ 0x55948e249943 _PyFunction_Vectorcall.localalias.355
@ 0x55948e249e79 method_vectorcall
@ 0x55948e1fb041 PyVectorcall_Call
@ 0x55948e28099b _PyEval_EvalFrameDefault
Aborted (core dumped)

I do not think the issue is caused by Ray directly (for this reason I did not post it as a GH issue), but I’d need some guidance, please.
Why is Ray even calling a GCS-specific function from service_based_gcs_client.cc? I am NOT using Ray code to start up a cluster. I am manually configuring my own Linux server and manually adding my machine as a local node. I can’t see how Ray “knows” that I am using GCS as my server provider, nor why it should use a GCS-specific function to connect to a (virtually-)local node.

Many thanks in advance

Hey a few clarifications:

  1. In this context, GCS refers to the global control store, a component of Ray. You can read more about it in the whitepaper.

  2. Commands like ray memory should be run from the head node (the machine you ran ray start --head on).

  3. Unfortunately, ray.init(address="auto") only works on the head node; it doesn’t perform a network scan. If you want to connect from a remote machine, you will need to call ray.init(address="xxx.xxx.xxx.xxx") and specify the actual IP address of the head node.
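
Since the fatal log line names a GCS server address on a different port (41419) than the Redis port (6379), one thing worth ruling out is whether both ports are actually reachable from the remote machine. Below is a minimal sketch using only the Python standard library. The helper names `port_is_reachable` and `check_head_ports` are made up for illustration and are not part of Ray, and a successful TCP connection only shows that the port is open, not that the Ray service behind it is healthy.

```python
import socket
from typing import Dict, Iterable


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_head_ports(host: str, ports: Iterable[int]) -> Dict[int, bool]:
    """Check several ports on the same host and report which ones accept connections."""
    return {port: port_is_reachable(host, port) for port in ports}


if __name__ == "__main__":
    # Placeholder address: replace with the head node's real IP.
    # 6379 and 41419 are the two ports that appear in the log above; the second
    # one is assumed to differ between head node restarts.
    head_ip = "127.0.0.1"
    for port, ok in check_head_ports(head_ip, [6379, 41419]).items():
        print(f"{head_ip}:{port} reachable: {ok}")
```

If 6379 is reachable but the port from the "last attempted GCS server address" message is not, that would point at the VPN or firewall rather than at Ray itself.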

Hi @Alex,

Thanks for the tips. Unfortunately I am still facing exactly the same issue.

If I run ray.init(address="xxx.xxx.xxx.xxx") without starting my local machine as a Ray node, I get:
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?

If I then start my local machine as a local node, I get the usual issue.

The weird thing is that, if I now run ray stop locally, and then try to run ray.init, I still get the connection issue, not the previous message.

It looks like something gets stuck…

ps aux | grep ray shows no local ray processes running.

In order to get the previous message again I need to stop/start the head node.

I am really lost…

Thanks